[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139147#comment-15139147 ]

Jeremiah Jordan commented on CASSANDRA-9231:

I think we probably have other issues to solve besides CASSANDRA-9754 for multi-GB partitions to be viable? Are you not still going to have operational issues around repairing and compacting them?

> Support Routing Key as part of Partition Key
>
>                 Key: CASSANDRA-9231
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9231
>             Project: Cassandra
>          Issue Type: Wish
>            Reporter: Matthias Broecheler
>
> Provide support for sub-dividing the partition key into a routing key and a
> non-routing key component. Currently, all columns that make up the partition
> key of the primary key are also routing keys, i.e. they determine which nodes
> store the data. This proposal would give the data modeler the ability to
> designate only a subset of the columns that comprise the partition key to be
> routing keys. The non-routing key columns of the partition key identify the
> partition but are not used to determine where to store the data.
> Consider the following example table definition:
> CREATE TABLE foo (
>   a int,
>   b int,
>   c int,
>   d int,
>   PRIMARY KEY (([a], b), c )
> );
> (a,b) is the partition key, c is the clustering key, and d is just a column.
> In addition, the square brackets identify the routing key as column a. This
> means that only the value of column a is used to determine the node for data
> placement (i.e. only the value of column a is murmur3 hashed to compute the
> token). In addition, column b is needed to identify the partition but does
> not influence the placement.
> This has the benefit that all rows with the same routing key (but potentially
> different non-routing key columns of the partition key) are stored on the
> same node and that knowledge of such co-locality can be exploited by
> applications built on top of Cassandra.
> Currently, the only way to achieve co-locality is within a partition.
> However, this approach has the limitations that: a) there are theoretical and
> (more importantly) practical limitations on the size of a partition and b)
> rows within a partition are ordered and an index is built to exploit such
> ordering. For large partitions that overhead is significant if ordering isn't
> needed.
> In other words, routing keys afford a simple means to achieve scalable
> node-level co-locality without ordering while clustering keys afford
> page-level co-locality with ordering. As such, they address different
> co-locality needs giving the data modeler the flexibility to choose what is
> needed for their application.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
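
The placement rule in the description can be sketched in a few lines. This is only an illustration under stated assumptions: the `token` helper uses MD5 from the standard library as a stand-in for murmur3, and the four-node `RING`, `node_for`, `place_today` and `place_with_routing_key` are hypothetical names, not Cassandra APIs.

```python
import hashlib
from bisect import bisect_left


def token(*values) -> int:
    """Stand-in hash (MD5, not murmur3): first 8 bytes as a signed 64-bit token."""
    data = "|".join(str(v) for v in values).encode()
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big", signed=True)


# Hypothetical 4-node ring: each node owns the tokens up to and including its own.
RING = [(-2**62 - 1, "node0"), (-1, "node1"), (2**62 - 1, "node2"), (2**63 - 1, "node3")]
RING_TOKENS = [t for t, _ in RING]


def node_for(tok: int) -> str:
    """Return the first node whose ring token is >= tok."""
    return RING[bisect_left(RING_TOKENS, tok)][1]


def place_today(a: int, b: int) -> str:
    # Current behaviour: every partition key column (a and b) feeds the hash.
    return node_for(token(a, b))


def place_with_routing_key(a: int, b: int) -> str:
    # Proposal: only the routing key column a feeds the hash;
    # b still identifies the partition but does not influence placement.
    return node_for(token(a))


if __name__ == "__main__":
    print({place_today(7, b) for b in range(100)})
    print({place_with_routing_key(7, b) for b in range(100)})
```

With the routing key, every partition `(7, b)` resolves to one node, whereas hashing the full partition key scatters them across the ring.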
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139158#comment-15139158 ]

Jonathan Ellis commented on CASSANDRA-9231:

Repair: shouldn't be an issue now that we have incremental mode.

Compaction: unclear how much extra write amplification will happen vs having them in separate partitions but same machine. (vnode-based compaction doesn't help with either one.)

On balance I'd say we'd be well served by improving compaction in general.
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586256#comment-14586256 ]

Benjamin Coverston commented on CASSANDRA-9231:

I'm also -1 on adding UDFs into the mix, just on the merits of losing the token aware routing from the client. A simple designation of some of the partition key columns as routing keys would serve the use cases I'm aware of.
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534202#comment-14534202 ]

Sylvain Lebresne commented on CASSANDRA-9231:

What I'm talking about is basically the idea of CASSANDRA-5054. Or to put it another way, we could use a function like:
{noformat}
CREATE FUNCTION myTokenFct(a int, b int) RETURNS bigint AS $$
    long high = murmur3(a);
    long low = murmur3(b);
    return (high 0x) | (low 0x);
$$;
{noformat}
The goal being to make it likely that partitions with the same value for {{a}} are on a small number of nodes but without forcing everything on the same node (the latter having a fair amount of foot-shooting potential).

But that's really just an example. You could imagine actually having a specific table that is ordered (in a predictable way) without having to use {{ByteOrderedPartitioner}} for the whole cluster:
{noformat}
CREATE FUNCTION myOrderedTokenFct(a bigint) RETURNS bigint AS 'return a';

CREATE TABLE t (
    a int PRIMARY KEY,
    b text,
    c text
) with tokenizer=myOrderedTokenFct;
{noformat}
Basically, this gets you very close to a per-table partitioner. The actual partitioner would just define the domain of the tokens and how they sort, but the actual computation would be per-table. And this for very, very little change to the syntax and barely more complexity code-wise than the routing key idea.

Of course, this will be an advanced feature that people should use at their own risk. But that's true of the routing key idea too: we'd better label it as an advanced feature or I'm certain people will misuse it and shoot themselves in the foot more often than not.

This is also why I'm not too worried about the driver side: it's simple to say that if you use a custom token function, which will be rare in the first place, then you have to provide it to the driver too to get token awareness (which is not saying that this isn't a downside, but it's a very small one in practice and given the context).

Perhaps more importantly, I think the function idea is conceptually *simpler* than the routing key idea. All that you basically have to say is that we allow you to define the {{token}} function on a per-table basis, the exact same function that already exists and can be used in {{SELECT}}. The routing key concept (or whatever name we would pick), by contrast, is imo more confusing. You have to explain that on top of the _primary key_ having a subpart that is the _partition key_, you also have a subpart of the latter which is now the _routing key_. And how do you define what the _partition key_ is now in simple terms? Well, I don't know, because once you have a routing key that is different from the partition key, the partition key starts to be kind of an implementation detail. It's the thing that doesn't really determine where the row is distributed, but is not part of the clustering so you can't query it like a clustering column because ... because?

Honestly, allowing a custom {{token}} function per table is 1) more powerful and 2) imo much easier to explain conceptually, and this without blurring existing concepts.

So I'm a -1 on the routing key concept unless it is shown that the custom {{token}} function idea doesn't work, is substantially more complex to implement or has fundamental flaws I have missed. I would hate to add the routing key idea only to realize that some other user has a clever routing idea that is just not handled by the routing key (and to have to add some new custom concept).

bq. the distinct concept of token (which is more an implementation detail, IMO)

Your opinions are your own, but the token is most definitely *not* an implementation detail since 1) we have a {{token}} function in CQL to compute it and 2) we reference it all the time in the documentation, have scores of options that mention it, it's exposed by drivers, etc... Actually, the fact that we would use the token concept rather than adding a new custom one is part of why I'm convinced it's conceptually simpler: everyone who knows Cassandra knows of tokens.
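
The {{myTokenFct}} idea from this comment can be sketched as follows. The bit masks did not survive in the archived message, so `HIGH32` and `LOW32` below are assumptions (upper 32 bits from the hash of `a`, lower 32 bits from the hash of `b`); `h64` is an MD5-based stand-in for murmur3, and all names are hypothetical.

```python
import hashlib

HIGH32 = 0xFFFFFFFF00000000  # assumed mask: keep the upper 32 bits of hash(a)
LOW32 = 0x00000000FFFFFFFF   # assumed mask: keep the lower 32 bits of hash(b)


def h64(value) -> int:
    """Stand-in for murmur3: an unsigned 64-bit hash taken from MD5."""
    return int.from_bytes(hashlib.md5(str(value).encode()).digest()[:8], "big")


def my_token_fct(a: int, b: int) -> int:
    """Combine the high bits of hash(a) with the low bits of hash(b)."""
    combined = (h64(a) & HIGH32) | (h64(b) & LOW32)
    # Reinterpret the unsigned result as a signed 64-bit value (a CQL bigint).
    return combined - (1 << 64) if combined >= (1 << 63) else combined


if __name__ == "__main__":
    tokens = [my_token_fct(1, b) for b in range(5)]
    # Every token for a = 1 shares the same upper 32 bits.
    print([hex(t & 0xFFFFFFFFFFFFFFFF) for t in tokens])
```

Under these assumed masks, all partitions sharing a value of `a` land in one contiguous token range of width 2^32, which is what keeps them on a small number of nodes without pinning them to a single token.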
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534220#comment-14534220 ]

Benedict commented on CASSANDRA-9231:

The token is an implementation detail for the _concept_ of routing, or fair distribution. Perhaps we have different definitions of implementation detail, but I stand by it under my nomenclature, and the presence of a {{token}} function doesn't really change that.

My point is that from a data modelling perspective, being able to define the values on which you distribute is the concept you care about. The token that is ultimately used to deliver that is not important for you when modelling your system.
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535326#comment-14535326 ]

Benedict commented on CASSANDRA-9231:

bq. They wouldn't be providing arbitrary tokens, they would be providing arbitrary input to the hash function (for Random, MP3).

{code}
CREATE FUNCTION myOrderedTokenFct(a bigint) RETURNS bigint AS 'return a';

CREATE TABLE t (
    a int PRIMARY KEY,
    b text,
    c text
) with tokenizer=myOrderedTokenFct;
{code}

bq. Basically, this gets you very close to a per-table partitioner. The actual partitioner would just define the domain of the tokens and how they sort, but the actual computation would be per-table. And this for very, very little change to the syntax and barely more complexity code-wise than the routing key idea.
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535319#comment-14535319 ]

Tyler Hobbs commented on CASSANDRA-9231:

bq. However I would point out that letting the user provide an arbitrary token lets them, for instance, break the order preserving assumptions of BOP, or the fair distribution assumptions of the hash partitioner.

They wouldn't be providing arbitrary tokens, they would be providing arbitrary input to the hash function (for Random, MP3). The distribution would be approximately as fair as it would be without the transform step. For BOP they would maintain the order of whatever the function returns, which makes sense and seems like exactly what the user would want.

FWIW, I agree with Sylvain's preference for using functions rather than a routing key, for the same reasons he lists.
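
The reading of the proposal in this comment, where the user function only shapes the input to the partitioner's hash, might look like the sketch below. It assumes an MD5-based stand-in for the partitioner hash and a hypothetical ring that splits the signed 64-bit token range into equal slices; `partitioner_hash`, `token` and `node_of` are illustrative names only.

```python
import hashlib
from collections import Counter


def partitioner_hash(data: bytes) -> int:
    """Stand-in for the cluster partitioner's hash (MD5-based, not murmur3)."""
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big", signed=True)


def token(key, transform=lambda k: k) -> int:
    # The user function only transforms the hash *input*;
    # the partitioner still hashes whatever the function returns.
    return partitioner_hash(repr(transform(key)).encode())


def node_of(tok: int, n_nodes: int = 4) -> int:
    # Hypothetical ring: the signed 64-bit range cut into n_nodes equal slices.
    return ((tok + 2**63) * n_nodes) >> 64


keys = [(a, b) for a in range(1000) for b in range(10)]
plain = Counter(node_of(token(k)) for k in keys)
routed = Counter(node_of(token(k, transform=lambda k: k[0])) for k in keys)

if __name__ == "__main__":
    print("hash of (a, b):", sorted(plain.items()))
    print("hash of a only:", sorted(routed.items()))
```

Because the transformed value is still hashed, placement is driven by the number of distinct transform outputs (1000 values of `a` here) rather than by any structure in the raw keys.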
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534445#comment-14534445 ]

Aleksey Yeschenko commented on CASSANDRA-9231:

As it stands now, I'm -1 on involving UDFs here. The use case I have in mind is the only *real* use case I've heard, from just 2 users. They'd be better served by the less complicated designation of some of the partition key columns for calculating the token and don't need this extra power.

Don't have much to add, otherwise. The ticket is not - yet - urgent, there are at least a few months ahead before starting to work on it. I'm going to wait for some other use cases before I'm convinced that the full UDF approach makes any sense here, and put this issue on hold otherwise.
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534448#comment-14534448 ]

Benedict commented on CASSANDRA-9231:

I think we're just making the same arguments back and forth, so I'll leave it here for now. However I would point out that letting the user provide an arbitrary token lets them, for instance, break the order preserving assumptions of BOP, or the fair distribution assumptions of the hash partitioner. This latter in particular could lead to many future optimizations (e.g. CASSANDRA-7282) instead degrading such a cluster.
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534403#comment-14534403 ]

Benedict commented on CASSANDRA-9231:

bq. invalidate less documentation/existing assumptions

But we won't invalidate them: it will still be true of the partition key; the routing key would always be a subset of the partition key, so the statements still hold true. The difference is that the partition key distributes the data both within and without the node, whereas the routing key only without. So it's a refinement rather than a rewrite/invalidation.

bq. Besides, that's really only one of my point.

There are also two things that seem to be conflated in your proposal: per-table partitioners, and arbitrary functions as partitioners. The latter is more problematic than the former, since we need to know certain things about the token distribution, such as order preservation, midpoint calculation, random token creation; even ring description is apparently specialized (perhaps this can be abstracted, not sure).

However we can deliver a lot of the functionality you suggest with just arbitrary function application to the fields in the partition key when defining the routing key. I don't think this should be in the initial version, for the record, but defining {{PRIMARY KEY (( [truncate(a),b] a, b), ...)}} would achieve the same goal.

Permitting per-table IPartitioner declarations also seems like a good thing to support, but seems a different goal to me; that's an even lower level decision, and really all you want is hashed/partitioned. But you want those to be _good_ at their jobs; if you screw that up, C* may behave unexpectedly.
Support Routing Key as part of Partition Key

Key: CASSANDRA-9231
URL: https://issues.apache.org/jira/browse/CASSANDRA-9231
Project: Cassandra
Issue Type: Wish
Components: Core
Reporter: Matthias Broecheler
Fix For: 3.x

Provide support for sub-dividing the partition key into a routing key and a non-routing key component. Currently, all columns that make up the partition key of the primary key are also routing keys, i.e. they determine which nodes store the data. This proposal would give the data modeler the ability to designate only a subset of the columns that comprise the partition key to be routing keys. The non-routing key columns of the partition key identify the partition but are not used to determine where to store the data.

Consider the following example table definition:

CREATE TABLE foo (
  a int,
  b int,
  c int,
  d int,
  PRIMARY KEY (([a], b), c ) );

(a,b) is the partition key, c is the clustering key, and d is just a column. In addition, the square brackets identify the routing key as column a. This means that only the value of column a is used to determine the node for data placement (i.e. only the value of column a is murmur3 hashed to compute the token). In addition, column b is needed to identify the partition but does not influence the placement.

This has the benefit that all rows with the same routing key (but potentially different non-routing key columns of the partition key) are stored on the same node and that knowledge of such co-locality can be exploited by applications built on top of Cassandra.

Currently, the only way to achieve co-locality is within a partition. However, this approach has the limitations that: a) there are theoretical and (more importantly) practical limitations on the size of a partition and b) rows within a partition are ordered and an index is built to exploit such ordering. For large partitions that overhead is significant if ordering isn't needed.

In other words, routing keys afford a simple means to achieve scalable node-level co-locality without ordering while clustering keys afford page-level co-locality with ordering. As such, they address different co-locality needs, giving the data modeler the flexibility to choose what is needed for their application.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
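The placement rule in the description above can be illustrated with a small sketch. It is not Cassandra code: the four-node ring, the node names, and the helper names `token`, `owner`, and `place` are all hypothetical, and BLAKE2b from the Python standard library stands in for murmur3 purely so the example runs without extra dependencies.

```python
import hashlib

def token(routing_values):
    """Hash only the given values into a signed 64-bit token.

    BLAKE2b is a stand-in for murmur3 (assumption made for this sketch).
    """
    data = b"\x00".join(str(v).encode() for v in routing_values)
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)

# Hypothetical ring: (upper token bound, node), sorted by token.
RING = [(-2**62, "node1"), (0, "node2"), (2**62, "node3"), (2**63 - 1, "node4")]

def owner(tok, ring=RING):
    """Return the node owning the first ring token >= tok, wrapping around."""
    for ring_token, node in ring:
        if tok <= ring_token:
            return node
    return ring[0][1]

def place(a, b, routing_key_only=True):
    """Place partition (a, b).

    With routing_key_only=True this mimics PRIMARY KEY (([a], b), c):
    only column a feeds the token. With False it mimics today's behaviour,
    where the whole partition key (a, b) is hashed.
    """
    routing = (a,) if routing_key_only else (a, b)
    return owner(token(routing))

if __name__ == "__main__":
    # All partitions sharing a = 7 land on one node when only a is routed on.
    print({place(7, b) for b in range(100)})
    # Hashing the full partition key spreads the same partitions out.
    print({place(7, b, routing_key_only=False) for b in range(100)})
```

The point of the sketch is only that b still distinguishes partitions while having no influence on which node stores them.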
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534376#comment-14534376 ] Sylvain Lebresne commented on CASSANDRA-9231:

bq. My point is that from a data modelling perspective, being able to define the values on which you distribute is the concept you care about.

Then we agree. But my problem is that this is *exactly* what the partition key is about: it's its purpose, and it's how we explain and define it. Changing that purpose now is confusing (and if that's not the purpose of the partition key anymore, I'm not even sure what purpose it actually has, or how you would define it simply). Which is why I'm convinced we'll create less confusion and invalidate less documentation/existing assumptions by simply adding an option to define the token function. In that case, the fundamental concept stays the same and the partition key still defines the values used for distribution. But the exact way they are used, which already depends on the partitioner today, gains some more flexibility as it can be user defined. The fact that you can write functions that use only some of those values becomes an implementation detail; the concept of the partition key is preserved. I don't think changing the meaning of fundamental concepts, or multiplying them, is a good idea.

Besides, that's really only one of my points. We have had people wanting to do fancy things with the partitioner many times, but so far the fact that the partitioner is cluster-wide, and that making it per-table is pretty annoying, has limited what can be done. The use case of the description is really just one special case. Assuming that it's the only smart thing we can do when it comes to computing the token from the partition key feels a bit short-sighted to me. It's an advanced feature for power users anyway, so let's at least make it powerful.
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14523197#comment-14523197 ] Benedict commented on CASSANDRA-9231:

Personally I think it is clearer to have a routing key as part of the primary key than to have a special tokenizer function. It's also syntactically cleaner. Since the user understands the indirection of clustering versus partition key, it isn't a tall order for them to understand a routing key, and it fits more neatly into a mental model than the distinct concept of a token (which is more an implementation detail, IMO). I agree it is marginally less general, but the two are not mutually exclusive. It is possible for us in future to support function application to fabricate a column inside the routing key declaration only.
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14523213#comment-14523213 ] Aleksey Yeschenko commented on CASSANDRA-9231:

I also want to add that if we did choose this way (routing key as part of the partition key), I'd vote for {{DESCRIBE}} *not* indicating the routing part if it exactly matches the whole partition key. Most users won't be confused and won't need to know about the distinction unless they explicitly use the functionality. It's okay to hide it, it being a relatively advanced opt-in feature.
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510753#comment-14510753 ] Robert Stupp commented on CASSANDRA-9231:

I just want to avoid drivers having to implement the whole UDF execution machinery (which could be difficult for non-Java drivers ;) ). Drivers could possibly accept "native" functions from the client code to calculate the routing key if they really need to optimize for token-aware routing.
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510713#comment-14510713 ] Robert Stupp commented on CASSANDRA-9231:

Using UDFs for the routing-key looks nice. But I doubt that drivers would be able to compute the routing-key for token-aware routing.
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510729#comment-14510729 ] Sylvain Lebresne commented on CASSANDRA-9231:

Not automagically, but it's easy enough to make drivers accept custom functions for token-aware routing. And I'm fine with providing a couple of native functions for the most common cases (like using only the ith component of the partition key, as in the description), which drivers could recognize automagically if they want to. That would still leave the ability to do more complex things.
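The driver-side idea in the comment above (drivers accepting custom functions for token-aware routing, plus a recognizable built-in that uses only the ith partition-key component) might look roughly like the sketch below. Everything in it is hypothetical: `TokenAwareRouter`, `component_token`, and `default_token` are illustrative names, not part of any existing driver API, and BLAKE2b again stands in for murmur3.

```python
import hashlib
from typing import Callable, Dict, Sequence, Tuple

# A token function maps the partition-key values to a signed 64-bit token.
TokenFn = Callable[[Sequence[object]], int]

def _hash64(values: Sequence[object]) -> int:
    """Stand-in hash (BLAKE2b, not murmur3) producing a signed 64-bit token."""
    data = b"\x00".join(str(v).encode() for v in values)
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)

def default_token(partition_key: Sequence[object]) -> int:
    """Current behaviour: every partition-key component feeds the token."""
    return _hash64(partition_key)

def component_token(i: int) -> TokenFn:
    """Hypothetical built-in: use only the ith partition-key component."""
    def fn(partition_key: Sequence[object]) -> int:
        return _hash64([partition_key[i]])
    return fn

class TokenAwareRouter:
    """Hypothetical client-side registry of per-table token functions."""

    def __init__(self) -> None:
        self._fns: Dict[Tuple[str, str], TokenFn] = {}

    def register(self, keyspace: str, table: str, fn: TokenFn) -> None:
        """Tell the router how a given table computes its token."""
        self._fns[(keyspace, table)] = fn

    def token_for(self, keyspace: str, table: str,
                  partition_key: Sequence[object]) -> int:
        """Compute the token, falling back to hashing the full partition key."""
        fn = self._fns.get((keyspace, table), default_token)
        return fn(partition_key)

if __name__ == "__main__":
    router = TokenAwareRouter()
    router.register("ks", "foo", component_token(0))
    # Same first component, so the same token regardless of the second one.
    print(router.token_for("ks", "foo", (1, 2)) == router.token_for("ks", "foo", (1, 99)))
```

A driver built this way never has to execute server-side UDFs; the application supplies an equivalent client-side function, which is the trade-off discussed in this part of the thread.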
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510656#comment-14510656 ] Sylvain Lebresne commented on CASSANDRA-9231:

If we do this, I have a strong preference for exposing it as a way to define a custom function for computing the token. So the example above would be written something like:

{noformat}
CREATE FUNCTION myCustomHash(a int, b int) RETURNS bigint AS 'return murmur3(a)';

CREATE TABLE foo (
  a int,
  b int,
  c int,
  d int,
  PRIMARY KEY ((a, b), c)
) WITH tokenizer=myCustomHash;
{noformat}

That's imo more generic, and I don't like adding a notion of routing key when we already have primary key and partition key, which are enough keys (and internally the routing key is really just the token, so there is no point in having a new notion).
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14511049#comment-14511049 ] Aleksey Yeschenko commented on CASSANDRA-9231:

bq. Except that it's not at all the same result as the one I described.

Can you give me an example then? Ideally something that the driver would still be able to understand.
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14511011#comment-14511011 ] Aleksey Yeschenko commented on CASSANDRA-9231:

You'd be able to use more than one component of the partition key. Using the originally proposed syntax (strictly as an example) you could have {{PRIMARY KEY (([a, b, c], d), e, f)}}. Ultimately, for non-routing purposes, the order of the columns in the partition key doesn't matter at all, and the user has full control, so they can reorder/split them as necessary and get the same result.
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14511017#comment-14511017 ] Sylvain Lebresne commented on CASSANDRA-9231:

bq. so they can reorder/split them as necessary and get the same result

Except that it's not at all the same result as the one I described.
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510979#comment-14510979 ] Aleksey Yeschenko commented on CASSANDRA-9231:

I have an equally strong preference to not overcomplicate and overgeneralise this, and just dedicate part of the partition key to routing, not use functions. Don't have to call it a 'routing key', and I'm open to other syntax suggestions though.
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510995#comment-14510995 ]

Sylvain Lebresne commented on CASSANDRA-9231:

> I have an equally strong preference to not overcomplicate and overgeneralise this

Well, I disagree that it's *over*generalization; it's just generalization, and generalization doesn't always mean more complexity. In fact, it's imo simpler to use functions than to come up with a new custom concept.

Perhaps more importantly, I think that something potentially *more* useful than just using one component of the partition key would be to use both components, but use the first one only for the first half of the token and the second one for the second half. The result would be that partitions having the same first component would be on the same replica or some small number of replicas, but with still some scaling properties if you have very many partitions having the same first component.
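The split-token idea in Sylvain's comment can be sketched roughly as follows. This is only an illustration under stated assumptions: `hash32` and `split_token` are hypothetical names, the hash is a SHA-256 stand-in rather than murmur3, and the token is treated as an unsigned 64-bit value for simplicity. Nothing here reflects an actual Cassandra partitioner.

```python
import hashlib


def hash32(value: int) -> int:
    """Stand-in 32-bit hash (an assumption, NOT murmur3):
    the first 4 bytes of the SHA-256 digest of the value."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def split_token(first: int, second: int) -> int:
    """Hypothetical 64-bit token: the high 32 bits come from the first
    partition key component, the low 32 bits from the second."""
    return (hash32(first) << 32) | hash32(second)


# Partitions sharing the first component share the high half of the token,
# so they fall into one contiguous slice of the token range.
t1 = split_token(1, 2)
t2 = split_token(1, 99)
print(hex(t1 >> 32) == hex(t2 >> 32))
```

Because the low half still varies with the second component, a very large group of partitions with the same first component could span more than one token range instead of landing on a single replica set.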
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509551#comment-14509551 ]

Aleksey Yeschenko commented on CASSANDRA-9231:

Additionally, when/if we have CASSANDRA-8857, we'd be able to meaningfully batch partition lookups to different tables so long as the routing key is the same, in a single roundtrip, relying on their co-locality.
[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key
[ https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509543#comment-14509543 ]

Aleksey Yeschenko commented on CASSANDRA-9231:

I've got a couple more use cases for the feature.

If we implement this, we'll start grouping Mutation objects by {keyspace, routing key} tuples instead of {keyspace, partition key} tuples, as we do now. This means that for tables that share the same routing key, but different remaining partition keys, we'd now be able to put them in the same Mutation, and add both updates atomically to the commitlog.

This would allow us to get batchlog functionality basically for free for the updates that share the same routing key, be it the same table or several different ones.
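The change in Mutation grouping described in the comment above can be illustrated with a small sketch. The `Update` type and both grouping functions are hypothetical and exist only for this example; they are not Cassandra classes, and the real Mutation code path is not shown here.

```python
from collections import defaultdict
from typing import NamedTuple


class Update(NamedTuple):
    """Hypothetical stand-in for one table update (not a Cassandra class)."""
    keyspace: str
    table: str
    routing_key: int   # e.g. column a
    rest_of_key: int   # e.g. column b, the non-routing part of the partition key
    payload: str


def group_by_partition_key(updates):
    """Grouping as described for today: one group per {keyspace, partition key}."""
    groups = defaultdict(list)
    for u in updates:
        groups[(u.keyspace, u.routing_key, u.rest_of_key)].append(u)
    return dict(groups)


def group_by_routing_key(updates):
    """Proposed grouping: one group per {keyspace, routing key}."""
    groups = defaultdict(list)
    for u in updates:
        groups[(u.keyspace, u.routing_key)].append(u)
    return dict(groups)


updates = [
    Update("ks", "t1", 1, 10, "x"),
    Update("ks", "t2", 1, 20, "y"),  # same routing key, different partition
    Update("ks", "t1", 2, 10, "z"),
]
print(len(group_by_partition_key(updates)), len(group_by_routing_key(updates)))
```

In this sketch the first two updates end up in the same group only under the routing-key grouping, which is the case the comment says could be written to the commitlog together.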