[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2016-02-09 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139147#comment-15139147
 ] 

Jeremiah Jordan commented on CASSANDRA-9231:


I think we probably have other issues to solve besides CASSANDRA-9754 before 
multi-GB partitions are viable. Won't you still have operational issues around 
repairing and compacting them?

> Support Routing Key as part of Partition Key
> 
>
> Key: CASSANDRA-9231
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9231
> Project: Cassandra
> Issue Type: Wish
> Reporter: Matthias Broecheler
>
> Provide support for sub-dividing the partition key into a routing key and a 
> non-routing key component. Currently, all columns that make up the partition 
> key of the primary key are also routing keys, i.e. they determine which nodes 
> store the data. This proposal would give the data modeler the ability to 
> designate only a subset of the columns that comprise the partition key to be 
> routing keys. The non-routing key columns of the partition key identify the 
> partition but are not used to determine where to store the data.
> Consider the following example table definition:
> CREATE TABLE foo (
>   a int,
>   b int,
>   c int,
>   d int,
>   PRIMARY KEY  (([a], b), c ) );
> (a,b) is the partition key, c is the clustering key, and d is just a column. 
> The square brackets identify column a as the routing key. This means that 
> only the value of column a is used to determine the node for data placement 
> (i.e. only the value of column a is murmur3 hashed to compute the token). 
> Column b is still needed to identify the partition but does not influence 
> the placement.
> This has the benefit that all rows with the same routing key (but potentially 
> different non-routing key columns of the partition key) are stored on the 
> same node, and that knowledge of such co-locality can be exploited by 
> applications built on top of Cassandra.
> Currently, the only way to achieve co-locality is within a partition. 
> However, this approach has two limitations: a) there are theoretical and 
> (more importantly) practical limits on the size of a partition, and b) 
> rows within a partition are ordered and an index is built to exploit such 
> ordering. For large partitions that overhead is significant if ordering isn't 
> needed.
> In other words, routing keys afford a simple means to achieve scalable 
> node-level co-locality without ordering while clustering keys afford 
> page-level co-locality with ordering. As such, they address different 
> co-locality needs giving the data modeler the flexibility to choose what is 
> needed for their application.
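
The placement rule described in the issue can be sketched in Python as follows. This is only an illustration under stated assumptions: `stand_in_token` is a hypothetical helper that uses MD5 in place of murmur3, and the way columns are serialized is invented for the example, so the token values do not match what Cassandra would compute.

```python
import hashlib
import struct

def stand_in_token(*columns: int) -> int:
    """Signed 64-bit token from MD5; a stand-in for the murmur3 hash (assumption)."""
    data = b"".join(struct.pack(">q", c) for c in columns)
    return struct.unpack(">q", hashlib.md5(data).digest()[:8])[0]

def current_token(a: int, b: int) -> int:
    # Today: every partition key column feeds the token.
    return stand_in_token(a, b)

def routing_token(a: int, b: int) -> int:
    # Proposed: only the routing key column a feeds the token.
    # b still identifies the partition but is ignored for placement.
    return stand_in_token(a)

partitions = [(1, b) for b in range(4)]
print(len({routing_token(a, b) for a, b in partitions}))  # 1: one token, so one replica set
print(len({current_token(a, b) for a, b in partitions}))  # one token per partition
```

With the routing token, all four partitions sharing a = 1 map to a single token and therefore to the same replicas, while remaining separate partitions.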



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2016-02-09 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139158#comment-15139158
 ] 

Jonathan Ellis commented on CASSANDRA-9231:
---

Repair: shouldn't be an issue now that we have incremental mode.

Compaction: unclear how much extra write amplification will happen vs having 
them in separate partitions but same machine.  (vnode-based compaction doesn't 
help with either one.)  On balance I'd say we'd be well served by improving 
compaction in general.



[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-06-15 Thread Benjamin Coverston (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586256#comment-14586256
 ] 

Benjamin Coverston commented on CASSANDRA-9231:
---

I'm also -1 on adding UDFs into the mix, just on the merits of losing the 
token-aware routing from the client. A simple designation of some of the 
partition key columns as routing keys would serve the use cases I'm aware of.

 Support Routing Key as part of Partition Key
 

 Key: CASSANDRA-9231
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9231
 Project: Cassandra
  Issue Type: Wish
  Components: Core
Reporter: Matthias Broecheler
 Fix For: 3.x




[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-05-08 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534202#comment-14534202
 ] 

Sylvain Lebresne commented on CASSANDRA-9231:
-

What I'm talking about is basically the idea of CASSANDRA-5054. Or to put it 
another way, we could use a function like:
{noformat}
CREATE FUNCTION myTokenFct(a int, b int) RETURNS bigint AS 
$$
long high = murmur3(a);
long low = murmur3(b);
return (high & 0xFFFFFFFF00000000L) | (low & 0x00000000FFFFFFFFL);
$$;
{noformat}
The goal being to make it likely that partitions with the same value for {{a}} 
are on a small number of nodes but without forcing everything onto the same node 
(the latter having a fair amount of foot-shooting potential). But that's really 
just an example. You could also imagine having a specific table that is 
ordered (in a predictable way) without having to use {{ByteOrderedPartitioner}} 
for the whole cluster:
{noformat}
CREATE FUNCTION myOrderedTokenFct(a bigint) RETURNS bigint AS 'return a';
CREATE TABLE t (
   a int PRIMARY KEY,
   b text,
   c text
) with tokenizer=myOrderedTokenFct;
{noformat}

Basically, this gets you very close to a per-table partitioner. The actual 
partitioner would just define the domain of the tokens and how they sort, but 
the actual computation would be per-table. And this for very, very little 
change to the syntax and barely more complexity code-wise than the routing 
key idea.
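
A rough Python rendering of the two token functions discussed above may help. It is a sketch under assumptions: `h64` is a hypothetical MD5-based stand-in for murmur3, and the split into a high half derived from {{a}} and a low half derived from {{b}} (32 bits each) is a guess at the intent, not something the comment spells out.

```python
import hashlib
import struct

def h64(value: int) -> int:
    """Unsigned 64-bit stand-in hash built on MD5 (murmur3 in the comment above)."""
    digest = hashlib.md5(struct.pack(">q", value)).digest()
    return struct.unpack(">Q", digest[:8])[0]

def my_token_fct(a: int, b: int) -> int:
    # High 32 bits are derived from a, low 32 bits from b (assumed split).
    high = h64(a) & 0xFFFFFFFF00000000
    low = h64(b) & 0x00000000FFFFFFFF
    unsigned = high | low
    # Reinterpret as a signed 64-bit value, like a CQL bigint.
    return unsigned - (1 << 64) if unsigned >= (1 << 63) else unsigned

def my_ordered_token_fct(a: int) -> int:
    # Identity token: token order equals key order for this table.
    return a

t1, t2 = my_token_fct(7, 1), my_token_fct(7, 2)
print(t1 >> 32 == t2 >> 32)  # True: same a, so the tokens share their high half
print(sorted([5, -3, 9], key=my_ordered_token_fct))  # [-3, 5, 9]
```

Partitions with the same {{a}} end up in one narrow slice of the ring rather than on one exact token, which is the "small number of nodes" effect described above.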

Of course, this will be an advanced feature that people should use at their own 
risk.  But that's true of the routing key idea too: we'd better label it as 
an advanced feature or I'm certain people will misuse it and shoot themselves 
in the foot more often than not. This is also why I'm not too worried about the 
driver side: it's simple to say that if you use a custom token function, 
which will be rare in the first place, then you have to provide it to the 
driver too to get token awareness (which is not to say that this isn't a 
downside, but it's a very small one in practice and given the context).
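
To make the driver point concrete, here is a toy model of token-aware routing in Python. Everything in it is hypothetical: the `Ring` class, the three node tokens, and `default_token` are invented for illustration and do not correspond to any real driver API.

```python
import bisect
from typing import Callable, Dict, Tuple

class Ring:
    """Toy token ring: a node owns the tokens up to and including its own."""

    def __init__(self, node_tokens: Dict[str, int]) -> None:
        pairs = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in pairs]
        self._nodes = [node for _, node in pairs]

    def owner(self, token: int) -> str:
        # First node whose token is >= the given token, wrapping around.
        index = bisect.bisect_left(self._tokens, token)
        return self._nodes[index % len(self._nodes)]

def route(ring: Ring, token_fn: Callable[..., int], key: Tuple[int, ...]) -> str:
    """Pick the coordinator the way a token-aware client would."""
    return ring.owner(token_fn(*key))

def table_token(a: int) -> int:
    # The table's custom token function (identity, as in myOrderedTokenFct).
    return a

def default_token(a: int) -> int:
    # Hypothetical default hash a driver falls back to without the custom function.
    return (a * 37) % 400

ring = Ring({"n1": 100, "n2": 200, "n3": 300})
print(route(ring, table_token, (150,)))    # n2, the node that owns the row
print(route(ring, default_token, (150,)))  # n1, a non-owner: token awareness is lost
```

A client that lacks the table's function still gets correct results, since any node can coordinate, but it pays an extra hop.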

Perhaps more importantly, I think the function idea is conceptually *simpler* 
than the routing key idea. All that you basically have to say is that we allow 
you to define the {{token}} function on a per-table basis, the exact same 
function that already exists and can be used in {{SELECT}}.

The routing key concept (or whatever name we would pick), on the other hand, 
is imo more confusing. You have to explain that on top of the _primary key_ 
having a subpart that is the _partition key_, you also have a subpart of the 
latter which is now the _routing key_. And how do you define what the 
_partition key_ is now in simple terms? Well, I don't know, because once you 
have a routing key that is different from the partition key, the partition key 
starts to be kind of an implementation detail. It's the thing that doesn't 
really determine where the row is distributed, but is not part of the 
clustering so you can't query it like a clustering column because ... because?

Honestly, allowing a custom {{token}} function per table is 1) more 
powerful and 2) imo much easier to explain conceptually, and this without 
blurring existing concepts. So I'm a -1 on the routing key concept unless it's 
shown that the custom {{token}} function idea doesn't work, is substantially 
more complex to implement or has fundamental flaws I have missed. I would hate 
to add the routing key idea only to realize that some other user has a clever 
routing idea that is just not handled by the routing key (and have to add 
some new custom concept).

bq. the distinct concept of token (which is more an implementation detail, 
IMO)

Your opinions are your own, but the token is most definitely *not* an 
implementation detail since 1) we have a {{token}} function in CQL to compute 
it and 2) we reference it all the time in the documentation, have scores of 
options that mention it, it's exposed by drivers, etc... Actually, the fact 
that we would use the token concept rather than adding a new custom one is part 
of why I'm convinced it's conceptually simpler: everyone who knows Cassandra 
knows of tokens.



[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-05-08 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534220#comment-14534220
 ] 

Benedict commented on CASSANDRA-9231:
-

The token is an implementation detail for the _concept_ of routing, or fair 
distribution. Perhaps we have different definitions of implementation detail, 
but I stand by it under my nomenclature, and the presence of a {{token}} 
function doesn't really change that.

My point is that from a data modelling perspective, being able to define the 
values on which you distribute is the concept you care about. The token that is 
ultimately used to deliver that is not important for you when modelling your 
system.



[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-05-08 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535326#comment-14535326
 ] 

Benedict commented on CASSANDRA-9231:
-

bq. They wouldn't be providing arbitrary tokens, they would be providing 
arbitrary input to the hash function (for RandomPartitioner and Murmur3Partitioner).

{code}
CREATE FUNCTION myOrderedTokenFct(a bigint) RETURNS bigint AS 'return a';
CREATE TABLE t (
   a int PRIMARY KEY,
   b text,
   c text
) with tokenizer=myOrderedTokenFct;
{code}
 
bq. Basically, this gets you very close to a per-table partitioner. The actual 
partitioner would just define the domain of the tokens and how they sort, but 
the actual computation would be per-table. And this for very, very little 
change to the syntax and barely more complexity code-wise than the routing 
key idea.





[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-05-08 Thread Tyler Hobbs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535319#comment-14535319
 ] 

Tyler Hobbs commented on CASSANDRA-9231:


bq.  However I would point out that letting the user provide an arbitrary token 
lets them, for instance, break the order preserving assumptions of BOP, or 
the fair distribution assumptions of the hash partitioner.

They wouldn't be providing arbitrary tokens, they would be providing arbitrary 
input to the hash function (for RandomPartitioner and Murmur3Partitioner).  The 
distribution would be approximately as fair as it would be without the 
transform step.

For BOP they would maintain the order of whatever the function returns, which 
makes sense and seems like exactly what the user would want.

FWIW, I agree with Sylvain's preference for using functions rather than a 
routing key, for the same reasons he lists.
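
The "transform, then hash" reading can be sketched in Python like this. It is a minimal illustration under assumptions: `transform` is a hypothetical user function that forwards only column a, and MD5 stands in for the partitioner's hash.

```python
import hashlib
import struct
from collections import Counter

def hash_token(data: bytes) -> int:
    """Stand-in for the partitioner's hash: unsigned 64-bit value from MD5."""
    return struct.unpack(">Q", hashlib.md5(data).digest()[:8])[0]

def transform(a: int, b: int) -> bytes:
    # Hypothetical user function: only a is passed on to the hash.
    return struct.pack(">q", a)

def quarter(token: int) -> int:
    # Which quarter of the 64-bit token space the token falls into.
    return token >> 62

keys = [(a, b) for a in range(2000) for b in range(3)]
load = Counter(quarter(hash_token(transform(a, b))) for a, b in keys)
print(sorted(load.items()))  # four quarters with broadly similar row counts
```

Because the user function only chooses the hash input, the tokens are still hash outputs, so the spread stays close to even as long as the transformed values are themselves numerous and varied.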



[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-05-08 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534445#comment-14534445
 ] 

Aleksey Yeschenko commented on CASSANDRA-9231:
--

As it stands now, I'm -1 on involving UDFs here. The use case I have in mind is 
the only *real* use case I've heard, from just 2 users. They'd be better served 
by the less complicated designation of some of the partition key columns for 
calculating the token and don't need this extra power.

I don't have much to add otherwise. The ticket is not - yet - urgent; there are 
at least a few months before work on it would start. I'm going to wait 
for some other use cases before I'm convinced that the full UDF approach makes 
any sense here, and put this issue on hold otherwise.



[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-05-08 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534448#comment-14534448
 ] 

Benedict commented on CASSANDRA-9231:
-

I think we're just making the same arguments back and forth, so I'll leave it 
here for now. However I would point out that letting the user provide an 
arbitrary token lets them, for instance, break the order preserving 
assumptions of BOP, or the fair distribution assumptions of the hash 
partitioner. The latter in particular could cause many future optimizations 
(e.g. CASSANDRA-7282) to degrade such a cluster instead of helping it.
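
The fair-distribution concern can be illustrated with a small Python sketch. The four-node ring and both token functions are hypothetical, and MD5 stands in for a real hash partitioner.

```python
import bisect
import hashlib
import struct
from collections import Counter

# Four nodes splitting the signed 64-bit token range into equal quarters.
NODE_TOKENS = [-(2**62), 0, 2**62, 2**63 - 1]

def owner(token: int) -> int:
    # Index of the first node whose token is >= the given token.
    return bisect.bisect_left(NODE_TOKENS, token)

def hashed_token(key: int) -> int:
    """Stand-in for a hash partitioner: signed 64-bit value from MD5."""
    digest = hashlib.md5(struct.pack(">q", key)).digest()
    return struct.unpack(">q", digest[:8])[0]

def identity_token(key: int) -> int:
    # An arbitrary user token function: the key itself is the token.
    return key

keys = range(1, 1001)
print(Counter(owner(identity_token(k)) for k in keys))  # Counter({2: 1000})
print(Counter(owner(hashed_token(k)) for k in keys))    # spread over all four nodes
```

With the identity function every small key lands on a single node, which is the kind of skew a hash partitioner is normally assumed to rule out.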



[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-05-08 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534403#comment-14534403
 ] 

Benedict commented on CASSANDRA-9231:
-

bq.  invalidate less documentation/existing assumptions

But we won't invalidate them: it will still be true of the partition key; the 
routing key would always be a subset of the partition key, so the statements 
still hold true. The difference is that the partition key distributes the data 
both within and without the node, whereas the routing key only without. So it's 
a refinement rather than a rewrite/invalidation.

bq. Besides, that's really only one of my points.

There are also two things that seem to be conflated in your proposal: per table 
partitioners, and arbitrary functions as partitioners. The latter is more 
problematic than the former, since we need to know certain things about the 
token distribution, such as order preservation, midpoint calculation, random 
token creation; even ring description is apparently specialized (perhaps this 
can be abstracted, not sure). 

However we can deliver a lot of the functionality you suggest with just 
arbitrary function application to the fields in the partition key when defining 
the routing key. I don't think this should be in the initial version, for the 
record, but defining {{PRIMARY KEY (( [truncate(a),b] a, b), ...)}} would 
achieve the same goal. 
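
As an illustration of applying a function to a field when defining the routing key, here is a Python sketch. The semantics of {{truncate}} are not defined in this thread, so the rounding below is purely an assumption, as is the MD5-based `stand_in_token` helper.

```python
import hashlib
import struct

def stand_in_token(*columns: int) -> int:
    """Signed 64-bit token from MD5; a stand-in for the partitioner's hash."""
    data = b"".join(struct.pack(">q", c) for c in columns)
    return struct.unpack(">q", hashlib.md5(data).digest()[:8])[0]

def truncate(a: int, width: int = 1000) -> int:
    # Assumed meaning: round a down to a multiple of width.
    return a - a % width

def routing_token(a: int, b: int) -> int:
    # Routing key is [truncate(a), b]; the partition key stays (a, b).
    return stand_in_token(truncate(a), b)

print(routing_token(1001, 5) == routing_token(1999, 5))  # True: same bucket of a, same b
print(routing_token(1001, 5) == routing_token(2001, 5))  # False, barring a hash collision
```

Partitions whose a values fall into the same bucket are routed together, while the token itself is still produced by the partitioner's hash.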

Permitting per-table IPartitioner declarations also seems like a good thing to 
support, but seems a different goal to me; that's an even lower level decision, 
and really all you want is hashed/partitioned. But you want those to be _good_ 
at their jobs; if you screw that up, C* may behave unexpectedly.
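
To illustrate the within-node versus across-node distinction, here is a minimal Python sketch (not Cassandra code). It assumes a routing key of {{(truncate(a), b)}} over a partition key of {{(a, b)}}. The bucket width used by {{truncate}}, the MD5-based stand-in for murmur3, the modulo placement and every function name are assumptions made for illustration only.

```python
import hashlib
import struct
from collections import defaultdict


def stand_in_hash(data: bytes) -> int:
    """Stand-in for murmur3 (assumption): signed 64-bit value taken from MD5."""
    return struct.unpack(">q", hashlib.md5(data).digest()[:8])[0]


def truncate(a: int, width: int = 100) -> int:
    """Hypothetical truncate(): round a down to a multiple of width."""
    return a - (a % width)


def routing_token(a: int, b: int) -> int:
    # The routing key (truncate(a), b) decides placement across nodes.
    return stand_in_hash(struct.pack(">ii", truncate(a), b))


def place(rows, node_count: int):
    """Group partitions by node; each node still keys its data by the full (a, b)."""
    nodes = defaultdict(dict)
    for a, b, value in rows:
        node = routing_token(a, b) % node_count  # simplistic placement, not a real ring
        nodes[node][(a, b)] = value  # the partition key separates data within the node
    return nodes
```

With these assumptions, partitions {{(101, 7)}} and {{(150, 7)}} share a routing key and land on the same node, yet remain two distinct partitions on that node.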

 Support Routing Key as part of Partition Key
 

 Key: CASSANDRA-9231
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9231
 Project: Cassandra
  Issue Type: Wish
  Components: Core
Reporter: Matthias Broecheler
 Fix For: 3.x




[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-05-08 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534376#comment-14534376
 ] 

Sylvain Lebresne commented on CASSANDRA-9231:
-

bq. My point is that from a data modelling perspective, being able to define 
the values on which you distribute is the concept you care about.

Then we agree. But my problem is that it is *exactly* what the partition key is 
about: it's its purpose, and how we explain and define it. Changing that purpose 
now is confusing (and if that's not the purpose of the partition key anymore, 
I'm not even sure what purpose it actually has, or how you define it simply).

Which is why I'm convinced we'll create less confusion and invalidate less 
documentation/existing assumptions by simply adding an option to define the 
token function. In that case, the fundamental concept stays the same and the 
partition key still defines the values used for distribution. But the exact way 
they are used, which already depends on the partitioner today, gains some more 
flexibility as it can be user-defined. The fact that you can write functions 
that use only some of those values becomes an implementation detail; the 
concept of the partition key is preserved. I don't think changing the meaning 
of fundamental concepts, or multiplying them, is a good idea.

Besides, that's really only one of my points. Many times, people have 
wanted to do fancy things with the partitioner, but so far the fact that the 
partitioner is cluster-wide, and that making it per-table is pretty annoying, 
has limited what can be done. The use case in the description is really just 
one special case. Assuming that it's the only smart thing we can do when it 
comes to computing the token from the partition key feels a bit short-sighted 
to me. It's an advanced feature for power users anyway, so let's at least make 
it powerful.


 Support Routing Key as part of Partition Key
 

 Key: CASSANDRA-9231
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9231
 Project: Cassandra
  Issue Type: Wish
  Components: Core
Reporter: Matthias Broecheler
 Fix For: 3.x




[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-05-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523197#comment-14523197
 ] 

Benedict commented on CASSANDRA-9231:
-

Personally I think it is clearer having a routing key as a part of the primary 
key than having a special tokenizer function. It's also syntactically cleaner. 
Since the user understands the indirection of clustering versus partition key, 
it isn't a tall order for them to understand a routing key, and it fits more 
neatly into a mental model than the distinct concept of token (which is more 
an implementation detail, IMO). I agree it is marginally less general, but it's 
not mutually exclusive. It is possible for us in future to support function 
application to fabricate a column inside the routing key declaration only.

 Support Routing Key as part of Partition Key
 

 Key: CASSANDRA-9231
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9231
 Project: Cassandra
  Issue Type: Wish
  Components: Core
Reporter: Matthias Broecheler
 Fix For: 3.x




[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-05-01 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523213#comment-14523213
 ] 

Aleksey Yeschenko commented on CASSANDRA-9231:
--

I also want to add that if we did choose this way (routing key as part of the 
partition key), I'd vote for {{DESCRIBE}} *not* indicating the routing part if 
it exactly matches the whole partition key.

Most users won't be confused and won't need to know about the distinction 
unless they explicitly use the functionality. It's okay to hide it, it being a 
relatively advanced opt-in feature. 

 Support Routing Key as part of Partition Key
 

 Key: CASSANDRA-9231
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9231
 Project: Cassandra
  Issue Type: Wish
  Components: Core
Reporter: Matthias Broecheler
 Fix For: 3.x




[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-04-24 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510753#comment-14510753
 ] 

Robert Stupp commented on CASSANDRA-9231:
-

I just want to avoid drivers having to implement the whole UDF execution 
machinery (which could be difficult for non-Java drivers ;) ). 

Drivers could possibly accept "native" functions from the client code to 
calculate the routing-key if they really need to optimize for token-aware 
routing.

 Support Routing Key as part of Partition Key
 

 Key: CASSANDRA-9231
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9231
 Project: Cassandra
  Issue Type: Wish
  Components: Core
Reporter: Matthias Broecheler
 Fix For: 3.1




[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-04-24 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510713#comment-14510713
 ] 

Robert Stupp commented on CASSANDRA-9231:
-

Using UDFs for the routing-key looks nice. But I doubt that drivers would be 
able to compute the routing-key for token-aware routing.

 Support Routing Key as part of Partition Key
 

 Key: CASSANDRA-9231
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9231
 Project: Cassandra
  Issue Type: Wish
  Components: Core
Reporter: Matthias Broecheler
 Fix For: 3.1




[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-04-24 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510729#comment-14510729
 ] 

Sylvain Lebresne commented on CASSANDRA-9231:
-

Not automagically, but it's easy enough to make drivers accept custom functions 
for token-aware routing. And I'm fine with providing a couple of native functions 
for the most common cases (like using only the ith component of the partition 
key, as in the description), which drivers could recognize automagically if they 
want to. That would still leave the ability to do more complex things.
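
As a rough illustration of what "accepting custom functions for token-aware routing" could look like on the driver side, here is a minimal Python sketch. It is not any real driver's API: {{TokenAwareRouter}}, {{whole_key}}, {{ith_component}} and the MD5-based stand-in for murmur3 are all hypothetical names and assumptions.

```python
import bisect
import hashlib
import struct
from typing import Callable, Dict, Sequence

RoutingFn = Callable[[Sequence[int]], bytes]


def stand_in_hash(data: bytes) -> int:
    """Stand-in for murmur3 (assumption): signed 64-bit value taken from MD5."""
    return struct.unpack(">q", hashlib.md5(data).digest()[:8])[0]


def whole_key(components: Sequence[int]) -> bytes:
    """Default: every partition key component contributes to the token."""
    return b"".join(struct.pack(">i", c) for c in components)


def ith_component(i: int) -> RoutingFn:
    """Hypothetical 'native' function: route on the i-th component only."""
    return lambda components: struct.pack(">i", components[i])


class TokenAwareRouter:
    """Hypothetical driver-side helper that maps a partition key to its owner."""

    def __init__(self, node_tokens: Dict[str, int], routing_fn: RoutingFn = whole_key):
        self._ring = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._ring]
        self._routing_fn = routing_fn

    def owner(self, components: Sequence[int]) -> str:
        token = stand_in_hash(self._routing_fn(components))
        # First node whose token is >= the key's token, wrapping around the ring.
        index = bisect.bisect_left(self._tokens, token) % len(self._ring)
        return self._ring[index][1]
```

A client that knows the table routes on the first component only would construct {{TokenAwareRouter(node_tokens, ith_component(0))}}; partition keys {{(1, 2)}} and {{(1, 99)}} then resolve to the same owner.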

 Support Routing Key as part of Partition Key
 

 Key: CASSANDRA-9231
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9231
 Project: Cassandra
  Issue Type: Wish
  Components: Core
Reporter: Matthias Broecheler
 Fix For: 3.1




[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-04-24 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510656#comment-14510656
 ] 

Sylvain Lebresne commented on CASSANDRA-9231:
-

If we do this, I have a strong preference for exposing it as a way to define a 
custom function for computing the token. So the example above would be written 
something like:
{noformat}
CREATE FUNCTION myCustomHash(a int, b int) RETURNS bigint AS 'return murmur3(a)';

CREATE TABLE foo (
a int,
b int,
c int,
d int,
PRIMARY KEY ((a, b), c)
) WITH tokenizer=myCustomHash;
{noformat}

That's imo more generic, and I don't like adding a notion of routing key when 
we already have primary key and partition key, which is enough keys (and 
internally the routing key is really just the token, so there's no point in 
having a new notion).
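
The effect of such a token function can be sketched in a few lines of Python. This is only an illustration of the idea, not the proposed implementation: the MD5-based {{stand_in_murmur3}} replaces the real murmur3 hash, and {{default_token}} is an assumed simplification of how both partition key columns feed the token today.

```python
import hashlib
import struct


def stand_in_murmur3(data: bytes) -> int:
    """Stand-in for murmur3 (assumption): signed 64-bit value taken from MD5."""
    return struct.unpack(">q", hashlib.md5(data).digest()[:8])[0]


def default_token(a: int, b: int) -> int:
    # Simplified view of today's behaviour: both partition key columns are hashed.
    return stand_in_murmur3(struct.pack(">ii", a, b))


def my_custom_hash(a: int, b: int) -> int:
    # Mirrors myCustomHash above: b is accepted as an argument but ignored.
    return stand_in_murmur3(struct.pack(">i", a))
```

Under this sketch, {{my_custom_hash(1, 2)}} and {{my_custom_hash(1, 99)}} return the same token, so both partitions would be placed together, while {{default_token}} gives them unrelated tokens.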

 Support Routing Key as part of Partition Key
 

 Key: CASSANDRA-9231
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9231
 Project: Cassandra
  Issue Type: Wish
  Components: Core
Reporter: Matthias Broecheler
 Fix For: 3.1




[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-04-24 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511049#comment-14511049
 ] 

Aleksey Yeschenko commented on CASSANDRA-9231:
--

bq. Except that it's not at all the same result as what I described.

Can you give me an example then? Ideally something that the driver would still 
be able to understand.

 Support Routing Key as part of Partition Key
 

 Key: CASSANDRA-9231
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9231
 Project: Cassandra
  Issue Type: Wish
  Components: Core
Reporter: Matthias Broecheler
 Fix For: 3.1




[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-04-24 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511011#comment-14511011
 ] 

Aleksey Yeschenko commented on CASSANDRA-9231:
--

You'd be able to use more than one component of the partition key. Using the 
originally proposed syntax (strictly as an example) you could have {{PRIMARY 
KEY (([a, b, c], d), e, f)}}. Ultimately, for non-routing purposes, the order 
of the columns in the partition key doesn't matter at all, and the user has full 
control, so they can reorder/split them as necessary and get the same result.

 Support Routing Key as part of Partition Key
 

 Key: CASSANDRA-9231
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9231
 Project: Cassandra
  Issue Type: Wish
  Components: Core
Reporter: Matthias Broecheler
 Fix For: 3.1




[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-04-24 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511017#comment-14511017
 ] 

Sylvain Lebresne commented on CASSANDRA-9231:
-

bq. so they can reorder/split them as necessary and get the same result

Except that it's not at all the same result as the one I described.

 Support Routing Key as part of Partition Key
 

 Key: CASSANDRA-9231
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9231
 Project: Cassandra
  Issue Type: Wish
  Components: Core
Reporter: Matthias Broecheler
 Fix For: 3.1




[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-04-24 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510979#comment-14510979
 ] 

Aleksey Yeschenko commented on CASSANDRA-9231:
--

I have an equally strong preference not to overcomplicate and overgeneralise 
this: just dedicate part of the partition key to routing rather than using 
functions.

We don't have to call it a 'routing key', though, and I'm open to other syntax 
suggestions.



[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-04-24 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510995#comment-14510995
 ] 

Sylvain Lebresne commented on CASSANDRA-9231:
-

bq. I have an equally strong preference to not overcomplicate and 
overgeneralise this

Well, I disagree that it's *over*generalization; it's just generalization, and 
generalization doesn't always mean more complexity. In fact, it's imo simpler to 
use functions than to come up with a new custom concept. Perhaps more 
importantly, I think that something potentially *more* useful than just using 
one component of the partition key would be to use both components, but use the 
first one only for the first half of the token and the 2nd one for the 2nd 
half. The result would be that partitions having the same first component are 
on the same replica or some small number of replicas, while still keeping some 
scaling properties if you have very many partitions with the same first 
component.
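
The split-token idea above could look roughly like the following Python sketch. 
It is only an illustration under stated assumptions: the hash is a SHA-256 
stand-in rather than murmur3, the token is treated as an unsigned 64-bit value 
for simplicity (Murmur3Partitioner tokens are signed), and h64 and split_token 
are hypothetical names.

```python
import hashlib

def h64(value):
    """Stand-in unsigned 64-bit hash of one int column (not murmur3)."""
    digest = hashlib.sha256(value.to_bytes(4, "big", signed=True)).digest()
    return int.from_bytes(digest[:8], "big")

def split_token(a, b):
    """Build a 64-bit token: high 32 bits from a, low 32 bits from b."""
    high = h64(a) >> 32         # first half of the token comes from a
    low = h64(b) & 0xFFFFFFFF   # second half of the token comes from b
    return (high << 32) | low
```

All partitions with the same first component then fall into one contiguous 
token range of width 2^32, so they stay on the same replica set or a few 
neighbouring ones, while the second component still spreads them out within 
that range.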



[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-04-23 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509551#comment-14509551
 ] 

Aleksey Yeschenko commented on CASSANDRA-9231:
--

Additionally, when/if we have CASSANDRA-8857, we'd be able to meaningfully 
batch partition lookups to different tables so long as the routing key is the 
same, in a single roundtrip, relying on their co-locality.



[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

2015-04-23 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509543#comment-14509543
 ] 

Aleksey Yeschenko commented on CASSANDRA-9231:
--

I've got a couple more use cases for the feature.

If we implement this, we'll start grouping Mutation objects by {keyspace, 
routing key} tuples instead of {keyspace, partition key} tuples, as we do now. 
This means that for updates to tables that share the same routing key but 
differ in the remaining partition key columns, we'd now be able to put them in 
the same Mutation, and add both updates atomically to the commitlog.

This would allow us to get batchlog functionality basically for free for the 
updates that share the same routing key, be it the same table or several 
different ones.
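
To make the grouping change concrete, here is a minimal Python sketch. It is 
not Cassandra's Mutation code: the Update type, its fields, and both grouping 
functions are hypothetical and only contrast the two grouping keys.

```python
from collections import defaultdict
from typing import NamedTuple

class Update(NamedTuple):
    keyspace: str
    table: str
    routing_key: tuple            # routing columns of the partition key
    rest_of_partition_key: tuple  # remaining partition key columns
    values: dict

def group_by_partition_key(updates):
    """Grouping as done now: one group per {keyspace, full partition key}."""
    groups = defaultdict(list)
    for u in updates:
        groups[(u.keyspace, u.routing_key + u.rest_of_partition_key)].append(u)
    return dict(groups)

def group_by_routing_key(updates):
    """Proposed grouping: one group per {keyspace, routing key}."""
    groups = defaultdict(list)
    for u in updates:
        groups[(u.keyspace, u.routing_key)].append(u)
    return dict(groups)

updates = [
    Update("ks", "foo", (1,), (10,), {"d": 100}),
    Update("ks", "foo", (1,), (11,), {"d": 101}),
    Update("ks", "bar", (1,), (20,), {"d": 200}),
    Update("ks", "foo", (2,), (10,), {"d": 300}),
]
```

Grouping these four updates by partition key yields four groups, while grouping 
by routing key yields two, and the three updates with routing key (1,) end up 
together even though they touch two tables.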
