[ 
https://issues.apache.org/jira/browse/CASSANDRA-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229701#comment-14229701
 ] 

Piotr Kołaczkowski edited comment on CASSANDRA-7688 at 12/1/14 12:01 PM:
-------------------------------------------------------------------------

It would also be nice to know the average partition size in a given table, 
both in bytes and in number of CQL rows. This would help in setting an 
appropriate fetch.size. Additionally, the current split generation API does 
not allow setting the split size in terms of data size in bytes or number of 
CQL rows, only in number of partitions. The number of partitions doesn't make 
a good default, as partition sizes vary greatly and are extremely use-case 
dependent. So please don't just copy the current describe_splits_ex 
functionality to the new driver, but *improve it*. 
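
To illustrate the fetch.size point, here is a minimal sketch of how a client 
might derive a page size in rows from the two averages requested above. The 
function name, its parameters, the 1 MiB page target and the clamping bounds 
are all assumptions made for illustration; none of them is an existing 
Cassandra or driver API.

```python
def fetch_size_for(mean_partition_bytes, mean_rows_per_partition,
                   target_page_bytes=1 << 20, min_rows=10, max_rows=10_000):
    """Pick a fetch size (in CQL rows) so that one page is roughly
    target_page_bytes. All inputs are estimates; the bounds are arbitrary."""
    # Average row size, guarded against zero or missing estimates.
    mean_row_bytes = max(1.0, mean_partition_bytes / max(1.0, mean_rows_per_partition))
    rows = int(target_page_bytes // mean_row_bytes)
    # Clamp so that tiny or huge rows still give a usable page size.
    return max(min_rows, min(max_rows, rows))
```

With only a partition count available, a client cannot make this kind of 
calculation at all, which is why both the byte and the row averages matter.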

We really don't need the driver / Cassandra to do the splitting for us. Instead 
we need to know:

1. estimate of total amount of data in the table in bytes
2. estimate of total number of CQL rows in the table
3. estimate of total number of partitions in the table

We're interested both in totals (whole cluster; logical sizes, i.e. without 
replicas) and in a breakdown by token range per node (physical sizes, 
including replicas).
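
As a rough sketch of what client-side splitting could look like once such 
estimates are available: the `RangeEstimate` record and `plan_splits` 
function below are hypothetical names, the per-range fields are assumed 
rather than taken from any existing system table, and token ranges are 
assumed not to wrap around the ring.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeEstimate:
    """Hypothetical size estimate for one token range (not a real API)."""
    start_token: int   # exclusive lower bound
    end_token: int     # inclusive upper bound; assumed > start_token
    partitions: int    # estimated number of partitions in the range
    data_bytes: int    # estimated amount of data in the range, in bytes


def plan_splits(estimates, target_split_bytes):
    """Cut each token range into sub-ranges of roughly target_split_bytes,
    assuming data is spread evenly over the tokens of a range."""
    splits = []
    for est in estimates:
        width = est.end_token - est.start_token
        n = max(1, math.ceil(est.data_bytes / target_split_bytes))
        n = min(n, width)  # cannot cut finer than one token per split
        for i in range(n):
            lo = est.start_token + width * i // n
            hi = est.start_token + width * (i + 1) // n
            splits.append((lo, hi))
    return splits
```

The same estimates could drive splitting by number of CQL rows or by number 
of partitions instead; the point is that the client chooses the unit.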

Note that this information is useful not just for Spark/Hadoop split 
generation, but also for things like the SparkSQL optimizer, so that it knows 
how much data it will have to process.

The next step would be to provide column data histograms to guide predicate 
selectivity estimation. 


> Add data sizing to a system table
> ---------------------------------
>
>                 Key: CASSANDRA-7688
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7688
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Jeremiah Jordan
>             Fix For: 2.1.3
>
>
> Currently you can't implement something similar to describe_splits_ex purely 
> from a native protocol driver.  
> https://datastax-oss.atlassian.net/browse/JAVA-312 is open to expose easily 
> getting ownership information to a client in the java-driver.  But you still 
> need the data sizing part to get splits of a given size.  We should add the 
> sizing information to a system table so that native clients can get to it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)