[ https://issues.apache.org/jira/browse/CASSANDRA-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229828#comment-14229828 ]

Piotr Kołaczkowski commented on CASSANDRA-7688:
-----------------------------------------------

We only need estimates, not exact values. An estimate within a factor of 1.5 
is excellent, and one within a factor of 3 is still fairly good. 
Also note that Spark/Hadoop do many token range scans. Maybe collecting some 
statistics on the fly, during the scans (or during compaction), would be 
viable? And why not run a full compaction to make the statistics more 
accurate? You need to do it anyway to get top speed when scanning data in 
Spark, because a full table scan does a kind of implicit compaction anyway, 
doesn't it? 
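
To illustrate why a rough estimate is enough for split planning, here is a 
minimal sketch. The function names and the idea of planning from a single 
byte estimate per table are assumptions made for illustration; they are not 
part of Cassandra or of any driver. If the estimate is within a factor f of 
the true size, the real average split size stays within a factor f of the 
target (ignoring rounding of the split count).

```python
import math


def plan_splits(estimated_bytes, target_split_bytes):
    """Return how many splits to create so that each one holds roughly
    target_split_bytes, based on an estimated (not exact) data size."""
    if target_split_bytes <= 0:
        raise ValueError("target_split_bytes must be positive")
    return max(1, math.ceil(estimated_bytes / target_split_bytes))


def real_split_size_bounds(target_split_bytes, error_factor):
    """If the size estimate is within error_factor of the true size,
    the real average split size lies between these two bounds
    (rounding of the split count is ignored)."""
    if error_factor < 1:
        raise ValueError("error_factor must be >= 1")
    return target_split_bytes / error_factor, target_split_bytes * error_factor


MIB = 1024 * 1024
splits = plan_splits(10 * 1024 * MIB, 64 * MIB)      # 10 GiB estimated
low, high = real_split_size_bounds(64 * MIB, 1.5)    # "awesome" estimate
```

With a 64 MiB target, a 1.5x error keeps real splits between about 43 MiB 
and 96 MiB, and a 3x error between about 21 MiB and 192 MiB.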

Also, one more thing: it would be good to have those values per column (sorry 
for making it even harder, I know it is not an easy task). At least knowing 
that a column is responsible for xx% of the data in the table would make a 
huge difference when estimating data size, because we're not always fetching 
all columns and they may vary in size a lot (e.g. collections!). Some 
sampling on insert would probably be enough.
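
A rough sketch of what sampling on insert could look like follows. Everything 
here is hypothetical: the ColumnSizeSampler class, the on_insert hook and the 
per-row dict of serialized sizes are invented for illustration and do not 
correspond to existing Cassandra code.

```python
import random
from collections import defaultdict


class ColumnSizeSampler:
    """Estimates each column's share of the table's data by sampling
    a fraction of inserted rows (hypothetical helper, not Cassandra code)."""

    def __init__(self, sample_rate=0.01, seed=None):
        if not 0.0 < sample_rate <= 1.0:
            raise ValueError("sample_rate must be in (0, 1]")
        self.sample_rate = sample_rate
        self._rng = random.Random(seed)
        self._bytes = defaultdict(int)  # column name -> sampled bytes

    def on_insert(self, row_sizes):
        """row_sizes maps column name -> serialized size in bytes.
        Only a random fraction of rows is recorded, to keep overhead low."""
        if self._rng.random() < self.sample_rate:
            for column, size in row_sizes.items():
                self._bytes[column] += size

    def shares(self):
        """Return column name -> estimated fraction of total data size."""
        total = sum(self._bytes.values())
        if total == 0:
            return {}
        return {column: size / total for column, size in self._bytes.items()}


# Usage: sample every row here so the result is deterministic.
sampler = ColumnSizeSampler(sample_rate=1.0)
sampler.on_insert({"id": 16, "tags": 48})
sampler.on_insert({"id": 16, "tags": 120})
column_shares = sampler.shares()
```

A client fetching only some columns could then scale the table size estimate 
by the sum of the shares of the columns it actually reads.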


> Add data sizing to a system table
> ---------------------------------
>
>                 Key: CASSANDRA-7688
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7688
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Jeremiah Jordan
>             Fix For: 2.1.3
>
>
> Currently you can't implement something similar to describe_splits_ex purely 
> from a native protocol driver.  
> https://datastax-oss.atlassian.net/browse/JAVA-312 is open to expose easily 
> getting ownership information to a client in the java-driver.  But you still 
> need the data sizing part to get splits of a given size.  We should add the 
> sizing information to a system table so that native clients can get to it.


