[ https://issues.apache.org/jira/browse/CASSANDRA-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229828#comment-14229828 ]
Piotr Kołaczkowski commented on CASSANDRA-7688:
-----------------------------------------------

We only need estimates, not exact values. A 1.5x error factor is considered an awesome estimate, and a 3x factor is still fairly good.

Also note that Spark/Hadoop does many token range scans. Maybe collecting some statistics on the fly, during the scans (or during compaction), would be viable? And running a full compaction to make the statistics more accurate - why not? You need to do it anyway to get top speed when scanning data in Spark, because a full table scan does a kind of implicit compaction anyway, doesn't it?

Also, one more thing - it would be good to have those values per column (sorry for making it even harder, I know it is not an easy task). At least knowing that a column is responsible for xx% of the data in the table would make a huge difference when estimating data size, because we're not always fetching all columns and they may vary in size a lot (e.g. collections!). Some sampling on insert would probably be enough.

> Add data sizing to a system table
> ---------------------------------
>
>                 Key: CASSANDRA-7688
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7688
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Jeremiah Jordan
>             Fix For: 2.1.3
>
>
> Currently you can't implement something similar to describe_splits_ex purely
> from a native protocol driver.
> https://datastax-oss.atlassian.net/browse/JAVA-312 is open to expose easily
> getting ownership information to a client in the java-driver. But you still
> need the data sizing part to get splits of a given size. We should add the
> sizing information to a system table so that native clients can get to it.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
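To illustrate the arithmetic behind the comment above, here is a minimal sketch of how a client might turn a rough size estimate into a split count. It is not Cassandra or driver code: the function names, the per-column fractions, the 10 GiB range size, and the 64 MiB target are all hypothetical values chosen for the example. It assumes the client already has an estimated byte size for a token range and the share of data each column is responsible for.

```python
import math


def estimate_fetched_bytes(range_bytes, column_fractions, selected_columns):
    """Scale a token range's estimated size by the share of the selected columns."""
    fraction = sum(column_fractions[c] for c in selected_columns)
    return range_bytes * fraction


def split_count(estimated_bytes, target_split_bytes):
    """Number of splits so that each holds roughly target_split_bytes (at least 1)."""
    return max(1, math.ceil(estimated_bytes / target_split_bytes))


def split_count_bounds(estimated_bytes, error_factor, target_split_bytes):
    """Split counts if the true size is off by error_factor in either direction."""
    low = split_count(estimated_bytes / error_factor, target_split_bytes)
    high = split_count(estimated_bytes * error_factor, target_split_bytes)
    return low, high


# Hypothetical inputs: none of these values come from a real cluster.
range_bytes = 10 * 1024**3            # estimated size of one token range
column_fractions = {"id": 0.125, "name": 0.125, "tags": 0.75}  # "tags" is a large collection
target_split_bytes = 64 * 1024**2     # desired split size

all_columns = estimate_fetched_bytes(range_bytes, column_fractions, column_fractions)
narrow = estimate_fetched_bytes(range_bytes, column_fractions, ["id", "name"])

splits_all = split_count(all_columns, target_split_bytes)        # 160
splits_narrow = split_count(narrow, target_split_bytes)          # 40
bounds_narrow = split_count_bounds(narrow, 3, target_split_bytes)  # (14, 120)
```

With these made-up numbers, skipping the large collection column cuts the split count from 160 to 40, which is a bigger effect than the 1.5x to 3x estimation error the comment considers acceptable.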