[ 
https://issues.apache.org/jira/browse/CASSANDRA-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996630#comment-12996630
 ] 

Hudson commented on CASSANDRA-2184:
-----------------------------------

Integrated in Cassandra-0.7 #296 (See 
[https://hudson.apache.org/hudson/job/Cassandra-0.7/296/])
    Change split length from 0 to Long.MAX_VALUE
Patch by Matt Kennedy, reviewed by brandonwilliams for CASSANDRA-2184


> Returning split length of 0 confuses Pig
> ----------------------------------------
>
>                 Key: CASSANDRA-2184
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2184
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>    Affects Versions: 0.6
>            Reporter: Jonathan Ellis
>            Assignee: Brandon Williams
>            Priority: Minor
>             Fix For: 0.7.3
>
>
> Matt Kennedy reports on the user list,
> bq. There is a new feature in Pig 0.8 that will try to reduce the number of 
> splits used to speed up the whole job.  Since the ColumnFamilyInputFormat 
> lists the input size as zero, this feature eliminates all of the splits 
> except for one. 
> bq. The workaround is to disable this feature for jobs that use 
> CassandraStorage by setting -Dpig.splitCombination=false in the pig_cassandra 
> script.
> bq. However, we wanted to keep splitCombination on because it is a useful 
> optimization for a lot of our use cases, so I went digging for the least 
> intrusive way to keep the split combiner on, but also prevent it from 
> combining splits that read from Cassandra.  My solution, which you are 
> welcome to critique, is to change line 65 of 
> http://svn.apache.org/viewvc/cassandra/trunk/src/java/org/apache/cassandra/hadoop/ColumnFamilySplit.java
>  such that it returns Long.MAX_VALUE instead of zero.
> I looked into actually returning the number of keys in the split, but the 
> Hadoop javadoc says "Get the size of the split, so that the input splits can 
> be sorted by size". Since our splits should be very close in size, it doesn't 
> seem worth an extra round trip to the host servers just to get more accurate 
> numbers.  Returning MAX_VALUE seems good enough.
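
To illustrate why the reported length matters, here is a minimal, self-contained sketch. It does not use Hadoop or Pig classes: `Split`, `combine`, `splitsWithLength`, and the 128 MB threshold are hypothetical stand-ins, and the grouping logic is a simplified assumption about how a size-based combiner might behave, not Pig's actual implementation. It only shows that splits reporting a length of 0 all fit under any size threshold and collapse into one group, while splits reporting Long.MAX_VALUE are never merged.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitLengthDemo {

    /** Hypothetical stand-in for an input split; only the reported length matters here. */
    public static final class Split {
        final String name;
        final long length;

        Split(String name, long length) {
            this.name = name;
            this.length = length;
        }
    }

    /**
     * Naive size-based combiner (an assumption for illustration, not Pig's code).
     * Splits are appended to the current group while the group's total reported
     * length stays within maxCombinedSize.
     */
    public static List<List<Split>> combine(List<Split> splits, long maxCombinedSize) {
        List<List<Split>> groups = new ArrayList<>();
        List<Split> current = new ArrayList<>();
        long currentSize = 0;
        for (Split s : splits) {
            // A split that already meets the threshold is kept on its own.
            // Checking this first also avoids overflow when adding Long.MAX_VALUE.
            if (s.length >= maxCombinedSize) {
                List<Split> alone = new ArrayList<>();
                alone.add(s);
                groups.add(alone);
                continue;
            }
            // Close the current group if this split would push it past the threshold.
            if (!current.isEmpty() && currentSize + s.length > maxCombinedSize) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(s);
            currentSize += s.length;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    /** Builds `count` splits that all report the same length. */
    public static List<Split> splitsWithLength(int count, long length) {
        List<Split> splits = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            splits.add(new Split("range-" + i, length));
        }
        return splits;
    }

    public static void main(String[] args) {
        long maxCombinedSize = 128L * 1024 * 1024; // hypothetical threshold

        // Length 0: every split fits into the first group.
        System.out.println(combine(splitsWithLength(4, 0L), maxCombinedSize).size()); // prints 1

        // Length Long.MAX_VALUE: no split is ever merged with another.
        System.out.println(combine(splitsWithLength(4, Long.MAX_VALUE), maxCombinedSize).size()); // prints 4
    }
}
```

In this sketch the actual number of keys per split never enters the picture, which matches the reasoning above: the reported length only needs to be large enough to keep the combiner from merging Cassandra splits.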

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
