Returning split length of 0 confuses Pig
----------------------------------------
Key: CASSANDRA-2184
URL: https://issues.apache.org/jira/browse/CASSANDRA-2184
Project: Cassandra
Issue Type: Bug
Components: Hadoop
Affects Versions: 0.6
Reporter: Jonathan Ellis
Priority: Minor
Fix For: 0.7.3
Matt Kennedy reports on the user list,
bq. There is a new feature in Pig 0.8 that will try to reduce the number of
splits used to speed up the whole job. Since the ColumnFamilyInputFormat lists
the input size as zero, this feature eliminates all of the splits except for
one.
bq. The workaround is to disable this feature for jobs that use
CassandraStorage by setting -Dpig.splitCombination=false in the pig_cassandra
script.
bq. However, we wanted to keep splitCombination on because it is a useful
optimization for a lot of our use cases, so I went digging for the least
intrusive way to keep the split combiner on, but also prevent it from combining
splits that read from Cassandra. My solution, which you are welcome to
critique, is to change line 65 of
http://svn.apache.org/viewvc/cassandra/trunk/src/java/org/apache/cassandra/hadoop/ColumnFamilySplit.java
such that it returns Long.MAX_VALUE instead of zero.
I looked into actually returning the number of keys in the split, but the Hadoop
javadoc says "Get the size of the split, so that the input splits can be sorted
by size". Since our splits should be very close in size, it doesn't seem worth
an extra round trip to the host servers to get more accurate numbers.
Returning MAX_VALUE seems good enough.
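
To illustrate why the reported length matters, here is a minimal sketch of a size-based split combiner. It is a toy model, not Pig's actual combination algorithm: the class name SplitCombinerSketch, the combine method, and the 128-byte limit are all hypothetical. The lengths passed in stand for whatever each split's getLength() reports.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: this is NOT Pig's real split-combination code.
// It just shows how a size-based combiner reacts to reported lengths.
final class SplitCombinerSketch {

    // Greedily packs consecutive splits into groups whose reported total
    // length stays within maxCombinedSize (assumed non-negative).
    static List<List<Long>> combine(List<Long> reportedLengths, long maxCombinedSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long length : reportedLengths) {
            // Start a new group when this split no longer fits in the current one.
            if (!current.isEmpty() && length > maxCombinedSize - currentSize) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(length);
            // Saturate at the limit instead of adding, so Long.MAX_VALUE cannot overflow.
            currentSize = (length > maxCombinedSize - currentSize)
                    ? maxCombinedSize
                    : currentSize + length;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        long limit = 128L; // hypothetical combined-size limit

        // Splits that report length 0 all fit into a single group.
        List<Long> zeroLengths = Arrays.asList(0L, 0L, 0L, 0L);
        System.out.println(combine(zeroLengths, limit).size()); // prints 1

        // Splits that report Long.MAX_VALUE never fit together.
        List<Long> maxLengths = Arrays.asList(
                Long.MAX_VALUE, Long.MAX_VALUE, Long.MAX_VALUE, Long.MAX_VALUE);
        System.out.println(combine(maxLengths, limit).size()); // prints 4
    }
}
```

In this model, zero-length splits collapse into one group, which matches the behavior reported above, while splits reporting Long.MAX_VALUE each stay separate.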