Hadoop output SlicePredicate is slow and doesn't work as intended
-----------------------------------------------------------------

                 Key: CASSANDRA-1246
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1246
             Project: Cassandra
          Issue Type: Bug
          Components: Hadoop
    Affects Versions: 0.7
            Reporter: Jonathan Ellis
            Assignee: Jonathan Ellis
             Fix For: 0.7


The output SlicePredicate is only used to check that no data already exists 
in the range that we're going to be writing to.  That check is 

(a) slow, since it performs get_range_slices across the entire key range, 
meaning we'll hit every node in the cluster if there is no data (which is 
supposed to be the normal case)
(b) wrong, since it appears to be intended to use keyList.size to allow data in 
column X to not interfere with an output to column Y, but that is not how 
get_range_slices works: if you have data (or even a tombstone) in any column, 
you'll get the key back in your result list.  So to do this correctly you would 
have to scan every key and check the list of columns returned, which will be 
prohibitively slow when data actually exists in other columns.
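
To make (b) concrete, here is a minimal sketch in Java. It is a toy in-memory model, not the Cassandra Thrift API: the class RangeSliceSketch, the method names, and the use of a null value to stand in for a tombstone are all assumptions made for illustration. It only mimics the behaviour described above, where a key comes back even when none of its columns match the predicate.

```java
import java.util.*;

// Toy model only: NOT the real get_range_slices call or its types.
class RangeSliceSketch {

    // rows: row key -> (column name -> value); a null value stands in for a tombstone.
    // Returns every key in the range, each with the live columns matching the predicate.
    static Map<String, List<String>> rangeSlice(Map<String, Map<String, String>> rows,
                                                Set<String> wantedColumns) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> row : rows.entrySet()) {
            List<String> live = new ArrayList<>();
            for (Map.Entry<String, String> col : row.getValue().entrySet()) {
                if (wantedColumns.contains(col.getKey()) && col.getValue() != null) {
                    live.add(col.getKey());
                }
            }
            // The key is returned even when `live` is empty.
            result.put(row.getKey(), live);
        }
        return result;
    }

    // The check as apparently intended: "no keys back means no data".
    static boolean looksEmptyByKeyCount(Map<String, List<String>> slice) {
        return slice.isEmpty();
    }

    // What a correct check needs: inspect the columns of every returned key.
    static boolean isEmptyByColumnScan(Map<String, List<String>> slice) {
        for (List<String> columns : slice.values()) {
            if (!columns.isEmpty()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> rows = new HashMap<>();
        Map<String, String> k1 = new HashMap<>();
        k1.put("X", "some value");   // live data in column X
        rows.put("k1", k1);
        Map<String, String> k2 = new HashMap<>();
        k2.put("X", null);           // tombstone in column X
        rows.put("k2", k2);

        // We intend to write column Y, so the predicate asks for Y only.
        Map<String, List<String>> slice = rangeSlice(rows, Collections.singleton("Y"));

        System.out.println(looksEmptyByKeyCount(slice)); // false: both keys come back
        System.out.println(isEmptyByColumnScan(slice));  // true: column Y holds no data
    }
}
```

In this model the key-count check reports a conflict although column Y is empty, while the column scan gets the right answer only by visiting every returned key, which is the cost (b) objects to.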


