[ 
https://issues.apache.org/jira/browse/CASSANDRA-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-1042:
--------------------------------------

    Attachment: 1042-test.txt

It seems that the root of the problem is, as Jeremy said, that rows are returned 
in token order instead of ring order.  If, in Joost's original example, the 
rows were returned in the order

99079589977253916124855502156832923443
144992942750327304334463589818972416113
166860289390734216023086131251507064403
16955237001963240173058271559858726497
40670782773005619916245995581909898190

then the extra query for the range (40670782773005619916245995581909898190, 
53193025635115934196771903670925341736]

would return nothing, which is the desired result.
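
To make the two orderings concrete, here is a minimal Python sketch (not 
Cassandra code).  It assumes RandomPartitioner-style tokens on a ring of size 
2**127; ring_distance, ring_order and in_range are hypothetical helper names 
made up for this illustration.

```python
# Assumption: tokens live on a ring of size 2**127 (RandomPartitioner-style).
RING_SIZE = 2 ** 127

START = 53193025635115934196771903670925341736  # split start == split end
TOKENS = [
    16955237001963240173058271559858726497,
    40670782773005619916245995581909898190,
    99079589977253916124855502156832923443,
    144992942750327304334463589818972416113,
    166860289390734216023086131251507064403,
]


def ring_distance(start, token):
    """Clockwise distance from start to token; a distance of 0 counts as a full lap."""
    d = (token - start) % RING_SIZE
    return d if d != 0 else RING_SIZE


def ring_order(tokens, start):
    """Sort tokens in the order they are met when walking the ring from start."""
    return sorted(tokens, key=lambda t: ring_distance(start, t))


def in_range(token, left, right):
    """True if token lies in the wrapping range (left, right]."""
    return ring_distance(left, token) <= ring_distance(left, right)


ordered = ring_order(TOKENS, START)
# Follow-up query: (last token returned, unchanged end token]
leftover = [t for t in TOKENS if in_range(t, ordered[-1], START)]
```

Under these assumptions, `ordered` matches the list above, and `leftover` is 
empty because no row falls between the last token returned and the end token.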

However, I am unable to reproduce this behavior in a unit test (against the 0.6 
branch, attached).  Trying Jeremy's data dir (also against the 0.6 branch), I get 
"java.io.IOException: Found system table files, but they couldn't be loaded. 
Did you change the partitioner?" 

> ColumnFamilyRecordReader returns duplicate rows
> -----------------------------------------------
>
>                 Key: CASSANDRA-1042
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1042
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>    Affects Versions: 0.6
>            Reporter: Joost Ouwerkerk
>            Assignee: Jeremy Hanna
>             Fix For: 0.6.4
>
>         Attachments: 1042-0_6.txt, 1042-test.txt, 
> Cassandra-1042-0_6-branch.patch.txt, CASSANDRA-1042-trunk.patch.txt, 
> cassandra.tar.gz, duplicate_keys.rtf
>
>
> There's a bug in ColumnFamilyRecordReader that appears when processing a 
> single split (which happens in most tests that have a small number of rows), 
> and potentially in other cases.  When the start and end tokens of the split 
> are equal, duplicate rows can be returned.
> Example with 5 rows:
> token (start and end) = 53193025635115934196771903670925341736
> Tokens returned by first get_range_slices iteration (all 5 rows):
>  16955237001963240173058271559858726497
>  40670782773005619916245995581909898190
>  99079589977253916124855502156832923443
>  144992942750327304334463589818972416113
>  166860289390734216023086131251507064403
> Tokens returned by next iteration (start token is the last token from the
> previous iteration, end token is unchanged)
>  16955237001963240173058271559858726497
>  40670782773005619916245995581909898190
> Tokens returned by final iteration (start token is the last token from the
> previous iteration, end token is unchanged)
>  [] (empty)
> In this example, the mapper has processed 7 rows in total, 2 of which
> were duplicates.
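
The iteration described above can be replayed with a small Python simulation. 
This is only a sketch of the paging loop as the description states it, not the 
actual ColumnFamilyRecordReader code: get_range_slices here is a hypothetical 
in-memory stand-in that returns matching rows in plain token order, and the 
batch size of 100 is an arbitrary assumption.

```python
from collections import Counter

START = END = 53193025635115934196771903670925341736
TOKENS = [
    16955237001963240173058271559858726497,
    40670782773005619916245995581909898190,
    99079589977253916124855502156832923443,
    144992942750327304334463589818972416113,
    166860289390734216023086131251507064403,
]


def in_range(token, left, right):
    """True if token is in (left, right]; the range wraps when left >= right."""
    if left < right:
        return left < token <= right
    # Wrapping range; left == right covers the whole ring.
    return token > left or token <= right


def get_range_slices(left, right, count):
    """Hypothetical stand-in for the server call: rows in plain token order."""
    return sorted(t for t in TOKENS if in_range(t, left, right))[:count]


def read_split(start, end, batch_size=100):
    """Page through the split until a batch comes back empty."""
    rows = []
    while True:
        batch = get_range_slices(start, end, batch_size)
        if not batch:
            return rows
        rows.extend(batch)
        start = batch[-1]  # next start token is the last token returned


rows = read_split(START, END)
duplicates = sorted(t for t, n in Counter(rows).items() if n > 1)
```

With these inputs the loop yields 7 rows, and `duplicates` holds the two tokens 
below the start token, matching the numbers in the description.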

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
