[jira] Commented: (CASSANDRA-1042) ColumnFamilyRecordReader returns duplicate rows

Jeremy Hanna (JIRA) Tue, 06 Jul 2010 13:19:17 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885669#action_12885669
 ]


Jeremy Hanna commented on CASSANDRA-1042:
-----------------------------------------

Sorry if this is redundant, but I'm pasting in a thought we had a while ago 
that motivated the attached patch.  If we make sure that the splits are always 
in ring order and never wrap, the problem is solved.

"Token ranges may also wrap -- that is, the end token may be less than the 
start one. Thus, a range from keyX to keyX is a one-element range, but a range 
from tokenY to tokenY is the full ring."

The quoted text does not say what order the results will be in when the range 
wraps.  Some clients assume natural order, while the Hadoop client 
interactions assume ring order.

For example:
-- a list of tokens (1,2,3,4,5,6,7,8,9)
-- a get_range_slice call with start_token = 5, end_token = 5
Natural order means token order from start to finish, so the call returns 
(1,2,3,4,5,6,7,8,9).
Ring order (or wrapping order) means the results begin at the start token and 
wrap around, so the call returns (5,6,7,8,9,1,2,3,4).
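
To make the two orderings concrete, here is a minimal Python sketch of the 
example above.  The function names natural_order and ring_order are made up 
for illustration and are not part of Cassandra or its Thrift API; the sketch 
also assumes the start token itself comes first in ring order, as in the 
example.

```python
def natural_order(tokens, start_token):
    """Full-ring result in plain ascending token order (start_token is ignored)."""
    return sorted(tokens)


def ring_order(tokens, start_token):
    """Full-ring result beginning at start_token and wrapping around the ring."""
    ordered = sorted(tokens)
    # Tokens at or after the start come first, then the wrapped-around remainder.
    from_start = [t for t in ordered if t >= start_token]
    wrapped = [t for t in ordered if t < start_token]
    return from_start + wrapped


tokens = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(natural_order(tokens, 5))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(ring_order(tokens, 5))     # [5, 6, 7, 8, 9, 1, 2, 3, 4]
```

A client that pages by taking the last element of a result as its next start 
token sees 9 in the first case and 4 in the second, which is why the two 
assumptions lead to different behavior.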

> ColumnFamilyRecordReader returns duplicate rows
> -----------------------------------------------
>
>                 Key: CASSANDRA-1042
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1042
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>    Affects Versions: 0.6
>            Reporter: Joost Ouwerkerk
>            Assignee: Jeremy Hanna
>             Fix For: 0.6.4
>
>         Attachments: 1042-0_6.txt, Cassandra-1042-0_6-branch.patch.txt, 
> CASSANDRA-1042-trunk.patch.txt, cassandra.tar.gz, duplicate_keys.rtf
>
>
> There's a bug in ColumnFamilyRecordReader that appears when processing a 
> single split (which happens in most tests that have a small number of rows), 
> and potentially in other cases.  When the start and end tokens of the split 
> are equal, duplicate rows can be returned.
> Example with 5 rows:
> token (start and end) = 53193025635115934196771903670925341736
> Tokens returned by first get_range_slices iteration (all 5 rows):
>  16955237001963240173058271559858726497
>  40670782773005619916245995581909898190
>  99079589977253916124855502156832923443
>  144992942750327304334463589818972416113
>  166860289390734216023086131251507064403
> Tokens returned by next iteration (first token is last token from
> previous, end token is unchanged)
>  16955237001963240173058271559858726497
>  40670782773005619916245995581909898190
> Tokens returned by final iteration  (first token is last token from
> previous, end token is unchanged)
>  [] (empty)
> In this example, the mapper has processed 7 rows in total, 2 of which
> were duplicates.
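
The sequence in the description above can be reproduced with a small Python 
model.  This is only a sketch under stated assumptions, not Cassandra's or the 
record reader's actual code: in_range, get_range_slices and read_split are 
hypothetical stand-ins, and the range is assumed to exclude the start token 
and include the end token, which is inferred from the quoted output.

```python
TOKENS = [
    16955237001963240173058271559858726497,
    40670782773005619916245995581909898190,
    99079589977253916124855502156832923443,
    144992942750327304334463589818972416113,
    166860289390734216023086131251507064403,
]
SPLIT_TOKEN = 53193025635115934196771903670925341736  # start and end of the split


def in_range(token, start, end):
    """Assumed semantics: (start, end], wrapping when start >= end."""
    if start < end:
        return start < token <= end
    return token > start or token <= end  # start == end covers the full ring


def get_range_slices(tokens, start, end, ring_order):
    """Hypothetical server model returning matching tokens in the chosen order."""
    matches = sorted(t for t in tokens if in_range(t, start, end))
    if ring_order:
        # Tokens after the start come first, then the wrapped-around part.
        matches = [t for t in matches if t > start] + [t for t in matches if t <= start]
    return matches


def read_split(tokens, start, end, ring_order):
    """Hypothetical reader model: page by restarting from the last token seen."""
    rows = []
    while True:
        batch = get_range_slices(tokens, start, end, ring_order)
        if not batch:
            return rows
        rows.extend(batch)
        if batch[-1] == end:  # guard for this sketch: the end of the split was reached
            return rows
        start = batch[-1]


natural = read_split(TOKENS, SPLIT_TOKEN, SPLIT_TOKEN, ring_order=False)
ring = read_split(TOKENS, SPLIT_TOKEN, SPLIT_TOKEN, ring_order=True)
print(len(natural), len(set(natural)))  # 7 5
print(len(ring), len(set(ring)))        # 5 5
```

With natural ordering the model returns 7 rows, 2 of them duplicates, matching 
the description; with ring ordering the last token of the first batch is the 
one just before the end token, so the next call comes back empty and each row 
is seen once.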

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
