[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

Dustin Cote (JIRA) Sat, 14 Nov 2015 13:40:04 -0800

     [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dustin Cote updated MAPREDUCE-6549:
-----------------------------------
    Attachment: MAPREDUCE-6549-1.patch

Attaching a patch that removes the attempt to read the last incomplete 
record of an input and changes the tests to cover a more generic, imperfect 
scenario.  I'll add more tests if review deems it necessary.  As far as I 
am aware, we should drop an incomplete record at the end of the input.  With 
this patch that now happens, and the correct number of records comes up in 
the middle of the input (where previously there were duplicates).

> multibyte delimiters with LineRecordReader cause duplicate records
> ------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6549
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.7.2
>            Reporter: Dustin Cote
>            Assignee: Dustin Cote
>         Attachments: MAPREDUCE-6549-1.patch
>
>
> LineRecordReader currently produces duplicate records under certain 
> scenarios such as:
> 1) input string: "abc+++def++ghi++" 
> delimiter string: "+++" 
> test passes with all sizes of the split 
> 2) input string: "abc++def+++ghi++" 
> delimiter string: "+++" 
> test fails with a split size of 4 
> 3) input string: "abc+++def++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 5 
> 4) input string: "abc+++defg++hij++" 
> delimiter string: "++" 
> test fails with a split size of 4 
> 5) input string: "abc++def+++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 9 
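
For reference, the record list that a split-size-independent reader should
produce for the inputs quoted above can be sketched without Hadoop at all.
This is only a minimal sketch under stated assumptions: it works on a whole
String rather than on byte splits, it uses naive leftmost delimiter matching,
and the names DelimiterSplitSketch and split are hypothetical, not part of
LineRecordReader or the attached patch. Passing keepTrailing=false mirrors the
"drop an incomplete record at the end of the input" behavior described in the
comment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Hadoop code: computes the records expected from
// an input when read as a single piece (no splits involved).
final class DelimiterSplitSketch {

    // Returns every record terminated by `delimiter`, scanning left to right.
    // Text after the last delimiter is an incomplete record; it is returned
    // only when keepTrailing is true.
    static List<String> split(String input, String delimiter, boolean keepTrailing) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int pos = 0;
        while (true) {
            int hit = input.indexOf(delimiter, pos); // leftmost match (assumption)
            if (hit < 0) {
                break;
            }
            records.add(input.substring(pos, hit));
            pos = hit + delimiter.length();
        }
        if (keepTrailing && pos < input.length()) {
            records.add(input.substring(pos));
        }
        return records;
    }

    public static void main(String[] args) {
        // Input "abc+++def++ghi++" with delimiter "++"
        System.out.println(split("abc+++def++ghi++", "++", false)); // [abc, +def, ghi]
        // Same input with delimiter "+++": only "abc" is a complete record
        System.out.println(split("abc+++def++ghi++", "+++", false)); // [abc]
    }
}
```

A test that iterates over split sizes could compare the concatenation of the
records from all splits against this list; any extra entry would indicate a
duplicate of the kind reported here.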



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
