[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013993#comment-15013993
 ] 

Wilfred Spiegelenburg commented on MAPREDUCE-6549:
--------------------------------------------------

I have been able to generate a compressed file which shows a same record 
duplication as was shown in the uncompressed processing. The code however 
behaves completely different in the two cases since we do not have the same 
kind of buffer filling process. I am still trying to fix the compressed code 
without breaking the uncompressed code.

I should have a fix for both cases in a day or two.

> multibyte delimiters with LineRecordReader cause duplicate records
> ------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6549
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv1, mrv2
>    Affects Versions: 2.7.2
>            Reporter: Dustin Cote
>            Assignee: Wilfred Spiegelenburg
>         Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch
>
>
> LineRecorderReader currently produces duplicate records under certain 
> scenarios such as:
> 1) input string: "abc+++def++ghi++" 
> delimiter string: "+++" 
> test passes with all sizes of the split 
> 2) input string: "abc++def+++ghi++" 
> delimiter string: "+++" 
> test fails with a split size of 4 
> 2) input string: "abc+++def++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 5 
> 3) input string "abc+++defg++hij++" 
> delimiter string: "++" 
> test fails with a split size of 4 
> 4) input string "abc++def+++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 9 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to