[ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wilfred Spiegelenburg updated MAPREDUCE-6549:
---------------------------------------------
Status: Open (was: Patch Available)
I tried the change from the patch and it fails the current tests.
The patch changes one test (TestLineRecordReader.java), but we have two
versions of that test. The mapred version is unchanged and now fails. The
mapreduce version passes, but as soon as I change the delimiter back it fails
as well. That means the change does not fix the issue.
The patch also brings the two tests out of sync, which is not correct.
> multibyte delimiters with LineRecordReader cause duplicate records
> ------------------------------------------------------------------
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.7.2
> Reporter: Dustin Cote
> Assignee: Dustin Cote
> Attachments: MAPREDUCE-6549-1.patch
>
>
> LineRecordReader currently produces duplicate records under certain
> scenarios such as:
> 1) input string: "abc+++def++ghi++"
> delimiter string: "+++"
> test passes with all sizes of the split
> 2) input string: "abc++def+++ghi++"
> delimiter string: "+++"
> test fails with a split size of 4
> 3) input string: "abc+++def++ghi++"
> delimiter string: "++"
> test fails with a split size of 5
> 4) input string: "abc+++defg++hij++"
> delimiter string: "++"
> test fails with a split size of 4
> 5) input string: "abc++def+++ghi++"
> delimiter string: "++"
> test fails with a split size of 9
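
For reference when checking the scenarios above, here is a minimal sketch of
the records a reader should return for each input if no record is duplicated
or dropped. ReferenceSplitter is a hypothetical helper, not part of Hadoop. It
assumes the delimiter is matched left to right at the earliest position,
matches never overlap, and no empty record is emitted after a trailing
delimiter. It does not model input splits at all; it only gives the expected
result that every split size should reproduce.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Hadoop): computes the records a correct
// reader is expected to return, independent of the split size.
class ReferenceSplitter {

    // Scans left to right and cuts at the earliest delimiter match.
    // Assumption: matches never overlap and a trailing delimiter does not
    // produce an extra empty record.
    static List<String> split(String input, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int start = 0;
        while (start < input.length()) {
            int end = input.indexOf(delimiter, start);
            if (end < 0) {
                // No further delimiter: the rest of the input is the last record.
                records.add(input.substring(start));
                break;
            }
            records.add(input.substring(start, end));
            start = end + delimiter.length();
        }
        return records;
    }

    public static void main(String[] args) {
        // The five input/delimiter pairs from the issue description.
        String[][] cases = {
            {"abc+++def++ghi++", "+++"},
            {"abc++def+++ghi++", "+++"},
            {"abc+++def++ghi++", "++"},
            {"abc+++defg++hij++", "++"},
            {"abc++def+++ghi++", "++"},
        };
        for (String[] c : cases) {
            System.out.println(c[0] + " / " + c[1] + " -> " + split(c[0], c[1]));
        }
    }
}
```

A test that runs the real reader over every split size could compare the
concatenated records from all splits against this list. Any extra entry would
then show up as a duplicate record.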