[
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wilfred Spiegelenburg updated MAPREDUCE-6549:
---------------------------------------------
Attachment: MAPREDUCE-6549-2.patch
The issue is related to [MAPREDUCE-6481]. That jira changed the position
calculation and made sure that full records are returned by the reader as
expected, but it did not anticipate the record duplication. The JUnit tests
also did not cover the use cases needed to discover the issue.
As far as I can trace, the problem is limited to multi-byte delimiters.
The JUnit tests for the multi-byte delimiter only take the best case scenario
into account: the input data contained the exact delimiter and no ambiguous
characters. As soon as either the delimiter or the input data is changed, a
failure is triggered. The trouble with these failures is that they do not
clearly show when and how the reader goes wrong. Analysis of the test
failures shows that only a specific combination of input data, split size and
buffer size triggers a failure.
Based on testing, the duplication of a record occurs only if all of the
following hold:
- the first character(s) of the delimiter are part of the record data, for
example:
1) the delimiter is {{\+=}} and the data contains a {{\+}} that is not
followed by {{=}}
2) the delimiter is {{\+=\+=}} and the data contains {{\+=\+}} that is not
followed by {{=}}
- the delimiter character is found at the split boundary: it is the last
character before the split ends
- a refill of the buffer is triggered to finish processing the record
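For reference, the record boundaries the reader must produce can be computed
with a simple left-to-right scan over the whole input. This is a hypothetical
stand-alone sketch (class and method names are illustrative, not Hadoop code);
note how the leftover {{\+}} in the record {{\+ghi}} is exactly the kind of
ambiguous character described above:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reference splitter (not Hadoop code): scans left to right and
// cuts a record at every first full delimiter match. This is the record
// boundary behaviour the reader must reproduce for any split or buffer size.
class RecordSplitter {
    static List<String> split(String data, String delim) {
        List<String> records = new ArrayList<>();
        int pos = 0;
        while (pos < data.length()) {
            int next = data.indexOf(delim, pos);
            if (next < 0) {                       // no more delimiters:
                records.add(data.substring(pos)); // the rest is one record
                break;
            }
            records.add(data.substring(pos, next));
            pos = next + delim.length();          // skip over the delimiter
        }
        return records;
    }
}
```

With delimiter {{\+\+}} the input {{abc\+\+\+def\+\+ghi\+\+}} yields the
records {{abc}}, {{def}} and {{\+ghi}}, independent of any split size.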
The underlying problem is that we set a flag called {{needAdditionalRecord}} in
the {{UncompressedSplitLineReader}} when we fill the buffer and have
encountered part of a delimiter in combination with a split. We keep track of
this in the ambiguous character count. However, it turns out that if the
character(s) found after that point do not belong to a delimiter we do not
unset {{needAdditionalRecord}}. This causes the next record to be read twice,
and thus we see duplicated records.
The solution is to unset the flag when we detect that we are not processing a
delimiter. Currently we only add the ambiguous characters to the record being
read and reset the count to 0; at that same point we need to unset the flag.
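A minimal sketch of the idea behind the fix. This is a simplified model, not
the actual {{UncompressedSplitLineReader}} code; the class, method and field
names are hypothetical, and the real reader works on buffers rather than
single bytes:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified model of the delimiter matching: one byte at a
// time, tracking how many delimiter bytes have been matched so far.
class DelimiterScanner {
    private final byte[] delim;
    private int ambiguousBytes = 0;           // delimiter bytes matched so far
    private boolean needAdditionalRecord = false;

    DelimiterScanner(byte[] delim) { this.delim = delim; }

    // Feed one byte; atSplitEnd marks the last byte before the split boundary.
    // Returns true when a complete delimiter has just been consumed.
    boolean accept(byte b, boolean atSplitEnd, StringBuilder record) {
        if (b == delim[ambiguousBytes]) {
            ambiguousBytes++;
            if (ambiguousBytes == delim.length) {
                ambiguousBytes = 0;
                return true;                  // full delimiter found
            }
            if (atSplitEnd) {
                // Partial delimiter match at the split boundary: the next
                // split may start mid-delimiter, so plan an extra record.
                needAdditionalRecord = true;
            }
            return false;
        }
        if (ambiguousBytes > 0) {
            // The bytes matched so far were record data, not a delimiter:
            // flush them into the record and, crucially, clear the flag.
            record.append(new String(delim, 0, ambiguousBytes,
                                     StandardCharsets.UTF_8));
            ambiguousBytes = 0;
            needAdditionalRecord = false;     // the fix: without this line
                                              // the next record is read twice
            if (b == delim[0]) {
                return accept(b, atSplitEnd, record); // b may start a new match
            }
        }
        record.append((char) b);
        return false;
    }

    boolean needsAdditionalRecord() { return needAdditionalRecord; }
}
```

For example, with delimiter {{\+=}} and data {{a\+b}} split right after the
{{\+}}: the {{\+}} at the boundary sets the flag, and the following {{b}}
proves it was record data, so the flag is cleared again.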
The patch was developed based on JUnit tests that exercise the split and
buffer settings in combination with multiple delimiter types using different
inputs. All cases now return a consistent record count and the correct
position inside the data.
> multibyte delimiters with LineRecordReader cause duplicate records
> ------------------------------------------------------------------
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.7.2
> Reporter: Dustin Cote
> Assignee: Wilfred Spiegelenburg
> Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch
>
>
> LineRecordReader currently produces duplicate records under certain
> scenarios such as:
> 1) input string: "abc+++def++ghi++"
> delimiter string: "+++"
> test passes with all sizes of the split
> 2) input string: "abc++def+++ghi++"
> delimiter string: "+++"
> test fails with a split size of 4
> 3) input string: "abc+++def++ghi++"
> delimiter string: "++"
> test fails with a split size of 5
> 4) input string: "abc+++defg++hij++"
> delimiter string: "++"
> test fails with a split size of 4
> 5) input string: "abc++def+++ghi++"
> delimiter string: "++"
> test fails with a split size of 9
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)