[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated MAPREDUCE-6549:
---------------------------------------------
    Attachment: MAPREDUCE-6549-2.patch

The issue is related to [MAPREDUCE-6481]. That jira changed the position 
calculation and made sure that full records are returned by the reader as 
expected, but it did not anticipate the record duplication, and the junit 
tests did not cover the relevant cases, so the issue went undetected.
As far as I can trace, the problem is limited to multi-byte delimiters.

The junit tests for the multi-byte delimiter only cover the best-case 
scenario: the input data contains the exact delimiter and no ambiguous 
characters. As soon as either the delimiter or the input data is changed, a 
failure is triggered. The failure itself does not clearly show when or how 
things go wrong; analysis of the test failures shows that only a specific 
combination of input data, split size and buffer size triggers it.

Based on testing, the record is duplicated only when all of the following hold:
- the first character(s) of the delimiter appear in the record data, for 
example:
  1) the delimiter is {{\+=}} and the data contains a {{\+}} that is not 
followed by {{=}}
  2) the delimiter is {{\+=\+=}} and the data contains {{\+=\+}} that is not 
followed by {{=}}
- a delimiter character is found at the split boundary: the last character 
before the split ends
- a fill of the buffer is triggered to finish processing the record
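To make the ambiguity concrete, here is a toy scanner (illustrative only, not 
the actual Hadoop LineReader code) that tracks how many delimiter characters 
have matched so far and hands them back to the record when the match fails. 
This restart-on-first-character logic is enough for the repeated-character 
delimiters used in this issue; a fully general reader needs more backtracking.

```java
import java.util.ArrayList;
import java.util.List;

// Toy scanner, illustrative only: the `matched` counter is the same kind of
// bookkeeping the reader has to carry across a buffer fill ("ambiguous"
// characters that may or may not turn out to be a delimiter).
class DelimiterScan {
    static List<String> split(String data, String delim) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int matched = 0; // delimiter characters matched so far
        for (char c : data.toCharArray()) {
            if (c == delim.charAt(matched)) {
                matched++;
                if (matched == delim.length()) { // full delimiter: record ends
                    records.add(current.toString());
                    current.setLength(0);
                    matched = 0;
                }
            } else {
                // Partial match failed: the matched characters were record
                // data after all, so give them back to the current record.
                current.append(delim, 0, matched);
                matched = (c == delim.charAt(0)) ? 1 : 0;
                if (matched == 0) {
                    current.append(c);
                }
            }
        }
        current.append(delim, 0, matched); // trailing partial match is data
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }
}
```

Run against the inputs from the issue description, this yields the record 
counts a correct reader must produce for every split size.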

The underlying problem is that we set a flag called {{needAdditionalRecord}} in 
the {{UncompressedSplitLineReader}} when we fill the buffer and have 
encountered part of a delimiter in combination with a split. We keep track of 
this in the ambiguous character count. However, it turns out that if the 
character(s) found after that point do not belong to a delimiter, we do not 
unset {{needAdditionalRecord}}. This causes the next record to be read 
twice, which is the duplication we see.
The solution is to unset the flag as soon as we detect that we are not 
processing a delimiter. We currently only add the ambiguous characters back to 
the record being read and reset the count to 0; at that same point we need to 
unset the flag.
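The bookkeeping described above can be sketched as follows. Field and method 
names here are illustrative, mirroring the description rather than the 
committed patch:

```java
// Illustrative sketch only; names mirror the description above, not
// necessarily the actual UncompressedSplitLineReader code.
class NeedAdditionalRecordSketch {
    boolean needAdditionalRecord = false;
    int ambiguousByteCount = 0;

    // A buffer fill at the split boundary saw `n` bytes that may start a
    // delimiter: remember them and request a possible extra record.
    void onFillSawPartialDelimiter(int n) {
        ambiguousByteCount = n;
        needAdditionalRecord = true;
    }

    // The following bytes did not complete the delimiter, so the ambiguous
    // bytes are record data. Before the fix, only the count was reset; the
    // stale flag then caused the next record to be read twice.
    void onPartialMatchFailed(StringBuilder record, String delim, boolean withFix) {
        record.append(delim, 0, ambiguousByteCount); // give the bytes back
        ambiguousByteCount = 0;
        if (withFix) {
            needAdditionalRecord = false; // the fix: also clear the flag here
        }
    }
}
```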

The patch was developed against junit tests that exercise the split and buffer 
settings in combination with multiple delimiter types and different inputs. 
All cases now return a consistent record count and the correct position inside 
the data.
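The consistency the tests check can be stated against a simple leftmost-match 
ground truth (a hypothetical helper, not part of the patch): whatever the 
split or buffer size, the reader must produce exactly these records.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical ground-truth helper, not part of the patch: splits on the
// leftmost occurrence of the delimiter, which is what the reader must
// produce regardless of split size or buffer size.
class ExpectedRecords {
    static List<String> of(String data, String delim) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        int i;
        while ((i = data.indexOf(delim, pos)) >= 0) {
            out.add(data.substring(pos, i));
            pos = i + delim.length();
        }
        if (pos < data.length()) {
            out.add(data.substring(pos)); // last record has no trailing delimiter
        }
        return out;
    }
}
```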

> multibyte delimiters with LineRecordReader cause duplicate records
> ------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6549
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.7.2
>            Reporter: Dustin Cote
>            Assignee: Wilfred Spiegelenburg
>         Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch
>
>
> LineRecordReader currently produces duplicate records under certain 
> scenarios such as:
> 1) input string: "abc+++def++ghi++" 
> delimiter string: "+++" 
> test passes with all sizes of the split 
> 2) input string: "abc++def+++ghi++" 
> delimiter string: "+++" 
> test fails with a split size of 4 
> 3) input string: "abc+++def++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 5 
> 4) input string "abc+++defg++hij++" 
> delimiter string: "++" 
> test fails with a split size of 4 
> 5) input string "abc++def+++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 9 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
