[
https://issues.apache.org/jira/browse/MAPREDUCE-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Junping Du updated MAPREDUCE-6481:
----------------------------------
Fix Version/s: 2.8.0
> LineRecordReader may give incomplete record and wrong position/key
> information for uncompressed input sometimes.
> ----------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-6481
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6481
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 2.7.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Fix For: 2.8.0, 2.7.2, 2.6.3, 3.0.0-alpha1
>
> Attachments: MAPREDUCE-6481.000.patch
>
>
> LineRecordReader may give incomplete record and wrong position/key
> information for uncompressed input sometimes.
> There are two issues:
> # LineRecordReader may give an incomplete record: some characters are cut
> off at the end of the record.
> # LineRecordReader may give wrong position/key information.
> The first issue occurs only with a custom delimiter and is caused by the
> following code in {{LineReader#readCustomLine}}:
> {code}
> if (appendLength > 0) {
>   if (ambiguousByteCount > 0) {
>     str.append(recordDelimiterBytes, 0, ambiguousByteCount);
>     //appending the ambiguous characters (refer case 2.2)
>     bytesConsumed += ambiguousByteCount;
>     ambiguousByteCount=0;
>   }
>   str.append(buffer, startPosn, appendLength);
>   txtLength += appendLength;
> }
> {code}
> The bug is triggered when {{appendLength}} is 0 and {{ambiguousByteCount}}
> is not 0: the pending ambiguous bytes are never appended. For example, if
> the input is "123456789aab", the custom delimiter is "ab", bufferSize is 10
> and splitLength is 12, the correct record is "123456789a" with length 10,
> but the current code returns the incomplete record "123456789" with
> length 9.
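
To make the first issue easier to follow, here is a minimal, self-contained sketch of just the append step. It is not the Hadoop source: the class name AmbiguousAppendSketch and both method names are hypothetical, and the second variant (appending the held-back bytes whenever appendLength >= 0) is only an assumed guard for illustration, not necessarily what the attached patch does. The state passed in main is the one from the "123456789aab" example after the first 10-byte buffer has been processed.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified model of the append step; not the Hadoop source.
final class AmbiguousAppendSketch {

    // Mirrors the quoted snippet: the held-back delimiter prefix is appended
    // only when appendLength > 0.
    static String appendAsReported(String str, byte[] delimiter, int ambiguousByteCount,
                                   byte[] buffer, int startPosn, int appendLength) {
        StringBuilder out = new StringBuilder(str);
        if (appendLength > 0) {
            if (ambiguousByteCount > 0) {
                out.append(new String(delimiter, 0, ambiguousByteCount, StandardCharsets.UTF_8));
            }
            out.append(new String(buffer, startPosn, appendLength, StandardCharsets.UTF_8));
        }
        return out.toString();
    }

    // Assumed alternative: handle the held-back bytes independently of whether
    // the current buffer contributes any record text. A negative appendLength is
    // taken to mean the held-back bytes turned out to be part of a delimiter.
    static String appendWithSeparateGuard(String str, byte[] delimiter, int ambiguousByteCount,
                                          byte[] buffer, int startPosn, int appendLength) {
        StringBuilder out = new StringBuilder(str);
        if (appendLength >= 0 && ambiguousByteCount > 0) {
            out.append(new String(delimiter, 0, ambiguousByteCount, StandardCharsets.UTF_8));
        }
        if (appendLength > 0) {
            out.append(new String(buffer, startPosn, appendLength, StandardCharsets.UTF_8));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] delimiter = "ab".getBytes(StandardCharsets.UTF_8);
        // Second buffer holds only the real delimiter "ab", so appendLength is 0.
        byte[] secondBuffer = "ab".getBytes(StandardCharsets.UTF_8);
        // First buffer was "123456789a": "123456789" was appended and the
        // trailing "a" was held back (ambiguousByteCount = 1).
        String soFar = "123456789";
        System.out.println(appendAsReported(soFar, delimiter, 1, secondBuffer, 0, 0));
        System.out.println(appendWithSeparateGuard(soFar, delimiter, 1, secondBuffer, 0, 0));
    }
}
```

Under these assumptions the first call prints "123456789" (the truncated record) and the second prints "123456789a".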
> The second issue can happen with both custom and default delimiters and is
> caused by the code in {{UncompressedSplitLineReader#readLine}}, which may
> report wrong size information in some corner cases. The reason is
> {{unusedBytes}} in the following code:
> {code}
> bytesRead += unusedBytes;
> unusedBytes = bufferSize - getBufferPosn();
> bytesRead -= unusedBytes;
> {code}
> If the last read returned fewer bytes than bufferSize (that is,
> {{bufferLength}} is less than bufferSize), the previous {{unusedBytes}} is
> wrong: it should be {{bufferLength}} - {{bufferPosn}} instead of
> bufferSize - {{bufferPosn}}, so the computed value is too large.
> For example, the input is "1234567890ab12ab345", the custom delimiter is
> "ab", bufferSize is 10 and there are two splits: the first splitLength is
> 15 and the second splitLength is 4.
> The current code gives the following result:
> First record: Key:0 Value:"1234567890"
> Second record: Key:12 Value:"12"
> Third record: Key:21 Value:"345"
> The key for the third record is wrong: it should be 16 instead of 21. This
> is due to the wrong {{unusedBytes}}. {{fillBuffer}} read 10 bytes the first
> time, but the second time it read only 5 bytes, which is 5 bytes less than
> bufferSize. That is why the key is 5 bytes larger than the correct one.
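
The arithmetic behind the second issue can be sketched as follows. This is only an illustration under stated assumptions: the class UnusedBytesSketch and its methods are hypothetical and do not exist in Hadoop; they just compare the two formulas for unusedBytes using the numbers from the example (bufferSize 10, second fill of 5 bytes, expected key 16).

```java
// Hypothetical helper for illustration only; not part of Hadoop.
final class UnusedBytesSketch {

    // Formula quoted in the description: assumes the buffer was filled completely.
    static int unusedAsReported(int bufferSize, int bufferPosn) {
        return bufferSize - bufferPosn;
    }

    // Formula the description says should be used: based on the bytes actually read.
    static int unusedFromLength(int bufferLength, int bufferPosn) {
        return bufferLength - bufferPosn;
    }

    // The difference between the two formulas does not depend on bufferPosn.
    static int overCount(int bufferSize, int bufferLength, int bufferPosn) {
        return unusedAsReported(bufferSize, bufferPosn)
                - unusedFromLength(bufferLength, bufferPosn);
    }

    public static void main(String[] args) {
        int bufferSize = 10;   // from the example
        int bufferLength = 5;  // the second fillBuffer call read only 5 bytes
        int expectedKey = 16;  // correct key of the third record

        // Any valid buffer position gives the same over-count.
        for (int bufferPosn = 0; bufferPosn <= bufferLength; bufferPosn++) {
            System.out.println("bufferPosn=" + bufferPosn
                    + " overCount=" + overCount(bufferSize, bufferLength, bufferPosn));
        }
        int extra = overCount(bufferSize, bufferLength, 0);
        System.out.println("reported key = " + (expectedKey + extra));
    }
}
```

With the example values the over-count is always 5, which matches the reported key of 21 instead of 16.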
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)