[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14805125#comment-14805125
 ] 

zhihai xu commented on MAPREDUCE-6481:
--------------------------------------

[~jlowe], thanks for the review and committing the patch! This patch will 
depend on MAPREDUCE-5948, I can apply the patch cleanly after apply 
MAPREDUCE-5948. Shall we add both MAPREDUCE-5948 and MAPREDUCE-6481 to 2.7.2 
release?

> LineRecordReader may give incomplete record and wrong position/key 
> information for uncompressed input sometimes.
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6481
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6481
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.7.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>             Fix For: 2.8.0
>
>         Attachments: MAPREDUCE-6481.000.patch
>
>
> LineRecordReader may give incomplete record and wrong position/key 
> information for uncompressed input sometimes.
> There are two issues:
> # LineRecordReader may give incomplete record: some characters cut off at the 
> end of record.
> # LineRecordReader may give wrong position/key information.
> The first issue only happens for Custom Delimiter, which is caused by the 
> following code at {{LineReader#readCustomLine}}:
> {code}
>     if (appendLength > 0) {
>         if (ambiguousByteCount > 0) {
>           str.append(recordDelimiterBytes, 0, ambiguousByteCount);
>           //appending the ambiguous characters (refer case 2.2)
>           bytesConsumed += ambiguousByteCount;
>           ambiguousByteCount=0;
>         }
>         str.append(buffer, startPosn, appendLength);
>         txtLength += appendLength;
>       }
> {code}
> If {{appendLength}} is 0 and {{ambiguousByteCount}} is not 0, this bug will 
> be triggered. For example, input is "123456789aab", Custom Delimiter is "ab", 
> bufferSize is 10 and splitLength is 12, the correct record should be 
> "123456789a" with length 10, but we get incomplete record "123456789" with 
> length 9 from current code.
> The second issue can happen for both Custom Delimiter and Default Delimiter, 
> which is caused by the code in {{UncompressedSplitLineReader#readLine}}. 
> {{UncompressedSplitLineReader#readLine}} may report wrong size information at 
> some corner cases. The reason is {{unusedBytes}} in the following code:
> {code}
> bytesRead += unusedBytes;
> unusedBytes = bufferSize - getBufferPosn();
> bytesRead -= unusedBytes;
> {code}
> If the last bytes read (bufferLength) is less than bufferSize, the previous 
> {{unusedBytes}} will be wrong, which should be {{bufferLength}} - 
> {{bufferPosn}} instead of bufferSize - {{bufferPosn}}. It will return larger 
> value.
> For example, input is "1234567890ab12ab345", Custom Delimiter is "ab", 
> bufferSize is 10 and two splits:first splitLength is 15 and second 
> splitLength 4:
> the current code will give the following result:
> First record: Key:0 Value:"1234567890"
> Second record: Key:12 Value:"12"
> Third Record: Key:21 Value:"345"
> You can see the Key for the third record is wrong, it should be 16 instead of 
> 21. It is due to wrong {{unusedBytes}}. {{fillBuffer}} read 10 bytes for the 
> first time, for the second times, it only read 5 bytes, which is 5 bytes less 
> than the bufferSize. That is why the key we get is 5 bytes larger than the 
> correct one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to