[
https://issues.apache.org/jira/browse/MAPREDUCE-5862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983179#comment-13983179
]
Jason Lowe commented on MAPREDUCE-5862:
---------------------------------------
Note that I was not able to get the test to fail with compressed input even
without the proposed fix. The code already throws away the first record when it
isn't processing the first split, and if that skip brings the reported position
past the end of the current split then no records are reported for the split,
since getFilePosition() > end on the first call to nextKeyValue().
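For the record, here is a minimal, self-contained toy of that control flow,
using the offsets from the diagram in the issue description below (split from
100 to 200, record ending at offset 240). All names are local to this sketch,
not Hadoop's actual fields:
{code}
// Compressed-input path: the init-time throw-away read is effectively
// uncapped (maxBytesToConsume returns Integer.MAX_VALUE), so the
// position lands past the split end and the record loop never runs.
public class CompressedSplitSkipDemo {
  public static void main(String[] args) {
    long end = 200;      // end of the middle split [100, 200)
    long lineEnd = 240;  // newline terminating the long record

    // The throw-away read consumes through the record's end because
    // its byte cap is Integer.MAX_VALUE for compressed input.
    long pos = lineEnd;

    // First record read: the loop guard is roughly
    // "getFilePosition() <= end", already false here, so the split
    // correctly reports zero records.
    System.out.println(pos <= end
        ? "would emit records"
        : "pos " + pos + " > end " + end + ": no records for this split");
  }
}
{code}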
The real issue is that we aren't allowing a large enough read to occur for the
first line read when using uncompressed input. Note that when we read a line
during nextKeyValue() the max bytes to consume is computed as
Math.max(maxBytesToConsume(pos), maxLineLength), but when we read the first
"throw-away" record it is just maxBytesToConsume(pos). This isn't an issue for
compressed input, since maxBytesToConsume always returns Integer.MAX_VALUE in
that case, but it's problematic for uncompressed input when the split size is
less than the maximum line length.
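To make the arithmetic concrete, a self-contained toy with the same offsets;
maxBytesToConsume below mirrors the uncompressed branch as I read the 2.3.0
source, and everything else is local to the sketch:
{code}
// Shows why the init-time cap truncates a long line while the
// nextKeyValue() cap does not: 140 bytes are needed to reach the
// newline at offset 240 from the split start at offset 100.
public class ThrowAwayReadCapDemo {
  static final long END = 200;                           // end of split [100, 200)
  static final int MAX_LINE_LENGTH = Integer.MAX_VALUE;  // Hadoop's default

  // Uncompressed branch of LineRecordReader.maxBytesToConsume().
  static int maxBytesToConsume(long pos) {
    return (int) Math.min(Integer.MAX_VALUE, END - pos);
  }

  public static void main(String[] args) {
    long pos = 100;           // start of the second split
    long needed = 240 - pos;  // bytes to finish the spanning line

    int initCap = maxBytesToConsume(pos);  // cap on the throw-away read
    long nextCap = Math.max(maxBytesToConsume(pos), MAX_LINE_LENGTH);

    System.out.println("needed=" + needed);
    System.out.println("init cap=" + initCap
        + (initCap < needed ? "  <- stops mid-line, phantom record follows" : ""));
    System.out.println("nextKeyValue cap=" + nextCap
        + (nextCap < needed ? "  <- stops mid-line" : ""));
  }
}
{code}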
Changing the record read during init to use the same max-bytes computation
that nextKeyValue() uses allows the test to pass, and it is a simpler change.
Arguably maxBytesToConsume() should just take maxLineLength into account
itself so that future callers don't make similar mistakes with tiny split
sizes.
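For illustration, folding maxLineLength in could look roughly like this (field
and method names are as I recall them from the 2.3.0 LineRecordReader; a
sketch, not a committed change):
{code}
// Sketch: make maxBytesToConsume() itself account for maxLineLength
// so every caller, including the init-time throw-away read, allows a
// full-length line to be consumed.
private int maxBytesToConsume(long pos) {
  return isCompressedInput()
      ? Integer.MAX_VALUE
      : (int) Math.max(Math.min(Integer.MAX_VALUE, end - pos),
                       maxLineLength);
}
{code}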
A couple of other comments on the patch:
- There should be a corresponding test for the mapred LineRecordReader
- The test sends down 9-byte splits but moves the offset 10 bytes each time.
Seems to me "splitSize - 1" should be "splitSize" when constructing the
FileSplits in readRecords (see the sketch after this list).
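A sketch of split construction that tiles the file with no gaps; the helper
and variable names here are hypothetical, not the patch's:
{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitTiling {
  // Advance the offset by splitSize and give each split a length of
  // splitSize (clamped at EOF), so no byte falls between consecutive
  // splits.
  static List<FileSplit> makeSplits(Path inputPath, long fileLength,
      long splitSize) {
    List<FileSplit> splits = new ArrayList<FileSplit>();
    for (long offset = 0; offset < fileLength; offset += splitSize) {
      long length = Math.min(splitSize, fileLength - offset);
      splits.add(new FileSplit(inputPath, offset, length, (String[]) null));
    }
    return splits;
  }
}
{code}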
> Line records longer than 2x split size aren't handled correctly
> ---------------------------------------------------------------
>
> Key: MAPREDUCE-5862
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5862
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.3.0
> Reporter: bc Wong
> Assignee: bc Wong
> Priority: Critical
> Attachments: 0001-Handle-records-larger-than-2x-split-size.patch,
> 0001-Handle-records-larger-than-2x-split-size.patch,
> recordSpanningMultipleSplits.txt.bz2
>
>
> Suppose this split (100-200) is in the middle of a record (90-240):
> {noformat}
> 0              100            200             300
> |---- split ----|---- curr ----|---- split ----|
>               <------- record ------->
>              90                     240
> {noformat}
>
> Currently, the first split would read the entire record, up to offset 240,
> which is good. But the 2nd split has a bug: it produces a phantom record of
> (200, 240).