[
https://issues.apache.org/jira/browse/MAPREDUCE-5862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983179#comment-13983179
]
Jason Lowe commented on MAPREDUCE-5862:
---------------------------------------
Note that I was not able to get the test to fail with compressed input even
without the proposed fix. The code already throws away the first record when it
isn't processing the first split, and if that skip brings the reported position
past the end of the current split then no records are reported for the split,
since getFilePosition() > end on the first call to nextKeyValue().
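For the record, here is a minimal, self-contained toy of that control flow,
using the offsets from the diagram in the issue description below (split from
100 to 200, record ending at offset 240). All names are local to this sketch,
not Hadoop's actual fields:
{code}
// Compressed-input path: the init-time throw-away read is effectively
// uncapped (maxBytesToConsume returns Integer.MAX_VALUE), so the
// position lands past the split end and the record loop never runs.
public class CompressedSplitSkipDemo {
  public static void main(String[] args) {
    long end = 200;      // end of the middle split [100, 200)
    long lineEnd = 240;  // newline terminating the long record

    // The throw-away read consumes through the record's end because
    // its byte cap is Integer.MAX_VALUE for compressed input.
    long pos = lineEnd;

    // First record read: the loop guard is roughly
    // "getFilePosition() <= end", already false here, so the split
    // correctly reports zero records.
    System.out.println(pos <= end
        ? "would emit records"
        : "pos " + pos + " > end " + end + ": no records for this split");
  }
}
{code}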
The real issue is that we aren't allowing a large enough read to occur for the
first line read when using uncompressed input. Note that when we read a line
during nextKeyValue() the max bytes to consume is computed as
Math.max(maxBytesToConsume(pos), maxLineLength), but when we read the first
"throw-away" record it is just maxBytesToConsume(pos). This isn't an issue for
compressed input, since maxBytesToConsume always returns Integer.MAX_VALUE in
that case, but it's problematic for uncompressed input when the split size is
less than the maximum line length.
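To make the arithmetic concrete, a self-contained toy with the same offsets;
maxBytesToConsume below mirrors the uncompressed branch as I read the 2.3.0
source, and everything else is local to the sketch:
{code}
// Shows why the init-time cap truncates a long line while the
// nextKeyValue() cap does not: 140 bytes are needed to reach the
// newline at offset 240 from the split start at offset 100.
public class ThrowAwayReadCapDemo {
  static final long END = 200;                           // end of split [100, 200)
  static final int MAX_LINE_LENGTH = Integer.MAX_VALUE;  // Hadoop's default

  // Uncompressed branch of LineRecordReader.maxBytesToConsume().
  static int maxBytesToConsume(long pos) {
    return (int) Math.min(Integer.MAX_VALUE, END - pos);
  }

  public static void main(String[] args) {
    long pos = 100;           // start of the second split
    long needed = 240 - pos;  // bytes to finish the spanning line

    int initCap = maxBytesToConsume(pos);  // cap on the throw-away read
    long nextCap = Math.max(maxBytesToConsume(pos), MAX_LINE_LENGTH);

    System.out.println("needed=" + needed);
    System.out.println("init cap=" + initCap
        + (initCap < needed ? "  <- stops mid-line, phantom record follows" : ""));
    System.out.println("nextKeyValue cap=" + nextCap
        + (nextCap < needed ? "  <- stops mid-line" : ""));
  }
}
{code}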
Changing the record read during init to use the same max-bytes computation
that nextKeyValue() uses allows the test to pass, and it is a simpler change.
Arguably maxBytesToConsume() should just take maxLineLength into account
itself so that future callers don't make similar mistakes with tiny split
sizes.
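For illustration, folding maxLineLength in could look roughly like this (field
and method names are as I recall them from the 2.3.0 LineRecordReader; a
sketch, not a committed change):
{code}
// Sketch: make maxBytesToConsume() itself account for maxLineLength
// so every caller, including the init-time throw-away read, allows a
// full-length line to be consumed.
private int maxBytesToConsume(long pos) {
  return isCompressedInput()
      ? Integer.MAX_VALUE
      : (int) Math.max(Math.min(Integer.MAX_VALUE, end - pos),
                       maxLineLength);
}
{code}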
A couple of other comments on the patch:
- There should be a corresponding test for the mapred LineRecordReader
- The test sends down 9-byte splits but moves the offset 10 bytes each time.
Seems to me "splitSize - 1" should be "splitSize" when constructing the
FileSplits in readRecords (see the sketch after this list).
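A sketch of split construction that tiles the file with no gaps; the helper
and variable names here are hypothetical, not the patch's:
{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitTiling {
  // Advance the offset by splitSize and give each split a length of
  // splitSize (clamped at EOF), so no byte falls between consecutive
  // splits.
  static List<FileSplit> makeSplits(Path inputPath, long fileLength,
      long splitSize) {
    List<FileSplit> splits = new ArrayList<FileSplit>();
    for (long offset = 0; offset < fileLength; offset += splitSize) {
      long length = Math.min(splitSize, fileLength - offset);
      splits.add(new FileSplit(inputPath, offset, length, (String[]) null));
    }
    return splits;
  }
}
{code}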
> Line records longer than 2x split size aren't handled correctly
> ---------------------------------------------------------------
>
> Key: MAPREDUCE-5862
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5862
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.3.0
> Reporter: bc Wong
> Assignee: bc Wong
> Priority: Critical
> Attachments: 0001-Handle-records-larger-than-2x-split-size.patch,
> 0001-Handle-records-larger-than-2x-split-size.patch,
> recordSpanningMultipleSplits.txt.bz2
>
>
> Suppose this split (100-200) is in the middle of a record (90-240):
> {noformat}
> 0              100            200             300
> |---- split ----|---- curr ----|---- split ----|
>               <------- record ------->
>              90                     240
> {noformat}
>
> Currently, the first split would read the entire record, up to offset 240,
> which is good. But the 2nd split has a bug: it produces a phantom record of
> (200, 240).