[
https://issues.apache.org/jira/browse/HADOOP-18321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ashutosh Gupta updated HADOOP-18321:
------------------------------------
Description:
Fix a data correctness issue in TextInputFormat that can occur when reading
BZip2-compressed text files. When triggered, this bug causes a split to also
return the first record of the succeeding split (the split that reads the next
BZip2 block), thereby duplicating that record.
*When is the bug triggered?*
The bug occurs when the "needAdditionalRecord" flag in
CompressedSplitLineReader is set to true by #fillBuffer at an inappropriate
time: before we have read the remaining bytes of the split. This can happen
when the inDelimiter parameter is true while #fillBuffer is invoked during a
read of the next line. The inDelimiter parameter is true when either 1) the
last byte of the buffer is a CR character ('\r'), if using the default
delimiters, or 2) the last bytes of the buffer are a common prefix of the
delimiter, if using a custom delimiter.
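The two inDelimiter conditions above can be sketched as follows. This is a minimal illustration, not the actual Hadoop code; the class and helper names are hypothetical:

```java
// Sketch of when the inDelimiter flag would be true for a
// just-filled buffer (hypothetical helpers, not Hadoop internals).
public class InDelimiterSketch {
    // Default delimiters: inDelimiter is true when the buffer ends
    // with CR, since the reader must peek ahead for a following LF.
    static boolean inDelimiterDefault(byte[] buf, int len) {
        return len > 0 && buf[len - 1] == '\r';
    }

    // Custom delimiter: inDelimiter is true when the tail of the
    // buffer is a non-empty proper prefix of the delimiter (a full
    // delimiter would already have ended the record).
    static boolean inDelimiterCustom(byte[] buf, int len, byte[] delim) {
        for (int p = Math.min(len, delim.length - 1); p > 0; p--) {
            boolean match = true;
            for (int i = 0; i < p; i++) {
                if (buf[len - p + i] != delim[i]) { match = false; break; }
            }
            if (match) return true;
        }
        return false;
    }
}
```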
This can occur in various edge cases, illustrated by the unit tests added in
this change; specifically, the five that would fail without the fix are listed
below:
# BaseTestLineRecordReaderBZip2.customDelimiter_lastRecordDelimiterStartsAtNextBlockStart
# BaseTestLineRecordReaderBZip2.firstBlockEndsWithLF_secondBlockStartsWithCR
# BaseTestLineRecordReaderBZip2.delimitedByCRSpanningThreeBlocks
# BaseTestLineRecordReaderBZip2.usingCRDelimiterWithSmallestBufferSize
# BaseTestLineRecordReaderBZip2.customDelimiter_lastThreeBytesInBlockAreDelimiter
For background, the purpose of the "needAdditionalRecord" field in
CompressedSplitLineReader is to indicate to LineRecordReader, via the
#needAdditionalRecordAfterSplit method, that an extra record lying beyond the
split range should be included in the split. This complication arises from a
fundamental problem with splitting text files: when a split starts at a
position greater than zero, we do not know whether its first line belongs to
the last record of the prior split or is a new record. The solution adopted in
Hadoop is to have every split that starts at a position greater than zero
discard its first line, and to let the prior split decide whether to include
the first line of the next split (either as part of its last record or as a
new record). This works well even when a single line spans multiple splits.
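The split convention described above can be sketched as follows. This is a simplified stand-in over an in-memory string, with hypothetical names; the real LineRecordReader works over streams and byte offsets:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitConventionSketch {
    // Return the records owned by the split covering bytes
    // [start, end) of data, following the convention described
    // above: a split starting past 0 discards its first line, and
    // every split reads any line that starts at or before 'end',
    // so the line straddling the boundary is read exactly once.
    static List<String> readSplit(String data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start > 0) {
            // Discard the first (possibly partial) line; the prior
            // split is responsible for it.
            while (pos < data.length() && data.charAt(pos++) != '\n') { }
        }
        while (pos <= end && pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            if (nl < 0) nl = data.length();
            records.add(data.substring(pos, nl));
            pos = nl + 1;  // the last line read may run past 'end'
        }
        return records;
    }
}
```

Reading consecutive splits of the same data yields each record exactly once, even when a record straddles a split boundary.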
*What is the fix?*
The fix is to never set "needAdditionalRecord" unless the bytes filled into
the buffer are the bytes immediately beyond the range of the split.
When reading compressed data, CompressedSplitLineReader requires (and assumes)
that the stream's #read method never returns bytes from more than one
compression block at a time. This guarantees that #fillBuffer is invoked to
read the first byte of the next block. That next block may or may not be part
of the split we are reading. If we detect that the last bytes of the prior
block may be part of a delimiter, we may decide to read an additional record,
but we should only do so when that next block is not part of our split *and*
we are not filling the buffer again beyond our split range. This is because we
only care whether we need to read the very first line of the next split as a
separate record. If that line is going to be part of the last record, we do
not need to read an extra record; and in the special case of CR + LF (i.e.
"\r\n"), if the LF is the first byte of the next split, it is treated as an
empty line, so we do not need to include an extra record either.
To emphasize: what matters is the moment we read the first bytes outside our
split range. The current logic in CompressedSplitLineReader does not take that
into account; this is in contrast to UncompressedSplitLineReader, which does.
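The guard the fix describes can be sketched as follows, using hypothetical field and method names rather than the actual Hadoop internals: only the first buffer fill that crosses the split boundary may decide that an extra record is needed.

```java
// Hedged sketch of the fixed condition (illustrative names only).
public class FillBufferGuardSketch {
    private final long splitEnd;      // end of the split range
    private long position;            // stream position before this fill
    private boolean needAdditionalRecord;
    private boolean crossedSplitEnd;  // already filled past the split?

    FillBufferGuardSketch(long splitEnd, long position) {
        this.splitEnd = splitEnd;
        this.position = position;
    }

    // inDelimiter: the previous buffer ended on a possible delimiter
    // prefix. bytesRead: size of the block the stream just returned.
    void onFill(boolean inDelimiter, int bytesRead) {
        boolean firstFillBeyondSplit =
            !crossedSplitEnd && position >= splitEnd;
        // Only the fill that reads the first bytes immediately
        // outside the split range may request an extra record;
        // fills still inside the split must never set the flag.
        if (inDelimiter && firstFillBeyondSplit) {
            needAdditionalRecord = true;
        }
        if (position >= splitEnd) {
            crossedSplitEnd = true;
        }
        position += bytesRead;
    }

    boolean needAdditionalRecordAfterSplit() {
        return needAdditionalRecord;
    }
}
```

Fills inside the split leave the flag untouched even when inDelimiter is true, which is exactly the inappropriate case the bug description identifies.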
was:
Fix a data correctness issue in TextInputFormat that can occur when reading
BZip2-compressed text files. When triggered, this bug causes a split to also
return the first record of the succeeding split (the split that reads the next
BZip2 block), thereby duplicating that record.
*When is the bug triggered?*
The bug occurs when the "needAdditionalRecord" flag in
CompressedSplitLineReader is set to true by #fillBuffer at an inappropriate
time: before we have read the remaining bytes of the split. This can happen
when the inDelimiter parameter is true while #fillBuffer is invoked during a
read of the next line. The inDelimiter parameter is true when either 1) the
last byte of the buffer is a CR character ('\r'), if using the default
delimiters, or 2) the last bytes of the buffer are a common prefix of the
delimiter, if using a custom delimiter.
> Fix when to read an additional record from a BZip2 text file split
> ------------------------------------------------------------------
>
> Key: HADOOP-18321
> URL: https://issues.apache.org/jira/browse/HADOOP-18321
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 3.3.3
> Reporter: Ashutosh Gupta
> Assignee: Ashutosh Gupta
> Priority: Critical
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]