[ https://issues.apache.org/jira/browse/HADOOP-18321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Gupta updated HADOOP-18321:
------------------------------------
    Description: 
Fix a data correctness issue in TextInputFormat that can occur when reading 
BZip2 compressed text files. When triggered, this bug causes a split to 
return the first record of the succeeding split that reads the next BZip2 
block, thereby duplicating that record.

*When is the bug triggered?*

The bug occurs when the flag "needAdditionalRecord" in 
CompressedSplitLineReader is set to true by #fillBuffer at an inappropriate 
time: when we haven't yet read the remaining bytes of the split. This can 
happen when the inDelimiter parameter is true while #fillBuffer is invoked 
during the reading of the next line. The inDelimiter parameter is true when 
either 1) the last byte of the buffer is a CR character ('\r'), if using the 
default delimiters, or 2) the last bytes of the buffer are a common prefix of 
the delimiter, if using a custom delimiter.
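To illustrate the two conditions above, here is a hedged sketch (the method names are hypothetical, not the actual CompressedSplitLineReader code) of when a buffer boundary leaves the reader "in a delimiter":

```java
// Hypothetical illustration of the inDelimiter state at a buffer boundary.
// These helpers are NOT Hadoop's implementation; they only model the two
// cases described above.
public class InDelimiterCheck {

    // Default delimiters: the state is ambiguous when the buffer's last byte
    // is CR, because the reader cannot yet tell whether an LF follows in the
    // next fill.
    static boolean inDelimiterDefault(byte[] buffer, int len) {
        return len > 0 && buffer[len - 1] == '\r';
    }

    // Custom delimiter: the state is ambiguous when the buffer ends with a
    // non-empty proper prefix of the delimiter, e.g. buffer "...ab" with
    // delimiter "abc".
    static boolean inDelimiterCustom(byte[] buffer, int len, byte[] delim) {
        for (int p = Math.min(len, delim.length - 1); p > 0; p--) {
            boolean match = true;
            for (int i = 0; i < p; i++) {
                if (buffer[len - p + i] != delim[i]) { match = false; break; }
            }
            if (match) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] b1 = "last line\r".getBytes();
        System.out.println(inDelimiterDefault(b1, b1.length));       // true

        byte[] b2 = "recordab".getBytes();
        byte[] delim = "abc".getBytes();
        System.out.println(inDelimiterCustom(b2, b2.length, delim)); // true
    }
}
```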

This can occur in various edge cases, illustrated by the unit tests added in 
this change. The five tests that would fail without the fix are listed below:
 # 
BaseTestLineRecordReaderBZip2.customDelimiter_lastRecordDelimiterStartsAtNextBlockStart
 # BaseTestLineRecordReaderBZip2.firstBlockEndsWithLF_secondBlockStartsWithCR
 # BaseTestLineRecordReaderBZip2.delimitedByCRSpanningThreeBlocks
 # BaseTestLineRecordReaderBZip2.usingCRDelimiterWithSmallestBufferSize
 # 
BaseTestLineRecordReaderBZip2.customDelimiter_lastThreeBytesInBlockAreDelimiter

For background, the purpose of the "needAdditionalRecord" field in 
CompressedSplitLineReader is to indicate to LineRecordReader, via the 
#needAdditionalRecordAfterSplit method, that an extra record lying beyond the 
split range should be included in the split. This complication arises from a 
fundamental problem with splitting text files: when a split starts at a 
position greater than zero, we do not know whether its first line belongs to 
the last record of the prior split or is a new record. The solution adopted in 
Hadoop is to have every split that starts at a position greater than zero 
discard its first line, and to have the prior split decide whether to include 
the first line of the next split (as part of its last record or as a new 
record). This works well even in the case of a single line spanning multiple 
splits.
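The split convention described above can be sketched as follows. This is a minimal simulation over an in-memory string, not Hadoop's actual reader: a split starting at position > 0 discards everything up to and including the first '\n', and every split keeps reading any line that starts inside its range, even if it crosses the end boundary.

```java
// Minimal sketch of the "discard first line, finish the crossing line"
// split convention. Illustrative only; not LineRecordReader's code.
import java.util.ArrayList;
import java.util.List;

public class SplitConvention {

    static List<String> readSplit(String data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // Splits after the first discard their (possibly partial) first
        // line; the previous split is responsible for it.
        if (start > 0) {
            int nl = data.indexOf('\n', pos);
            if (nl < 0) return records;
            pos = nl + 1;
        }
        // Emit every line that *starts* inside [start, end), reading it
        // fully even when it crosses the split end.
        while (pos < data.length() && pos < end) {
            int nl = data.indexOf('\n', pos);
            if (nl < 0) nl = data.length();
            records.add(data.substring(pos, nl));
            pos = nl + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "aaa\nbbbbbb\ncc\n";   // 14 bytes
        // Two splits of 7 bytes each: no record is lost or duplicated.
        System.out.println(readSplit(data, 0, 7));   // [aaa, bbbbbb]
        System.out.println(readSplit(data, 7, 14));  // [cc]
    }
}
```

Note how the second line, which spans the 7-byte boundary, is read entirely by the first split, while the second split skips the tail of that line.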

*What is the fix?*

The fix is to prevent ever setting "needAdditionalRecord" if the bytes filled 
to the buffer are not the bytes immediately outside the range of the split.

When reading compressed data, CompressedSplitLineReader requires/assumes that 
the stream's #read method never returns bytes from more than one compression 
block at a time. This ensures that #fillBuffer gets invoked to read the first 
byte of the next block. This next block may or may not be part of the split we 
are reading. If we detect that the last bytes of the prior block may be part 
of a delimiter, then we may decide that we should read an additional record, 
but we should only do that when this next block is not part of our split *and* 
we aren't filling the buffer again beyond our split range. This is because we 
are only concerned with whether we need to read the very first line of the 
next split as a separate record. If it is going to be part of the last record, 
then we don't need to read an extra record; and in the special case of CR + LF 
(i.e. "\r\n"), if the LF is the first byte of the next split, it will be 
treated as an empty line, so we don't need to include an extra record either.
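The core condition of the fix can be sketched as a single predicate. The names below are illustrative and do not match the patch verbatim; the point is that "needAdditionalRecord" may only be set by the fill that delivers the first bytes beyond the split end:

```java
// Hedged sketch of the fix's condition, not the actual Hadoop patch.
public class AdditionalRecordGuard {

    // inDelimiter:        the previous buffer ended in a possible delimiter
    //                     prefix.
    // fillStartOffset:    stream offset at which this fill begins.
    // splitEnd:           end offset (exclusive) of the split being read.
    // alreadyBeyondSplit: a previous fill already crossed the split end.
    static boolean shouldSetNeedAdditionalRecord(
            boolean inDelimiter, long fillStartOffset, long splitEnd,
            boolean alreadyBeyondSplit) {
        return inDelimiter && fillStartOffset >= splitEnd && !alreadyBeyondSplit;
    }

    public static void main(String[] args) {
        // Fill still inside the split: must not request an extra record.
        // (This is the inappropriate case the bug description identifies.)
        System.out.println(shouldSetNeedAdditionalRecord(true, 100, 200, false)); // false
        // First fill beyond the split end: may request an extra record.
        System.out.println(shouldSetNeedAdditionalRecord(true, 200, 200, false)); // true
        // A later fill beyond the split end: too late, must not.
        System.out.println(shouldSetNeedAdditionalRecord(true, 300, 200, true));  // false
    }
}
```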

To emphasize: what matters is the moment we read the first bytes outside our 
split range. The current logic in CompressedSplitLineReader doesn't take that 
into account, in contrast to UncompressedSplitLineReader, which does.

  was:
Fix data correctness issue with TextInputFormat that can occur when reading 
BZip2 compressed text files. When triggered this bug would cause a split to 
return the first record of the succeeding split that reads the next BZip2 
block, thereby duplicating that record.

*When the bug is triggered?*

The condition for the bug to occur requires that the flag 
"needAdditionalRecord" in CompressedSplitLineReader to be set to true by 
#fillBuffer at an inappropriate time: when we haven't read the remaining bytes 
of split. This can happen when the inDelimiter parameter is true while 
#fillBuffer is invoked while reading the next line. The inDelimiter parameter 
is true when either 1) the last byte of the buffer is a CR character ('\r') if 
using the default delimiters, or 2) the last bytes of the buffer are a common 
prefix of the delimiter if using a custom delimiter.


> Fix when to read an additional record from a BZip2 text file split
> ------------------------------------------------------------------
>
>                 Key: HADOOP-18321
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18321
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 3.3.3
>            Reporter: Ashutosh Gupta
>            Assignee: Ashutosh Gupta
>            Priority: Critical
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
