groot created HADOOP-18400:
------------------------------
Summary: Fix file split duplicating records from a succeeding
split when reading BZip2 text files
Key: HADOOP-18400
URL: https://issues.apache.org/jira/browse/HADOOP-18400
Project: Hadoop Common
Issue Type: Bug
Affects Versions: 3.3.3, 3.3.4
Reporter: groot
Assignee: groot
Fix a data correctness issue with TextInputFormat that can occur when reading
BZip2-compressed text files. When a file split's range does not include the
start position of any BZip2 block, the split is expected to contain no records
(i.e. the split is empty). However, if the exclusive end of such a split
coincides with the start of a BZip2 block, LineRecordReader returns all of
the records in that block. This duplicates records read by the job, because
the next split also returns all of the records for the same block (its range
includes the block's start).
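The ownership rule described above can be sketched as follows. This is a hypothetical, simplified illustration, not Hadoop code: `blocksOwned` and the offsets used are invented for the example. A split [start, end) should own exactly the blocks whose start offset falls inside it, so a split ending exactly at a block start owns nothing from that block, and the next split owns it instead.

```java
// Hypothetical sketch (not Hadoop code): a split [start, end) should own
// exactly the BZip2 blocks whose start offset falls inside its range.
public class SplitOwnership {

    // Count the blocks whose start offset lies in [start, end).
    static int blocksOwned(long start, long end, long[] blockStarts) {
        int owned = 0;
        for (long blockStart : blockStarts) {
            if (blockStart >= start && blockStart < end) {
                owned++;
            }
        }
        return owned;
    }

    public static void main(String[] args) {
        // Invented block offsets: blocks begin at bytes 0, 200, and 400.
        long[] blockStarts = {0, 200, 400};

        // A split whose exclusive end lands exactly on a block start
        // owns no blocks, so it should emit no records.
        System.out.println(blocksOwned(100, 200, blockStarts)); // prints 0

        // The next split includes that block's start, so it owns the block.
        System.out.println(blocksOwned(200, 400, blockStarts)); // prints 1
    }
}
```

The bug in the report amounts to the first split returning the block's records even though `blocksOwned` for its range is zero, so the same records are emitted twice across the two splits.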
This bug is not triggered when the file split's range does include the start
of at least one block and ends just before the start of another block. The
reason has to do with when BZip2CompressionInputStream updates its position
under the BYBLOCK read mode: in this mode, the stream's position is only
advanced when the first byte past an end-of-block marker is read. The bug is
that if the stream was adjusted at initialization to sit at the end of one
block, the position is not updated after the first byte of the next block is
read; instead, it stays equal to the block marker the stream was initialized
to. If the exclusive end position of the split equals the stream's position,
LineRecordReader continues to read lines until the position is updated (and
an additional record in the next block is read if needed).
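The reader-loop behavior above can be simulated with a small sketch. This is a hypothetical simplification, not Hadoop code: `recordsReturned`, its parameters, and the offsets are invented to illustrate the loop guard. LineRecordReader keeps reading while the stream's reported position is at or before the split's exclusive end, so a position that never advances past the split end causes the whole next block to be consumed.

```java
// Hypothetical simulation (not Hadoop code) of the reader loop: records are
// consumed while the stream's reported position is <= the split's
// exclusive end.
public class PositionLoopSketch {

    static int recordsReturned(long splitEnd, long initialPos,
                               int recordsInNextBlock, boolean positionAdvances) {
        long pos = initialPos;
        int count = 0;
        for (int i = 0; i < recordsInNextBlock; i++) {
            if (pos > splitEnd) {
                break;              // loop guard: stop once past the split end
            }
            count++;                // one more record consumed
            if (positionAdvances) {
                pos = splitEnd + 1; // position advances past the block marker
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Expected behavior: the position advances once the first byte past
        // the block marker is read, so at most one extra record comes back.
        System.out.println(recordsReturned(200, 200, 5, true));  // prints 1

        // Buggy behavior: the position stays pinned at the marker the stream
        // was initialized to, so every record of the next block comes back.
        System.out.println(recordsReturned(200, 200, 5, false)); // prints 5
    }
}
```

The `false` case corresponds to the duplication in this issue: the stale position keeps the loop guard satisfied, so the split returns all records of a block that the following split will return again.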
--
This message was sent by Atlassian Jira
(v8.20.10#820010)