[jira] [Updated] (HADOOP-18400) Fix file split duplicating records from a succeeding split when reading BZip2 text files

ASF GitHub Bot (Jira) Wed, 10 Aug 2022 18:46:04 -0700


     [ 
https://issues.apache.org/jira/browse/HADOOP-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ASF GitHub Bot updated HADOOP-18400:
------------------------------------
    Labels: pull-request-available  (was: )

>  Fix file split duplicating records from a succeeding split when reading 
> BZip2 text files 
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-18400
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18400
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 3.3.3, 3.3.4
>            Reporter: groot
>            Assignee: groot
>            Priority: Critical
>              Labels: pull-request-available
>
> Fix a data correctness issue with TextInputFormat that can occur when reading 
> BZip2-compressed text files. A file split whose range does not include the 
> start position of any BZip2 block is expected to contain no records 
> (i.e. the split is empty). However, if the exclusive end of such a split 
> falls exactly at the start of a BZip2 block, LineRecordReader returns all 
> the records in that block. The job then reads those records twice, because 
> the next split also returns all the records in the same block (its range 
> includes the start of that block).
> This bug is not triggered when the file split's range includes the start of 
> at least one block and ends just before the start of another block. The 
> reason lies in when BZip2CompressionInputStream updates its position in the 
> BYBLOCK READMODE. In this read mode, the stream's position is updated only 
> when the first byte past an end-of-block marker is read. The bug is that if 
> the stream, when initialized, was adjusted to sit at the end of one block, 
> the position is not updated after the first byte of the next block is read. 
> Instead, the position stays equal to the block marker the stream was 
> initialized to. If the exclusive end position of the split equals the 
> stream's position, LineRecordReader continues to read lines until the 
> position is updated (and an additional record in the next block is read if 
> needed).
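
The boundary condition described above can be illustrated with a small toy
model. This is a sketch under stated assumptions, not Hadoop's code: the class
Bzip2SplitSketch, its methods, the block offsets, the "marker + 1" advertised
position, and the rule that a non-first split discards its first record are
all hypothetical simplifications chosen for illustration. The flag
advertiseOnFirstRead = false models a stream whose position stays at the
marker it was initialized to.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: names, offsets, and rules here are illustrative assumptions,
// not the actual LineRecordReader or BZip2CompressionInputStream logic.
class Bzip2SplitSketch {

    /**
     * Simulates reading one split [start, end). blockStarts[i] is the offset of
     * block i; every block holds recordsPerBlock (>= 1) records named "b<i>r<j>".
     * With advertiseOnFirstRead == false, the reported position is left at the
     * marker the stream was initialized to (the behavior described as the bug).
     */
    static List<String> readSplit(long[] blockStarts, int recordsPerBlock,
                                  long start, long end, boolean advertiseOnFirstRead) {
        List<String> records = new ArrayList<>();

        // Initialization: move forward to the first block marker at or after start.
        int block = 0;
        while (block < blockStarts.length && blockStarts[block] < start) {
            block++;
        }
        if (block == blockStarts.length) {
            return records; // no block begins at or after start
        }

        long pos = blockStarts[block];    // position reported right after initialization
        int rec = 0;                      // next record index inside the current block
        boolean firstRead = true;
        boolean discardNext = start != 0; // model assumption: non-first splits drop one record

        while (discardNext || pos <= end) {
            if (rec == recordsPerBlock) {
                // Current block exhausted: the read crosses an end-of-block marker.
                block++;
                rec = 0;
                if (block == blockStarts.length) {
                    break; // end of file
                }
                pos = blockStarts[block] + 1; // position advertised past the marker
            } else if (firstRead && advertiseOnFirstRead) {
                pos = blockStarts[block] + 1; // advertise after the very first byte too
            }
            firstRead = false;

            String record = "b" + block + "r" + rec;
            rec++;
            if (discardNext) {
                discardNext = false;
            } else {
                records.add(record);
            }
        }
        return records;
    }

    /** Reads every fixed-size split of the file and concatenates the results. */
    static List<String> readAll(long[] blockStarts, int recordsPerBlock,
                                long fileLength, long splitSize, boolean advertiseOnFirstRead) {
        List<String> all = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            long end = Math.min(start + splitSize, fileLength);
            all.addAll(readSplit(blockStarts, recordsPerBlock, start, end, advertiseOnFirstRead));
        }
        return all;
    }

    public static void main(String[] args) {
        long[] blockStarts = {0, 100, 200}; // hypothetical block offsets
        System.out.println("stale position, split [50,100):   "
                + readSplit(blockStarts, 3, 50, 100, false));
        System.out.println("stale position, split [100,150):  "
                + readSplit(blockStarts, 3, 100, 150, false));
        System.out.println("updated position, split [50,100): "
                + readSplit(blockStarts, 3, 50, 100, true));
    }
}
```

In this model the split [50, 100) contains no block start and ends exactly at
the block that begins at offset 100. With the stale position, it returns the
same records as the split [100, 150), which is the duplication the issue
describes. With the position updated after the first byte, the split is empty.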



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

