[ 
https://issues.apache.org/jira/browse/HADOOP-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578203#comment-17578203
 ] 

ASF GitHub Bot commented on HADOOP-18400:
-----------------------------------------

ashutoshcipher opened a new pull request, #4732:
URL: https://github.com/apache/hadoop/pull/4732

   ### Description of PR
   
   Fix file split duplicating records from a succeeding split when reading 
BZip2 text files.
   
   JIRA - HADOOP-18400
   
   
   ### How was this patch tested?
   
   Added unit tests.
   
   
   ### For code changes:
   
   - [X] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   




>  Fix file split duplicating records from a succeeding split when reading 
> BZip2 text files 
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-18400
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18400
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 3.3.3, 3.3.4
>            Reporter: groot
>            Assignee: groot
>            Priority: Critical
>
> Fix a data correctness issue in TextInputFormat that can occur when reading 
> BZip2-compressed text files. When a file split's range does not include the 
> start position of any BZip2 block, the split is expected to contain no 
> records (i.e. the split is empty). However, if the exclusive end of such a 
> split falls exactly at the start of a BZip2 block, LineRecordReader returns 
> all the records of that block. The job then reads those records twice, 
> because the next split also returns all the records of the same block 
> (its range includes the start of that block).
> The bug is not triggered when the file split's range includes the start of 
> at least one block and ends just before the start of another block. 
> The reason lies in when BZip2CompressionInputStream updates its position 
> in the BYBLOCK READMODE. In this read mode, the stream's position is 
> updated only when the first byte past an end-of-block marker is read. 
> The bug is that if the stream, when initialized, was adjusted to sit at the 
> end of one block, the position is not updated after the first byte of the 
> next block is read. Instead, the position stays equal to the block marker 
> the stream was initialized to. If the exclusive end position of the split 
> equals the stream's position, LineRecordReader continues to read lines 
> until the position is updated (and an additional record in the next block 
> is read if needed).
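
The trigger condition described above can be sketched with a toy model. This
is not Hadoop code: the class `Bzip2SplitSketch`, its method names, and the
block offsets are all hypothetical, and the model works on whole blocks
rather than on lines or stream positions. It only encodes the ownership rule
from the description (a split reads the blocks that start inside its range)
and the reported exception to that rule.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of split-to-block ownership. Hypothetical names, not Hadoop code. */
class Bzip2SplitSketch {

    /** Blocks a split is expected to read: those starting in [start, end). */
    public static List<Long> expectedBlocks(long[] blockStarts, long start, long end) {
        List<Long> owned = new ArrayList<>();
        for (long blockStart : blockStarts) {
            if (blockStart >= start && blockStart < end) {
                owned.add(blockStart);
            }
        }
        return owned;
    }

    /**
     * Trigger condition from the report: no block starts inside the split,
     * but a block starts exactly at the split's exclusive end.
     */
    public static boolean hitsReportedBug(long[] blockStarts, long start, long end) {
        if (!expectedBlocks(blockStarts, start, end).isEmpty()) {
            return false;
        }
        for (long blockStart : blockStarts) {
            if (blockStart == end) {
                return true;
            }
        }
        return false;
    }

    /** Blocks read under the reported behaviour: expected blocks plus the block at `end`. */
    public static List<Long> blocksReadWithBug(long[] blockStarts, long start, long end) {
        List<Long> read = expectedBlocks(blockStarts, start, end);
        if (hitsReportedBug(blockStarts, start, end)) {
            read.add(end);
        }
        return read;
    }

    public static void main(String[] args) {
        long[] blockStarts = {0L, 100L, 200L}; // made-up offsets for illustration

        // Empty split whose exclusive end sits on a block start: reads that block.
        System.out.println(blocksReadWithBug(blockStarts, 50L, 100L));  // [100]
        // The next split owns the same block, so its records are read twice.
        System.out.println(blocksReadWithBug(blockStarts, 100L, 150L)); // [100]
        // A split that contains a block start and ends on another is unaffected.
        System.out.println(blocksReadWithBug(blockStarts, 0L, 100L));   // [0]
    }
}
```

In this model the splits [50, 100) and [100, 150) both return the block at
offset 100, which is the duplication the issue describes; the split [0, 100)
behaves as expected.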



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
