[
https://issues.apache.org/jira/browse/HADOOP-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578203#comment-17578203
]
ASF GitHub Bot commented on HADOOP-18400:
-----------------------------------------
ashutoshcipher opened a new pull request, #4732:
URL: https://github.com/apache/hadoop/pull/4732
### Description of PR
Fix file split duplicating records from a succeeding split when reading
BZip2 text files.
JIRA - HADOOP-18400
### How was this patch tested?
Added Unit tests.
### For code changes:
- [X] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?
> Fix file split duplicating records from a succeeding split when reading
> BZip2 text files
> ------------------------------------------------------------------------------------------
>
> Key: HADOOP-18400
> URL: https://issues.apache.org/jira/browse/HADOOP-18400
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 3.3.3, 3.3.4
> Reporter: groot
> Assignee: groot
> Priority: Critical
>
> Fix data correctness issue with TextInputFormat that can occur when reading
> BZip2 compressed text files. When a file split's range does not include the
> start position of a BZip2 block, then it is expected to contain no records
> (i.e. the split is empty). However, if it so happens that the end of this
> split (exclusive) is at the start of a BZip2 block, then LineRecordReader
> ends up returning all the records for that BZip2 block. This ends up
> duplicating records read by a job because the next split would also end up
> returning all the records for the same block (since its range would include
> the start of that block).
> This bug does not get triggered when the file split's range does include the
> start of at least one block and ends just before the start of another block.
> The reason lies in when BZip2CompressionInputStream updates its position in
> the BYBLOCK read mode. In this mode, the stream's position is updated only
> when reading the first byte past an end-of-block marker. The bug is that if
> the stream, when initialized, was adjusted to be at the end of one block,
> then the position is not updated after the first byte of the next block is
> read. Instead, the position stays equal to the block marker the stream was
> initialized to. If the exclusive end position of the split is equal to the
> stream's position, LineRecordReader will continue to read lines until the
> position is updated (and an additional record in the next block is read if
> needed).
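
The mechanism described in the issue can be illustrated with a small toy model. This is only a sketch under simplifying assumptions and is not the Hadoop code or the actual patch: `Bzip2SplitToyModel`, `Block`, `ToyStream`, `readSplit`, `readAll` and the `advertiseAfterInit` flag are all hypothetical names, every block is assumed to be non-empty and to hold whole lines, and the record reader is reduced to "skip the first line unless the split starts at 0, then read while the reported position is <= end". Setting `advertiseAfterInit` to false mimics the behaviour reported in the issue; setting it to true is one possible corrected behaviour, not necessarily what the PR does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class Bzip2SplitToyModel {

    /** A compressed block: its start offset and the (non-empty) lines it holds. */
    static final class Block {
        final long start;
        final List<String> lines;

        Block(long start, String... lines) {
            this.start = start;
            this.lines = Arrays.asList(lines);
        }
    }

    /** Toy stand-in for a BYBLOCK stream; NOT the real BZip2CompressionInputStream. */
    static final class ToyStream {
        private final List<Block> blocks;
        private int block;          // index of the current block
        private int line;           // index of the next line inside that block
        private boolean advertise;  // update pos on the next first-of-block read?
        long pos;                   // position reported to the record reader

        ToyStream(List<Block> blocks, long splitStart, boolean advertiseAfterInit) {
            this.blocks = blocks;
            // Skip forward to the first block starting at or after splitStart.
            while (block < blocks.size() && blocks.get(block).start < splitStart) {
                block++;
            }
            pos = block < blocks.size() ? blocks.get(block).start : Long.MAX_VALUE;
            advertise = advertiseAfterInit;
        }

        /** Returns the next line, or null when the data is exhausted. */
        String readLine() {
            if (block >= blocks.size()) {
                return null;
            }
            Block b = blocks.get(block);
            if (line == 0 && advertise) {
                pos = b.start + 1;  // first byte past a block marker was read
                advertise = false;
            }
            String result = b.lines.get(line++);
            if (line == b.lines.size()) {
                // End of block reached: the next read advertises a new position.
                block++;
                line = 0;
                advertise = true;
            }
            return result;
        }
    }

    /** Simplified record-reader loop for the split [start, end). */
    static List<String> readSplit(List<Block> blocks, long start, long end,
                                  boolean advertiseAfterInit) {
        ToyStream in = new ToyStream(blocks, start, advertiseAfterInit);
        if (start != 0) {
            in.readLine();  // the first line is left to the previous split
        }
        List<String> out = new ArrayList<>();
        while (in.pos <= end) {
            String l = in.readLine();
            if (l == null) {
                break;
            }
            out.add(l);
        }
        return out;
    }

    /** Reads consecutive splits [bounds[i], bounds[i+1]) and concatenates them. */
    static List<String> readAll(List<Block> blocks, long[] bounds,
                                boolean advertiseAfterInit) {
        List<String> all = new ArrayList<>();
        for (int i = 0; i + 1 < bounds.length; i++) {
            all.addAll(readSplit(blocks, bounds[i], bounds[i + 1], advertiseAfterInit));
        }
        return all;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
            new Block(0, "a1", "a2"),
            new Block(100, "b1", "b2", "b3"),
            new Block(200, "c1", "c2", "c3"));
        long[] bounds = {0, 50, 100, 300};  // the split [50, 100) contains no block start
        System.out.println("position stuck after init: " + readAll(blocks, bounds, false));
        System.out.println("position updated:          " + readAll(blocks, bounds, true));
    }
}
```

In this toy setup the split [50, 100) should be empty. With the position stuck after initialization it returns b2, b3 and c1, which the split [100, 300) returns as well, so the job sees 11 records instead of 8. With splits [0, 100) and [100, 300) the model reads 8 records either way, matching the remark that the bug is not triggered when a split's range includes the start of at least one block.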