[
https://issues.apache.org/jira/browse/HADOOP-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826352#comment-13826352
]
Chris Douglas commented on HADOOP-9622:
---------------------------------------
bq. I'm tempted to handle this as a separate JIRA since I believe this will be
an issue only with uncompressed inputs after this patch.
Yeah, that makes sense. Particularly since this issue covers the codec and the
custom delimiter bug is in the text processing. Thanks for looking into it.
bq. With this patch I think we have this case covered for compressed input due
to the needAdditionalRecordAfterSplit logic.
I... think that's true. We can think about it in the followup.
> bzip2 codec can drop records when reading data in splits
> --------------------------------------------------------
>
> Key: HADOOP-9622
> URL: https://issues.apache.org/jira/browse/HADOOP-9622
> Project: Hadoop Common
> Issue Type: Bug
> Components: io
> Affects Versions: 2.0.4-alpha, 0.23.8
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Attachments: HADOOP-9622-2.patch, HADOOP-9622-testcase.patch,
> HADOOP-9622.patch, blockEndingInCR.txt.bz2, blockEndingInCRThenLF.txt.bz2
>
>
> Bzip2Codec.BZip2CompressionInputStream can cause records to be dropped when
> reading them in splits based on where record delimiters occur relative to
> compression block boundaries.
> Thanks to [~knoguchi] for discovering this problem while working on PIG-3251.
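
For readers unfamiliar with the split convention the description relies on,
here is a minimal sketch of how line records are usually assigned to byte-range
splits of uncompressed text. It is a simplified model, not Hadoop code:
read_split and split_ranges are hypothetical names, and bzip2 blocks are not
modelled at all. It only shows the ownership rule (every split but the first
discards its first line; every split reads any line that starts at or before
its end) that a splittable codec has to preserve for no record to be dropped.

```python
def split_ranges(length, size):
    """Cut [0, length) into consecutive (start, end) byte ranges of `size`."""
    return [(s, min(s + size, length)) for s in range(0, length, size)]


def read_split(data, start, end):
    """Return the newline-delimited records owned by the split [start, end)."""
    pos = start
    if start != 0:
        # Not the first split: discard through the first newline at or after
        # `start`. The previous split is responsible for that line.
        nl = data.find(b"\n", pos)
        pos = len(data) if nl == -1 else nl + 1
    records = []
    # A record that begins at or before `end` belongs to this split, even if
    # its bytes run past `end`.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        records.append(data[pos:stop])
        pos = stop + 1
    return records


if __name__ == "__main__":
    data = b"alpha\nbravo\ncharlie\ndelta\n"
    for start, end in split_ranges(len(data), 7):
        print((start, end), read_split(data, start, end))
```

Whatever the split size, concatenating the per-split results should give each
record exactly once. The bug described above is a case where that invariant
fails for compressed input.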
--
This message was sent by Atlassian JIRA
(v6.1#6144)