[ https://issues.apache.org/jira/browse/PIG-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Lowe updated PIG-3352:
----------------------------
Attachment: blockEndingInRecordWithCR.txt.bz2
Discovered this while investigating HADOOP-9622. Attaching a modified version
of blockEndingWithCR.txt.bz2, the file used in
TestBZip.testBlockHeaderEndingWithCR, in which I simply moved the
carriage-return character one word over. The number of records is therefore
the same, but when the file is processed with splits a record is duplicated.
This can also be seen with a simple Pig script that loads and dumps the file:
run the script with and without mapred.max.split.size=136498 and compare the
two outputs.
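One way to make that comparison mechanical is to count records in each output and report the ones that occur more often in the split run. The following is only a rough sketch: it assumes each dump output was saved to a text file with one record per line, and the helper names (read_records, duplicated_records) and the sample records are made up for illustration.

```python
from collections import Counter


def read_records(path):
    """Read one record per line from a saved dump output (hypothetical file)."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def duplicated_records(baseline, with_splits):
    """Return records that occur more often in the split run than in the baseline.

    Counter subtraction keeps only positive differences, so records that
    appear equally often in both runs are dropped from the result.
    """
    extra = Counter(with_splits) - Counter(baseline)
    return dict(extra)


if __name__ == "__main__":
    # Made-up records standing in for the two dump outputs.
    baseline = ["(alpha)", "(beta)", "(gamma)"]
    with_splits = ["(alpha)", "(beta)", "(beta)", "(gamma)"]
    print(duplicated_records(baseline, with_splits))  # {'(beta)': 1}
```

Comparing multisets rather than diffing the files line by line avoids false alarms if the two runs emit records in a different order.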
I would expect a bzip2-compressed file full of carriage-return-delimited
records to exhibit this as well, since it's likely a block boundary straddles a
record ending with just a carriage-return character.
> Bzip2TextInputFormat can duplicate records across splits
> --------------------------------------------------------
>
> Key: PIG-3352
> URL: https://issues.apache.org/jira/browse/PIG-3352
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.10.1
> Reporter: Jason Lowe
> Attachments: blockEndingInRecordWithCR.txt.bz2
>
>
> If a bz2 block boundary occurs in the middle of a record that is terminated
> by a carriage return, then the next record will be duplicated. The compressed
> stream position is updated at the same time that a carriage-return character
> is seen without a subsequent line-feed character. Because of the way position
> within the compressed stream is reported, the reader incorrectly believes it
> has read only the carriage-return character into the next compression block.
> It therefore processes the next record, which will also be processed by the
> consumer of the next split.
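The carriage-return case is special because a reader that accepts CR, LF, and CRLF terminators has to look one byte past a CR before it knows whether the record has ended. The sketch below is a generic illustration of that lookahead, not Pig's reader: read_record is a hypothetical helper, and how Bzip2TextInputFormat maps such reads onto compressed-stream positions is not modeled here. It returns the record, the number of bytes that belong to the record plus its terminator, and the number of bytes actually pulled from the stream.

```python
import io


def read_record(stream):
    """Read one record terminated by CR, LF, or CRLF from a seekable stream.

    Returns (record, logical_len, touched):
      logical_len - bytes belonging to the record plus its terminator
      touched     - bytes pulled from the stream, including any lookahead
    """
    record = bytearray()
    touched = 0
    while True:
        b = stream.read(1)
        if not b:  # end of stream with no terminator
            return bytes(record), len(record), touched
        touched += 1
        if b == b"\n":
            return bytes(record), len(record) + 1, touched
        if b == b"\r":
            nxt = stream.read(1)  # lookahead: is this CR or CRLF?
            if not nxt:
                return bytes(record), len(record) + 1, touched
            touched += 1
            if nxt == b"\n":
                return bytes(record), len(record) + 2, touched
            # Lone CR: the lookahead byte belongs to the next record, so un-read it.
            stream.seek(-1, io.SEEK_CUR)
            return bytes(record), len(record) + 1, touched
        record += b


if __name__ == "__main__":
    print(read_record(io.BytesIO(b"ab\rcd")))    # (b'ab', 3, 4)
    print(read_record(io.BytesIO(b"ab\ncd")))    # (b'ab', 3, 3)
    print(read_record(io.BytesIO(b"ab\r\ncd")))  # (b'ab', 4, 4)
```

Only in the lone-CR case do the two counts disagree: one extra byte has been pulled even though it belongs to the next record. If that byte sits on the far side of a block boundary, position accounting based on what was pulled and accounting based on what the record owns can land in different blocks, which appears to be the kind of mismatch the description refers to.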