[jira] [Updated] (PIG-3352) Bzip2TextInputFormat can duplicate records across splits

Jason Lowe (JIRA) Thu, 06 Jun 2013 14:20:57 -0700

     [ 
https://issues.apache.org/jira/browse/PIG-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jason Lowe updated PIG-3352:
----------------------------

    Attachment: blockEndingInRecordWithCR.txt.bz2

I discovered this while investigating HADOOP-9622.  I am attaching a modified 
version of blockEndingWithCR.txt.bz2, the file used in 
TestBZip.testBlockHeaderEndingWithCR, in which I simply moved the 
carriage-return character one word over.  The number of records is therefore 
unchanged, but when the file is processed with splits a record is duplicated.

This can also be seen with a simple Pig script that loads and dumps the file: 
run the script with and without mapred.max.split.size=136498 and compare the 
two outputs.
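
One way to compare the two dumps is to count each record in both outputs and 
report anything that appears more often in the split run.  The following is 
only a minimal sketch: the function name and the file names in the usage 
comment are hypothetical, and it assumes each dumped record sits on its own 
line.

```python
from collections import Counter


def duplicated_records(unsplit_lines, split_lines):
    """Return {record: extra_count} for records seen more often in the split run.

    Both arguments are iterables of text lines, e.g. open file objects.
    """
    # Strip trailing CR/LF so line-ending differences do not hide a match.
    baseline = Counter(line.rstrip("\r\n") for line in unsplit_lines)
    observed = Counter(line.rstrip("\r\n") for line in split_lines)
    # Counter subtraction keeps only the records with a positive surplus.
    return dict(observed - baseline)


# Usage with hypothetical file names holding the two dump outputs:
# with open("dump_nosplit.txt") as a, open("dump_split.txt") as b:
#     print(duplicated_records(a, b))
```

An empty result means the split run produced no extra copies of any record.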

I would expect a bzip2-compressed file full of carriage-return-delimited 
records to exhibit this as well, since it's likely a block boundary straddles a 
record ending with just a carriage-return character.
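
To try that expectation out, a test file of carriage-return-delimited records 
can be generated along these lines.  This is a sketch under assumptions, not a 
confirmed reproduction: the function name, record contents, and record count 
are made up for illustration.  With compresslevel=1, bzip2 uses 100 kB blocks, 
so a payload of a megabyte or more should span several blocks.

```python
import bz2
import random


def make_cr_delimited_bz2(path, num_records=50_000, seed=0):
    """Write num_records records, each terminated by a lone CR, to a bz2 file.

    Returns the list of records so the caller can check the output later.
    """
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    # Random letters and varying lengths keep the data from collapsing under
    # run-length encoding and spread record boundaries across the blocks.
    records = [
        "record-%d-" % i + "".join(rng.choices(letters, k=rng.randint(1, 40)))
        for i in range(num_records)
    ]
    payload = "\r".join(records) + "\r"
    # compresslevel=1 selects the smallest bzip2 block size (100 kB).
    with bz2.open(path, "wb", compresslevel=1) as f:
        f.write(payload.encode("ascii"))
    return records
```

The resulting file can then be loaded and dumped with and without a small 
maximum split size, as described above, to see whether any record is repeated.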
                
> Bzip2TextInputFormat can duplicate records across splits
> --------------------------------------------------------
>
>                 Key: PIG-3352
>                 URL: https://issues.apache.org/jira/browse/PIG-3352
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.10.1
>            Reporter: Jason Lowe
>         Attachments: blockEndingInRecordWithCR.txt.bz2
>
>
> If a bz2 block boundary occurs in the middle of a record that is terminated 
> by a carriage-return, then the next record will be duplicated.  The compressed 
> stream position is updated at the same time a carriage-return character is 
> seen without a subsequent line-feed character.  Because of the way position 
> within the compression stream is reported, the reader incorrectly believes it 
> has read only the carriage-return character into the next compression block, 
> so it goes on to process the next record, which will also be processed by the 
> consumer of the next split.

