[ https://issues.apache.org/jira/browse/PIG-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Laukik Chitnis updated PIG-1304:
--------------------------------
Status: Patch Available (was: Open)
Support for concatenated gzip files depends on Hadoop core and has been
available since HADOOP-6835 was fixed. For loading bz2, Pig implements a
separate InputFormat, which silently ignored the concatenated data when the
input to be loaded was a concatenated bz2 file. I am uploading a patch that
makes the bz2 stream reader raise an exception if it encounters additional
data past the CRC block.
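
For illustration only, the kind of check described above can be sketched in
Python. This is not the Java code in the patch. It is a minimal sketch using
the standard library's bz2.BZ2Decompressor, and the names read_single_stream
and ConcatenatedBz2Error are hypothetical and do not appear in Pig.

```python
import bz2


class ConcatenatedBz2Error(IOError):
    """Raised when bytes follow the end of the first bz2 stream."""


def read_single_stream(chunks):
    """Decompress exactly one bz2 stream from an iterable of byte chunks.

    Raises ConcatenatedBz2Error if any bytes remain after the end of the
    stream, instead of dropping them silently.
    """
    decomp = bz2.BZ2Decompressor()
    out = []
    for chunk in chunks:
        if decomp.eof:
            # The stream already ended on a chunk boundary, so any further
            # non-empty chunk is trailing data.
            if chunk:
                raise ConcatenatedBz2Error("data found after end of bz2 stream")
            continue
        out.append(decomp.decompress(chunk))
        # unused_data holds bytes in this chunk that lie past the stream end.
        if decomp.eof and decomp.unused_data:
            raise ConcatenatedBz2Error("data found after end of bz2 stream")
    return b"".join(out)


# Usage: same data as in the issue description, built in memory.
first = bz2.compress(b"1\ta\n2\taa\n")
second = bz2.compress(b"1\tb\n2\tbb\n")

print(read_single_stream([first]))  # records of the first file only

try:
    read_single_stream([first + second])  # equivalent of cat *.bz2
except ConcatenatedBz2Error as err:
    print("rejected:", err)
```

The sketch fails loudly rather than decoding the second stream, which mirrors
the behaviour described for the patch. Reading all concatenated streams would
be a different design.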
> Fail underlying M/R jobs when concatenated gzip and bz2 files are provided as
> input
> -----------------------------------------------------------------------------------
>
> Key: PIG-1304
> URL: https://issues.apache.org/jira/browse/PIG-1304
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.6.0
> Reporter: Viraj Bhat
> Assignee: Laukik Chitnis
> Fix For: 0.9.0
>
> Attachments: patch-PIG-1304-1
>
>
> I have the following text files, which are bzip2-compressed (\t denotes a TAB character):
> {code}
> $ bzcat A.txt.bz2
> 1\ta
> 2\taa
> $ bzcat B.txt.bz2
> 1\tb
> 2\tbb
> $ cat *.bz2 > test/mymerge.bz2
> $ bzcat test/mymerge.bz2
> 1\ta
> 2\taa
> 1\tb
> 2\tbb
> $ hadoop fs -put test/mymerge.bz2 /user/viraj
> {code}
> I now write a Pig script to print the contents of the merged bz2 file.
> {code}
> A = load '/user/viraj/bzipgetmerge/mymerge.bz2' using PigStorage();
> dump A;
> {code}
> I get only the records from the first of the bz2 files that I concatenated:
> (1,a)
> (2,aa)
> My M/R jobs do not fail or emit any warning about this; they simply drop
> the records from the second file. Is there a way to emit a warning or fail
> the underlying Map job? Can it be done in the Bzip2TextInputFormat class in Pig?