Currently in Hadoop you cannot split bzip2 files:

<http://issues.apache.org/jira/browse/HADOOP-4012>

However, gzip files are supported as input, although each file is read whole by a single map task rather than split:

<http://issues.apache.org/jira/browse/HADOOP-437>
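
Separately from the splitting question, one way to confirm how much data the job should have seen is to count lines and uncompressed bytes on a local copy of the .bz2 file, outside Hadoop. Below is a minimal sketch in Python using only the standard library. The function name count_lines_and_bytes is my own, and it assumes the file has already been copied out of HDFS to the local disk. It only verifies the archive; it says nothing about what Hadoop's bzip2 codec does internally.

```python
import bz2


def count_lines_and_bytes(path, chunk_size=1 << 20):
    """Stream-decompress a .bz2 file and return (line_count, uncompressed_bytes).

    Hypothetical helper for a local sanity check, not part of Hadoop.
    """
    lines = 0
    total = 0
    # bz2.open reads across every compressed block in the file,
    # so the totals cover the whole archive, not just the first block.
    with bz2.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
            # Count newline bytes, which is what `wc -l` reports.
            lines += chunk.count(b"\n")
    return lines, total
```

If the archive is intact, calling this on your file should give 6884024 lines and 660841894 bytes, matching your gunzip | wc -l result and the MAP_INPUT_BYTES of the .gz job. A job that reports only 900000 input bytes has read exactly one block of a file compressed with bzip2 -9, since bzip2 block sizes are multiples of 100,000 bytes of uncompressed data, up to 900,000.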

Hope this helps.

Alex

On Thu, Dec 4, 2008 at 9:11 AM, Andy Sautins <[EMAIL PROTECTED]> wrote:

>
>
>    I'm seeing some strange behavior with bzip2 files and release
> 0.19.0.  I'm wondering if anyone can shed some light on what I'm seeing.
> Basically it _looks_ like the processing of a particular bzip2 input
> file is stopping after the first bzip2 block.  Below is a comparison of
> tests  between a .gz file which seems to do what I expect, and the same
> file .bz2 which doesn't behave as I expect.
>
>
>
>    I have the same file stored in hadoop compressed as both bzip2 and
> gz formats.  The uncompressed file size is 660,841,894 bytes.  Comparing
> the files they both seem to be valid archives of the exact same file.
>
>
>
> /usr/local/hadoop/bin/hadoop dfs -cat
> bzip2.example/data.bz2/file.txt.bz2 | bunzip2 -c | md5sum
>
> 2c82901170f44245fb04d24ad4746e38  -
>
>
>
> /usr/local/hadoop/bin/hadoop dfs -cat bzip2.example/data.gz/file.txt.gz
> | gunzip -c | md5sum
>
> 2c82901170f44245fb04d24ad4746e38  -
>
>
>
>    Given the md5 sums match it seems like the files are the same and
> uncompress correctly.
>
>
>
>    Now when I run a simple Map/Reduce application that just counts
> lines in the file I get different results.
>
>
>
>  Expected Results:
>
>
>
>  /usr/local/hadoop/bin/hadoop dfs -cat
> bzip2.bug.example/data.gz/file.txt.gz | gunzip -c | wc -l
>
> 6884024
>
>
>
>   Gzip input file Results: 6,884,024
>
>   Bzip2 input file Results: 9,420
>
>
>
>
>
>   Looking at the task log files the MAP_INPUT_BYTES of the .gz file
> looks correct ([(MAP_INPUT_BYTES)(Map input bytes)(660,841,894)] ) and
> matches the size of the uncompressed file.  However, looking at
> MAP_INPUT_BYTES for the .bz2 file it's 900,000 ([(MAP_INPUT_BYTES)(Map
> input bytes)(900000)] ) which matches the block size of the bzip2
> compressed file.  So that makes me think for some reason that only the
> first bzip2 block of the bzip2 compressed file is being processed.
>
>
>
>    So I'm wondering if my analysis is correct and if there could be an
> issue with the processing of bzip2 input files.
>
>
>
>   Andy
>
>
