Re: Strange behavior with bzip2 input files w/release 0.19.0

John Heidemann Fri, 05 Dec 2008 13:20:15 -0800

On Thu, 04 Dec 2008 09:55:35 PST, "Alex Loddengaard" wrote: 
>Currently in Hadoop you cannot split bzip2 files:
>
><http://issues.apache.org/jira/browse/HADOOP-4012>
>
>However, gzip files can be split:
>
><http://issues.apache.org/jira/browse/HADOOP-437>
>
>Hope this helps.


Just to clarify, gzip files are only sort of split: it's one whole file
per "split", not many splits per file.  For many of our datasets we have
only a few large files, so this level of split support seriously limits
parallelism.  This limitation is (I believe) fundamental to gzip, where
the decompression state is never checkpointed, so a reader cannot start
decompressing partway through a file.

This limitation is what prompted us to add support for bzip2 and for
bzip2 splitting, although splitting support is still in progress, as
Abdul said.
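
To make the parallelism limit concrete, here is a rough sketch in Python
(not Hadoop code) that counts input splits under a simplified model: an
unsplittable file yields exactly one split, while a splittable file yields
one split per chunk. The helper name count_splits, the 64 MB split size,
and the three 10 GB files are illustrative assumptions, not figures from
this thread.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB


def count_splits(file_sizes, split_size, splittable):
    """Count input splits (roughly, map tasks) under a simplified model.

    Hypothetical helper for illustration; real InputFormat split logic
    handles more cases than this.
    """
    if not splittable:
        # One split per file, however large the file is.
        return len(file_sizes)
    # Splittable input: one split per split_size chunk, at least one per file.
    return sum(max(1, math.ceil(size / split_size)) for size in file_sizes)


if __name__ == "__main__":
    files = [10 * GB] * 3   # assumed: three 10 GB input files
    split_size = 64 * MB    # assumed split size
    print(count_splits(files, split_size, splittable=False))  # 3
    print(count_splits(files, split_size, splittable=True))   # 480
```

Under these assumptions the unsplittable case can keep at most three map
tasks busy, regardless of how many nodes the cluster has.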

   -John Heidemann

>
>Alex
>
>On Thu, Dec 4, 2008 at 9:11 AM, Andy Sautins <[EMAIL PROTECTED]> wrote:
>
>>
>>
>>    I'm seeing some strange behavior with bzip2 files and release
>> 0.19.0.  I'm wondering if anyone can shed some light on what I'm seeing.
>> Basically it _looks_ like the processing of a particular bzip2 input
>> file is stopping after the first bzip2 block.  Below is a comparison of
>> tests between a .gz file, which seems to do what I expect, and the same
>> file as .bz2, which doesn't behave as I expect.
>>
>>
>>
>>    I have the same file stored in hadoop compressed as both bzip2 and
>> gz formats.  The uncompressed file size is 660,841,894 bytes.  Comparing
>> the files, they both seem to be valid archives of the exact same file.
>>
>>
>>
>> /usr/local/hadoop/bin/hadoop dfs -cat bzip2.example/data.bz2/file.txt.bz2 | bunzip2 -c | md5sum
>>
>> 2c82901170f44245fb04d24ad4746e38  -
>>
>>
>>
>> /usr/local/hadoop/bin/hadoop dfs -cat bzip2.example/data.gz/file.txt.gz | gunzip -c | md5sum
>>
>> 2c82901170f44245fb04d24ad4746e38  -
>>
>>
>>
>>    Given that the md5 sums match, it seems like the files are the same
>> and uncompress correctly.
>>
>>
>>
>>    Now when I run a simple Map/Reduce application that just counts
>> lines in the file I get different results.
>>
>>
>>
>>  Expected Results:
>>
>>
>>
>>  /usr/local/hadoop/bin/hadoop dfs -cat bzip2.bug.example/data.gz/file.txt.gz | gunzip -c | wc -l
>>
>> 6884024
>>
>>
>>
>>   Gzip input file Results: 6,884,024
>>
>>   Bzip2 input file Results: 9,420
>>
>>
>>
>>
>>
>>   Looking at the task log files the MAP_INPUT_BYTES of the .gz file
>> looks correct ([(MAP_INPUT_BYTES)(Map input bytes)(660,841,894)] ) and
>> matches the size of the uncompressed file.  However, looking at
>> MAP_INPUT_BYTES for the .bz2 file it's 900,000 ([(MAP_INPUT_BYTES)(Map
>> input bytes)(900000)] ) which matches the block size of the bzip2
>> compressed file.  So that makes me think for some reason that only the
>> first bzip2 block of the bzip2 compressed file is being processed.
>>
>>
>>
>>    So I'm wondering if my analysis is correct and if there could be an
>> issue with the processing of bzip2 input files.
>>
>>
>>
>>   Andy
>>
>>
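
A local cross-check outside Hadoop can help narrow down this kind of
discrepancy. The sketch below, in Python, streams a .gz or .bz2 file
through the standard gzip and bz2 modules and counts newlines, mirroring
the "gunzip -c | wc -l" check quoted above. The helper names are
hypothetical. It also computes bzip2's nominal block size (compression
level times 100,000 bytes), which at level 9 gives the 900,000 figure
reported in MAP_INPUT_BYTES. Whether the reader really stopped after one
block remains a hypothesis that this sketch does not confirm.

```python
import bz2
import gzip


def open_compressed(path):
    """Open a .gz or .bz2 file for binary reading, chosen by extension."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rb")
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    raise ValueError("unsupported extension: " + path)


def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes in the decompressed stream, like `wc -l`."""
    total = 0
    with open_compressed(path) as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


def nominal_bzip2_block_bytes(level=9):
    """Nominal uncompressed block size for a bzip2 compression level (1-9)."""
    if not 1 <= level <= 9:
        raise ValueError("bzip2 level must be between 1 and 9")
    return level * 100_000
```

Calling count_lines on local copies of the .gz and .bz2 files should give
the same number if both archives are intact. A Map/Reduce count that then
differs for only one of them points at the input reader rather than at
the data.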
