Currently in Hadoop you cannot split bzip2 files: <http://issues.apache.org/jira/browse/HADOOP-4012>
However, gzip files can be split: <http://issues.apache.org/jira/browse/HADOOP-437>

Hope this helps.

Alex

On Thu, Dec 4, 2008 at 9:11 AM, Andy Sautins <[EMAIL PROTECTED]> wrote:

> I'm seeing some strange behavior with bzip2 files and release
> 0.19.0. I'm wondering if anyone can shed some light on what I'm seeing.
> Basically it _looks_ like the processing of a particular bzip2 input
> file is stopping after the first bzip2 block. Below is a comparison of
> tests between a .gz file which seems to do what I expect, and the same
> file .bz2 which doesn't behave as I expect.
>
> I have the same file stored in hadoop compressed as both bzip2 and
> gz formats. The uncompressed file size is 660,841,894 bytes. Comparing
> the files they both seem to be valid archives of the exact same file.
>
> /usr/local/hadoop/bin/hadoop dfs -cat bzip2.example/data.bz2/file.txt.bz2 | bunzip2 -c | md5sum
> 2c82901170f44245fb04d24ad4746e38 -
>
> /usr/local/hadoop/bin/hadoop dfs -cat bzip2.example/data.gz/file.txt.gz | gunzip -c | md5sum
> 2c82901170f44245fb04d24ad4746e38 -
>
> Given the md5 sums match it seems like the files are the same and
> uncompress correctly.
>
> Now when I run a simple Map/Reduce application that just counts
> lines in the file I get different results.
>
> Expected Results:
>
> /usr/local/hadoop/bin/hadoop dfs -cat bzip2.bug.example/data.gz/file.txt.gz | gunzip -c | wc -l
> 6884024
>
> Gzip input file Results: 6,884,024
> Bzip2 input file Results: 9,420
>
> Looking at the task log files the MAP_INPUT_BYTES of the .gz file
> looks correct ([(MAP_INPUT_BYTES)(Map input bytes)(660,841,894)] ) and
> matches the size of the uncompressed file. However, looking at
> MAP_INPUT_BYTES for the .bz2 file it's 900,000 ([(MAP_INPUT_BYTES)(Map
> input bytes)(900000)] ) which matches the block size of the bzip2
> compressed file. So that makes me think for some reason that only the
> first bzip2 block of the bzip2 compressed file is being processed.
>
> So I'm wondering if my analysis is correct and if there could be an
> issue with the processing of bzip2 input files.
>
> Andy
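
For anyone who wants to repeat the sanity check from the quoted message without a cluster, here is a minimal local sketch. It only confirms that a gzip copy and a bzip2 copy of the same text decompress to the same checksum and line count. It does not touch HDFS or run a Map/Reduce job. It assumes gzip, bzip2, md5sum and seq are installed; the sample file and the scratch directory are made up for illustration. The sample is a bit over 900,000 bytes uncompressed, so at bzip2's default block size it should span more than one bzip2 block.

```shell
#!/bin/sh
# Local sketch only: compare a gzip and a bzip2 copy of the same text file.
# The scratch directory and sample data are hypothetical, not from the thread.

workdir=$(mktemp -d)

# Sample input: one number per line, roughly 1.3 MB uncompressed.
seq 1 200000 > "$workdir/file.txt"

gzip  -c "$workdir/file.txt" > "$workdir/file.txt.gz"
bzip2 -c "$workdir/file.txt" > "$workdir/file.txt.bz2"

# Checksums of the decompressed streams (same idea as the md5sum pipelines above).
gz_sum=$(gunzip  -c "$workdir/file.txt.gz"  | md5sum | cut -d' ' -f1)
bz_sum=$(bunzip2 -c "$workdir/file.txt.bz2" | md5sum | cut -d' ' -f1)

# Line counts of the decompressed streams (strip padding some wc builds add).
gz_lines=$(gunzip  -c "$workdir/file.txt.gz"  | wc -l | tr -d ' ')
bz_lines=$(bunzip2 -c "$workdir/file.txt.bz2" | wc -l | tr -d ' ')

echo "gzip:  md5=$gz_sum lines=$gz_lines"
echo "bzip2: md5=$bz_sum lines=$bz_lines"

rm -rf "$workdir"
```

If both lines print the same checksum and count, the archives themselves are fine, and any difference seen in the job output would come from how the input is read rather than from the files.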
