[ 
https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903063#action_12903063
 ] 

Wouter de Bie commented on HADOOP-6852:
---------------------------------------

I'm having the same problem. We write files from our logging application using 
the BZip2Codec, but our application rotates files on an hourly basis. Next to 
that, when we flush log lines, we compress, which in turn causes different 
bzip2 block sizes.
Anyway, we're also getting the 'bad block header' exception.
Looking through some code and trying different things, we get things to work 
when we change the following in 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream:

{noformat}
  public CBZip2InputStream(final InputStream in) throws IOException {
    this(in, READ_MODE.CONTINUOUS);
  }
{noformat}

to

{noformat}
  public CBZip2InputStream(final InputStream in) throws IOException {
    this(in, READ_MODE.BYBLOCK);
  }
{noformat}

Which causes CBZip2InputStream to use the block size in initBlock(), instead of 
looking for the magic numbers.

I'll be digging into some code more tomorrow, but to quickly move forward on 
this issue, I would like to know why is CONTINUOUS the default mode and is 
there a place where the read mode is determined? BZip2 is block based, so why 
not default to that? 

I'll also try the above piece of code and see if that works with BYBLOCK.



> apparent bug in concatenated-bzip2 support (decoding)
> -----------------------------------------------------
>
>                 Key: HADOOP-6852
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6852
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.22.0
>         Environment: Linux x86_64 running 32-bit Hadoop, JDK 1.6.0_15
>            Reporter: Greg Roelofs
>
> The following simplified code (manually picked out of testMoreBzip2() in 
> https://issues.apache.org/jira/secure/attachment/12448272/HADOOP-6835.v4.trunk-hadoop-mapreduce.patch)
>  triggers a "java.io.IOException: bad block header" in 
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock( 
> CBZip2InputStream.java:527):
> {noformat}
>     JobConf jobConf = new JobConf(defaultConf);
>     CompressionCodec bzip2 = new BZip2Codec();
>     ReflectionUtils.setConf(bzip2, jobConf);
>     localFs.delete(workDir, true);
>     // copy multiple-member test file to HDFS
>     String fn2 = "testCompressThenConcat.txt" + bzip2.getDefaultExtension();
>     Path fnLocal2 = new 
> Path(System.getProperty("test.concat.data","/tmp"),fn2);
>     Path fnHDFS2  = new Path(workDir, fn2);
>     localFs.copyFromLocalFile(fnLocal2, fnHDFS2);
>     FileInputFormat.setInputPaths(jobConf, workDir);
>     final FileInputStream in2 = new FileInputStream(fnLocal2.toString());
>     CompressionInputStream cin2 = bzip2.createInputStream(in2);
>     LineReader in = new LineReader(cin2);
>     Text out = new Text();
>     int numBytes, totalBytes=0, lineNum=0;
>     while ((numBytes = in.readLine(out)) > 0) {
>       ++lineNum;
>       totalBytes += numBytes;
>     }
>     in.close();
> {noformat}
> The specified file is also included in the H-6835 patch linked above, and 
> some additional debug output is included in the commented-out test loop 
> above.  (Only in the linked, "v4" version of the patch, however--I'm about to 
> remove the debug stuff for checkin.)
> It's possible I've done something completely boneheaded here, but the file, 
> at least, checks out in a subsequent set of subtests and with stock bzip2 
> itself.  Only the code above is problematic; it reads through the first 
> concatenated chunk (17 lines of text) just fine but chokes on the header of 
> the second one.  Altogether, the test file contains 84 lines of text and 4 
> concatenated bzip2 files.
> (It's possible this is a mapreduce issue rather than common, but note that 
> the identical gzip test works fine.  Possibly it's related to the 
> stream-vs-decompressor dichotomy, though; intentionally not supported?)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to