[
https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371917#comment-16371917
]
Zsolt Venczel commented on HADOOP-6852:
---------------------------------------
Thanks for checking, [~mackrorysd]!
The binary files were first added to git by commit
a196766ea07775f18ded69bd9e8d239f8cfd3ccc, as part of the restructuring described in
HADOOP-7106. I assume they were present in the MapReduce SVN repository prior to that.
> apparent bug in concatenated-bzip2 support (decoding)
> -----------------------------------------------------
>
> Key: HADOOP-6852
> URL: https://issues.apache.org/jira/browse/HADOOP-6852
> Project: Hadoop Common
> Issue Type: Bug
> Components: io
> Affects Versions: 0.22.0
> Environment: Linux x86_64 running 32-bit Hadoop, JDK 1.6.0_15
> Reporter: Greg Roelofs
> Assignee: Zsolt Venczel
> Priority: Major
> Attachments: HADOOP-6852.01.patch, HADOOP-6852.02.patch,
> HADOOP-6852.03.patch, HADOOP-6852.04.patch
>
>
> The following simplified code (manually picked out of testMoreBzip2() in
> https://issues.apache.org/jira/secure/attachment/12448272/HADOOP-6835.v4.trunk-hadoop-mapreduce.patch)
> triggers a "java.io.IOException: bad block header" in
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(
> CBZip2InputStream.java:527):
> {noformat}
> JobConf jobConf = new JobConf(defaultConf);
> CompressionCodec bzip2 = new BZip2Codec();
> ReflectionUtils.setConf(bzip2, jobConf);
> localFs.delete(workDir, true);
>
> // copy multiple-member test file to HDFS
> String fn2 = "testCompressThenConcat.txt" + bzip2.getDefaultExtension();
> Path fnLocal2 = new Path(System.getProperty("test.concat.data", "/tmp"), fn2);
> Path fnHDFS2 = new Path(workDir, fn2);
> localFs.copyFromLocalFile(fnLocal2, fnHDFS2);
> FileInputFormat.setInputPaths(jobConf, workDir);
>
> final FileInputStream in2 = new FileInputStream(fnLocal2.toString());
> CompressionInputStream cin2 = bzip2.createInputStream(in2);
> LineReader in = new LineReader(cin2);
> Text out = new Text();
>
> int numBytes, totalBytes = 0, lineNum = 0;
> while ((numBytes = in.readLine(out)) > 0) {
>   ++lineNum;
>   totalBytes += numBytes;
> }
> in.close();
> in.close();
> {noformat}
> The specified file is also included in the HADOOP-6835 patch linked above, and
> some additional debug output is included in the commented-out test loop
> above. (Only in the linked "v4" version of the patch, however; I'm about to
> remove the debug code for check-in.)
> It's possible I've done something completely boneheaded here, but the file,
> at least, checks out in a subsequent set of subtests and with stock bzip2
> itself. Only the code above is problematic; it reads through the first
> concatenated chunk (17 lines of text) just fine but chokes on the header of
> the second one. Altogether, the test file contains 84 lines of text and 4
> concatenated bzip2 files.
> (It's possible this is a MapReduce issue rather than Common, but note that
> the identical gzip test works fine. It may be related to the
> stream-vs-decompressor dichotomy, though; is this intentionally not supported?)
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]