[ https://issues.apache.org/jira/browse/HADOOP-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zsolt Venczel updated HADOOP-6852:
----------------------------------
    Attachment: HADOOP-6852.01.patch

> apparent bug in concatenated-bzip2 support (decoding)
> -----------------------------------------------------
>
>                 Key: HADOOP-6852
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6852
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.22.0
>         Environment: Linux x86_64 running 32-bit Hadoop, JDK 1.6.0_15
>            Reporter: Greg Roelofs
>            Assignee: Zsolt Venczel
>            Priority: Major
>         Attachments: HADOOP-6852.01.patch
>
>
> The following simplified code (manually picked out of testMoreBzip2() in
> https://issues.apache.org/jira/secure/attachment/12448272/HADOOP-6835.v4.trunk-hadoop-mapreduce.patch)
> triggers a "java.io.IOException: bad block header" in
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(
> CBZip2InputStream.java:527):
> {noformat}
>     JobConf jobConf = new JobConf(defaultConf);
>     CompressionCodec bzip2 = new BZip2Codec();
>     ReflectionUtils.setConf(bzip2, jobConf);
>     localFs.delete(workDir, true);
>     // copy multiple-member test file to HDFS
>     String fn2 = "testCompressThenConcat.txt" + bzip2.getDefaultExtension();
>     Path fnLocal2 = new Path(System.getProperty("test.concat.data","/tmp"),fn2);
>     Path fnHDFS2 = new Path(workDir, fn2);
>     localFs.copyFromLocalFile(fnLocal2, fnHDFS2);
>     FileInputFormat.setInputPaths(jobConf, workDir);
>     final FileInputStream in2 = new FileInputStream(fnLocal2.toString());
>     CompressionInputStream cin2 = bzip2.createInputStream(in2);
>     LineReader in = new LineReader(cin2);
>     Text out = new Text();
>     int numBytes, totalBytes=0, lineNum=0;
>     while ((numBytes = in.readLine(out)) > 0) {
>       ++lineNum;
>       totalBytes += numBytes;
>     }
>     in.close();
> {noformat}
> The specified file is also included in the H-6835 patch linked above, and
> some additional debug output is included in the commented-out test loop
> above.
> (Only in the linked, "v4" version of the patch, however--I'm about to
> remove the debug stuff for checkin.)
>
> It's possible I've done something completely boneheaded here, but the file,
> at least, checks out in a subsequent set of subtests and with stock bzip2
> itself. Only the code above is problematic; it reads through the first
> concatenated chunk (17 lines of text) just fine but chokes on the header of
> the second one. Altogether, the test file contains 84 lines of text and 4
> concatenated bzip2 files.
>
> (It's possible this is a mapreduce issue rather than common, but note that
> the identical gzip test works fine. Possibly it's related to the
> stream-vs-decompressor dichotomy, though; intentionally not supported?)
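
For readers who want to see the expected multi-member behavior in isolation, here is a minimal sketch. It is not the Hadoop test and does not use BZip2Codec: the JDK has no bzip2 support, so it substitutes gzip via java.util.zip, which is only an analogy to the "identical gzip test" mentioned in the description. The class and method names (ConcatenatedGzipSketch, gzipMember, concatenate, countLines) are hypothetical. It compresses several chunks independently, appends them into one stream, and counts lines across all members, which is what the bzip2 reader in the report would need to do past the first chunk.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration using JDK gzip, not Hadoop's bzip2 codec.
public class ConcatenatedGzipSketch {

    /** Compresses the given text into one complete, self-contained gzip member. */
    public static byte[] gzipMember(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    /** Builds a multi-member stream by appending independently compressed chunks. */
    public static byte[] concatenate(String... chunks) throws IOException {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        for (String chunk : chunks) {
            all.write(gzipMember(chunk));
        }
        return all.toByteArray();
    }

    /** Decompresses the whole stream and counts the lines found in all members. */
    public static int countLines(byte[] multiMember) throws IOException {
        int lines = 0;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(multiMember)),
                StandardCharsets.UTF_8))) {
            while (reader.readLine() != null) {
                ++lines;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Four members, like the four concatenated files in the report.
        byte[] data = concatenate("a\nb\n", "c\n", "d\ne\n", "f\n");
        System.out.println("lines read across all members: " + countLines(data));
    }
}
```

A decoder that stopped, or threw, at the second member's header would report only the first chunk's lines here, which is the symptom described above for the bzip2 case.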