[ https://issues.apache.org/jira/browse/HADOOP-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Douglas updated HADOOP-4012:
----------------------------------

    Fix Version/s:     (was: 0.20.0)
           Status: Open  (was: Patch Available)

{quote}
The following change was done in this new patch. Before this change, getPos() was returning values one less than what it should be. Similarly available() method was returning -1 because the value of count becomes -1 at the end of the chunk.
{quote}

Should this change be part of a separate issue, then? I'm not sure what you mean by "two of the 4 bugs", but bug fixes shouldn't be part of large, new features if the fix is unaffected by the feature.

* This modifies TestMultipleCacheFiles to append a newline at the end of the file. Why is this necessary? Is this the same problem as HADOOP-4182?
* Pushing the READ_MODE abstraction (and the new createInputStream) into the CompressionCodec interface, particularly when only bzip supports it, is inappropriate. If it's applicable to codecs other than bzip, it should be a separate interface (extending CompressionCodec?). This would also let instanceof replace canDecompressSplitInput and move seekBackwards to the new interface. Can you describe what it means for a codec to implement this superset of functions?
* This patch incorporates HADOOP-4010:
{noformat}
-    while (pos < end) {
+    // We always read one extra line, which lies outside the upper
+    // split limit i.e. (end - 1)
+    pos = this.getPos();
+
+    while (pos <= end) {
{noformat}
{noformat}
+    // If this is not the first split, we always throw away first record
+    // because we always (except the last split) read one extra line in
+    // next() method.
{noformat}
Shouldn't this remain with the original JIRA? Are the issues raised there addressed in this patch?
* Does this add the Seekable interface to CompressionInputStream only to support getPos() for LineRecordReader?

This affects too many core components to make the feature freeze for 0.20 (Fri).

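To make the separate-interface suggestion in the comment above concrete, here is a minimal sketch of how it might look. It is not taken from the patch: {{Codec}} is a one-method stand-in for Hadoop's CompressionCodec, and {{SplittableCodec}}, {{WholeFileCodec}}, {{BlockCodec}}, {{isSplittable}} and {{countSplits}} are hypothetical names invented for illustration. The "codecs" copy bytes unchanged so the example stays self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class SplittableCodecSketch {

    /** Simplified stand-in for Hadoop's CompressionCodec (assumption: one method only). */
    interface Codec {
        InputStream createInputStream(InputStream in) throws IOException;
    }

    /** Hypothetical sub-interface for codecs that can decode a byte range of a file. */
    interface SplittableCodec extends Codec {
        InputStream createInputStream(InputStream in, long start, long end) throws IOException;
    }

    /** A codec that needs the whole stream, so it implements only the base interface. */
    static final class WholeFileCodec implements Codec {
        @Override
        public InputStream createInputStream(InputStream in) {
            return in; // identity "decompression", for illustration only
        }
    }

    /** A block-oriented codec (bzip2-like in spirit) that also accepts a byte range. */
    static final class BlockCodec implements SplittableCodec {
        @Override
        public InputStream createInputStream(InputStream in) {
            return in;
        }

        @Override
        public InputStream createInputStream(InputStream in, long start, long end)
                throws IOException {
            // Skip to the start of the range; skip() may skip fewer bytes than asked.
            long toSkip = start;
            while (toSkip > 0) {
                long skipped = in.skip(toSkip);
                if (skipped <= 0) {
                    break;
                }
                toSkip -= skipped;
            }
            // Copy at most (end - start) bytes into a buffer and expose them as a stream.
            byte[] buf = new byte[(int) (end - start)];
            int filled = 0;
            while (filled < buf.length) {
                int n = in.read(buf, filled, buf.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
            return new ByteArrayInputStream(buf, 0, filled);
        }
    }

    /** instanceof takes the place of a canDecompressSplitInput-style flag. */
    static boolean isSplittable(Codec codec) {
        return codec instanceof SplittableCodec;
    }

    /** One split per file unless the codec is splittable (hypothetical helper). */
    static long countSplits(Codec codec, long fileLength, long splitSize) {
        if (!isSplittable(codec)) {
            return 1;
        }
        return Math.max(1, (fileLength + splitSize - 1) / splitSize);
    }

    public static void main(String[] args) throws IOException {
        Codec whole = new WholeFileCodec();
        Codec block = new BlockCodec();
        System.out.println(isSplittable(whole) + " " + countSplits(whole, 100, 30));
        System.out.println(isSplittable(block) + " " + countSplits(block, 100, 30));

        InputStream in = new ByteArrayInputStream(new byte[] {0, 1, 2, 3, 4, 5, 6, 7, 8, 9});
        InputStream part = ((SplittableCodec) block).createInputStream(in, 3, 7);
        System.out.println("bytes in range: " + part.available());
    }
}
```

With this shape, callers that only know about the base interface are untouched, and range-aware methods (a seekBackwards equivalent would presumably go here too) live only on the codecs that can honour them.
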
> Providing splitting support for bzip2 compressed files
> ------------------------------------------------------
>
>                 Key: HADOOP-4012
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4012
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Abdul Qadeer
>            Assignee: Abdul Qadeer
>         Attachments: Hadoop-4012-version1.patch, Hadoop-4012-version2.patch,
> Hadoop-4012-version3.patch, Hadoop-4012-version4.patch
>
>
> Hadoop assumes that if the input data is compressed, it can not be split
> (mainly due to the limitation of many codecs that they need the whole input
> stream to decompress successfully). So in such a case, Hadoop prepares only
> one split per compressed file, where the lower split limit is at 0 while the
> upper limit is the end of the file. The consequence of this decision is
> that, one compress file goes to a single mapper. Although it circumvents the
> limitation of codecs (as mentioned above) but reduces the parallelism
> substantially, as it was possible otherwise in case of splitting.
> BZip2 is a compression / De-Compression algorithm which does compression on
> blocks of data and later these compressed blocks can be decompressed
> independent of each other. This is indeed an opportunity that instead of one
> BZip2 compressed file going to one mapper, we can process chunks of file in
> parallel. The correctness criteria of such a processing is that for a bzip2
> compressed file, each compressed block should be processed by only one mapper
> and ultimately all the blocks of the file should be processed. (By
> processing we mean the actual utilization of that un-compressed data (coming
> out of the codecs) in a mapper).
> We are writing the code to implement this suggested functionality. Although
> we have used bzip2 as an example, but we have tried to extend Hadoop's
> compression interfaces so that any other codecs with the same capability as
> that of bzip2, could easily use the splitting support. The details of these
> changes will be posted when we submit the code.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.