[jira] Updated: (HADOOP-4012) Providing splitting support for bzip2 compressed files

Chris Douglas (JIRA) Sun, 30 Aug 2009 14:49:59 -0700

     [ 
https://issues.apache.org/jira/browse/HADOOP-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Chris Douglas updated HADOOP-4012:
----------------------------------

    Status: Open  (was: Patch Available)

This is what we were looking for. Thank you.

* {{CBZip2InputStream::skipToNextMarker}} should add IOException (for bsR) to 
its throws list, not Exception, and throw IllegalArgumentException if 
markerBitLength is larger than 63. This permits the removal of the empty catch 
clauses in calls to this method. The javadoc should include a description of 
this behavior, and note that marker is the EOB delimiter (w/ {{@param}} 
javadoc directives). The javadoc may be self-evident, but it's a public API.
* {{updateProcessedByteCount}} and {{updateReportedByteCount}} are public 
methods, but they seem very, very specialized. The comment explains that the 
client may manipulate the compressed stream, but any client sophisticated 
enough to do that is likely bound to this class. Would making these methods 
protected make sense?
* This method:
{noformat}
+  private static void reportCRCError() throws IOException {
+    throw new IOException("crc error");
+  }
{noformat}
Seems unnecessary. If the intent is to add a hook, then the method should be 
protected and non-static. Otherwise, there's no reason not to simply throw the 
IOE at its callers.
* The API issue with InputStreamCreationResultant is clearer to me, now (I 
hadn't seen why a reader might need a modified start). Other than synchronizing 
on the codec before creating new streams (to avoid the race condition), I don't 
see a better way to do this without pushing other API changes. Unless someone 
has a better idea, I think documenting this requirement on 
{{SplittableCompressionCodec}} is sufficient for now (and making these methods 
synchronized in {{BZip2Codec}}).
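
To illustrate the first point, here is a minimal sketch of the suggested signature. It is not the code from the patch: the class name {{MarkerScanner}}, the {{readBit}} helper (a stand-in for bsR), the boolean return value, and the lower bound on markerBitLength are all assumptions made for illustration. Only the throws list and the check against 63 come from the comment above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for the bit-level reader; not the real CBZip2InputStream. */
public class MarkerScanner {
    private final InputStream in;
    private int currentByte; // byte currently being consumed
    private int bitsLeft;    // unread bits remaining in currentByte

    public MarkerScanner(InputStream in) {
        this.in = in;
    }

    /** Reads one bit (most significant first), or returns -1 at end of stream. */
    private int readBit() throws IOException {
        if (bitsLeft == 0) {
            int b = in.read();
            if (b < 0) {
                return -1;
            }
            currentByte = b;
            bitsLeft = 8;
        }
        bitsLeft--;
        return (currentByte >> bitsLeft) & 1;
    }

    /**
     * Advances the stream until just past the next occurrence of marker.
     *
     * @param marker          the bit pattern to look for (e.g. the EOB delimiter)
     * @param markerBitLength number of low-order bits of marker that are significant
     * @return true if the marker was found, false if the stream ended first
     * @throws IOException              if the underlying stream fails
     * @throws IllegalArgumentException if markerBitLength is larger than 63
     *                                  (the lower bound of 1 is an added assumption)
     */
    public boolean skipToNextMarker(long marker, int markerBitLength) throws IOException {
        if (markerBitLength < 1 || markerBitLength > 63) {
            throw new IllegalArgumentException(
                "markerBitLength must be in [1, 63]: " + markerBitLength);
        }
        long mask = (1L << markerBitLength) - 1;
        long wanted = marker & mask;
        long window = 0; // the last markerBitLength bits read
        int seen = 0;    // bits read during this call
        int bit;
        while ((bit = readBit()) >= 0) {
            window = ((window << 1) | bit) & mask;
            if (++seen >= markerBitLength && window == wanted) {
                return true;
            }
        }
        return false;
    }
}
```

With a signature like this, callers only need to handle IOException, so the empty catch clauses around the call can go away.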

> Providing splitting support for bzip2 compressed files
> ------------------------------------------------------
>
>                 Key: HADOOP-4012
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4012
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>    Affects Versions: 0.21.0
>            Reporter: Abdul Qadeer
>            Assignee: Abdul Qadeer
>             Fix For: 0.21.0
>
>         Attachments: Hadoop-4012-version1.patch, Hadoop-4012-version10.patch, 
> Hadoop-4012-version11.patch, Hadoop-4012-version2.patch, 
> Hadoop-4012-version3.patch, Hadoop-4012-version4.patch, 
> Hadoop-4012-version5.patch, Hadoop-4012-version6.patch, 
> Hadoop-4012-version7.patch, Hadoop-4012-version8.patch, 
> Hadoop-4012-version9.patch
>
>
> Hadoop assumes that compressed input data cannot be split (mainly because 
> many codecs need the whole input stream to decompress successfully).  In 
> that case, Hadoop prepares only one split per compressed file, with the 
> lower split limit at 0 and the upper limit at the end of the file.  As a 
> consequence, each compressed file goes to a single mapper. This circumvents 
> the codec limitation mentioned above, but it substantially reduces the 
> parallelism that splitting would otherwise allow.
> BZip2 is a compression / decompression algorithm that compresses data in 
> blocks, and these compressed blocks can later be decompressed independently 
> of each other.  This is an opportunity: instead of one BZip2 compressed file 
> going to one mapper, chunks of the file can be processed in parallel.  The 
> correctness criterion for such processing is that, for a bzip2 compressed 
> file, each compressed block is processed by only one mapper and ultimately 
> all the blocks of the file are processed.  (By processing we mean the actual 
> use of the uncompressed data coming out of the codec in a mapper.)
> We are writing the code to implement this functionality.  Although we have 
> used bzip2 as an example, we have tried to extend Hadoop's compression 
> interfaces so that any other codec with the same capability as bzip2 can 
> easily use the splitting support.  The details of these changes will be 
> posted when we submit the code.
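
The correctness criterion in the description (each block handled by only one mapper, and every block handled) can be sketched as below. This is an illustration under assumptions, not the patch's logic: the class {{BlockAssignment}}, the fixed split size, and the rule that a block belongs to the split containing its start offset are all hypothetical. Offsets are plain numbers here, whereas real bzip2 block boundaries are bit-aligned.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of assigning compressed blocks to splits. */
public class BlockAssignment {

    /**
     * Assigns each block to the split whose range [i * splitSize, (i + 1) * splitSize)
     * contains the block's start offset. Because every offset falls in exactly one
     * such range, each block is assigned once and no block is dropped.
     *
     * @param blockStarts start offsets of the compressed blocks (assumed known)
     * @param fileLength  total length of the compressed file
     * @param splitSize   size of each split; the last split may be shorter
     * @return one list of block start offsets per split
     */
    public static List<List<Long>> assign(long[] blockStarts, long fileLength, long splitSize) {
        if (splitSize <= 0 || fileLength <= 0) {
            throw new IllegalArgumentException("splitSize and fileLength must be positive");
        }
        int numSplits = (int) ((fileLength + splitSize - 1) / splitSize);
        List<List<Long>> result = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            result.add(new ArrayList<>());
        }
        for (long start : blockStarts) {
            if (start < 0 || start >= fileLength) {
                throw new IllegalArgumentException("block start outside file: " + start);
            }
            result.get((int) (start / splitSize)).add(start);
        }
        return result;
    }
}
```

A mapper that owns a split would then read from the first block start in its range up to the first block start in the next range, even if that last block extends past the split's upper limit.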

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
