[ 
https://issues.apache.org/jira/browse/HADOOP-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750566#action_12750566
 ] 

Abdul Qadeer commented on HADOOP-4012:
--------------------------------------

{quote}
{noformat}
+    if (in.getPos() <= start) {
+      ((Seekable) seekableIn).seek(start);
+      in = this.createInputStream(seekableIn, readMode);
+    }
{noformat}

This drops the stream it created above, which wrapped the stream passed in with 
a CBZip2InputStream and BufferedInputStream. It's not clear why the stream is 
being re-created, either... particularly since the start stored in the codec is 
left alone. What case is being handled here?
{quote}

The reason for re-creating the stream when in.getPos() <= start is to handle 
cases like the following.

Assume [BBBBBB] represents a BZip2 block marker and d is a single compressed 
data element (several such markers can occur in one file, e.g. due to BZip2 
concatenation).

There is some extra information at the start of the stream, i.e. the BZh0 header.

^ indicates where the stream currently is:
{noformat}

[BZh0BBBBBB]d[BBBBBB]d[BBBBBB]d[BBBBBB]
_______________________________ ^

I go back 10 bytes in the stream before searching for a marker.  The reason
is that the first 'marker' is 10 bytes long, while all others are 6 bytes long.

So after going backwards, the stream position is as follows:

[BZh0BBBBBB]d[BBBBBB]d[BBBBBB]d[BBBBBB]
__________________ ^

Now, searching for the next marker might align us with the wrong marker, as follows:

[BZh0BBBBBB]d[BBBBBB]d[BBBBBB]d[BBBBBB]
______________________ ^
{noformat}

So the code mentioned above handles such cases.  But you rightly pointed out 
that the start stored in the codec is left alone; I should have done 
this.start = start at the end of the above code as well.


{quote}
I tried a version of this using a supertype of CompressionInputStream instead 
of the semantics tried so far (voiding the synchronization discussion). It 
doesn't incorporate the other changes discussed.
{quote}

The new version looks fine to me.  Let me incorporate the other changes you 
mentioned into it and put the new patch on the JIRA.



> Providing splitting support for bzip2 compressed files
> ------------------------------------------------------
>
>                 Key: HADOOP-4012
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4012
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>    Affects Versions: 0.21.0
>            Reporter: Abdul Qadeer
>            Assignee: Abdul Qadeer
>             Fix For: 0.21.0
>
>         Attachments: C4012-12.patch, Hadoop-4012-version1.patch, 
> Hadoop-4012-version10.patch, Hadoop-4012-version11.patch, 
> Hadoop-4012-version2.patch, Hadoop-4012-version3.patch, 
> Hadoop-4012-version4.patch, Hadoop-4012-version5.patch, 
> Hadoop-4012-version6.patch, Hadoop-4012-version7.patch, 
> Hadoop-4012-version8.patch, Hadoop-4012-version9.patch
>
>
> Hadoop assumes that if the input data is compressed, it cannot be split 
> (mainly due to the limitation of many codecs, which need the whole input 
> stream to decompress successfully).  So in such a case, Hadoop prepares only 
> one split per compressed file, where the lower split limit is 0 and the 
> upper limit is the end of the file.  The consequence of this decision is 
> that one compressed file goes to a single mapper.  Although this circumvents 
> the limitation of the codecs (as mentioned above), it substantially reduces 
> the parallelism that splitting would otherwise make possible.
> BZip2 is a compression / de-compression algorithm which compresses blocks of 
> data such that the compressed blocks can later be decompressed independently 
> of each other.  This is an opportunity: instead of one BZip2-compressed file 
> going to one mapper, we can process chunks of the file in parallel.  The 
> correctness criterion for such processing is that, for a bzip2-compressed 
> file, each compressed block should be processed by exactly one mapper, and 
> ultimately all the blocks of the file should be processed.  (By processing 
> we mean the actual utilization of the uncompressed data coming out of the 
> codec in a mapper.)
> We are writing the code to implement this suggested functionality.  Although 
> we have used bzip2 as an example, we have tried to extend Hadoop's 
> compression interfaces so that any other codec with the same capability as 
> bzip2 could easily use the splitting support.  The details of these changes 
> will be posted when we submit the code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
