[
https://issues.apache.org/jira/browse/HADOOP-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662406#action_12662406
]
Abdul Qadeer commented on HADOOP-4012:
--------------------------------------
Chris Douglas:
I understand that any change to e.g. LineRecordReader (LRR) should be carried
out with care, because it is among the most frequently used code. It would
have been possible to implement the bzip2 codecs on top of the older LRR
implementation, but at a performance penalty for bzip2 (it would then have had
to decompress one extra block per mapper), and the code in the LRR constructor
and the next() method would have been a mess.
HADOOP-4010 was created separately, before the bzip2 work was finished, but
unfortunately it hasn't gotten in yet. Owen O'Malley raised some concerns, to
which I responded with some logical hand-waving. I guess Owen has either been
busy or is not convinced by those arguments, but he hasn't said anything. I
plan to ask him again whether I should provide some test cases to make sure
that his concerns are addressed.
So, proceeding systematically, I first plan to get HADOOP-4010 approved and
then move on to making the recommended changes. I have successfully used the
bzip2 codecs on binary bzip2-compressed files, so I hope that by going step by
step I will be able to get this work approved.
> Providing splitting support for bzip2 compressed files
> ------------------------------------------------------
>
> Key: HADOOP-4012
> URL: https://issues.apache.org/jira/browse/HADOOP-4012
> Project: Hadoop Core
> Issue Type: New Feature
> Components: io
> Reporter: Abdul Qadeer
> Assignee: Abdul Qadeer
> Attachments: Hadoop-4012-version1.patch, Hadoop-4012-version2.patch,
> Hadoop-4012-version3.patch, Hadoop-4012-version4.patch
>
>
> Hadoop assumes that if the input data is compressed, it cannot be split
> (mainly due to the limitation of many codecs, which need the whole input
> stream in order to decompress successfully). In such a case, Hadoop prepares
> only one split per compressed file, with the lower split limit at 0 and the
> upper limit at the end of the file. The consequence of this decision is that
> one compressed file goes to a single mapper. Although this circumvents the
> codec limitation mentioned above, it substantially reduces the parallelism
> that splitting would otherwise make possible.
> BZip2 is a compression/decompression algorithm that compresses data in
> blocks, and these compressed blocks can later be decompressed independently
> of each other. This presents an opportunity: instead of one bzip2-compressed
> file going to a single mapper, we can process chunks of the file in
> parallel. The correctness criterion for such processing is that, for a
> bzip2-compressed file, each compressed block should be processed by exactly
> one mapper, and ultimately all the blocks of the file should be processed.
> (By processing we mean the actual use of the uncompressed data, coming out
> of the codec, in a mapper.)
> We are writing code to implement this suggested functionality. Although we
> have used bzip2 as an example, we have tried to extend Hadoop's compression
> interfaces so that any other codec with the same capability as bzip2 could
> easily use the splitting support. The details of these changes will be
> posted when we submit the code.
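As an aside, the property that makes bzip2 amenable to splitting is that every
compressed block begins with a fixed 48-bit magic number, 0x314159265359,
which a reader can scan for to find a block boundary from an arbitrary offset.
A minimal Python sketch illustrates this (illustrative only, not the Hadoop
implementation; note that only the first block is guaranteed byte-aligned,
since subsequent block magics may fall on bit boundaries and require a
bit-level scan):

```python
import bz2

# bzip2 stream layout: a 4-byte stream header ("BZh" plus a block-size
# digit "1".."9"), then each block starts with the 48-bit block magic.
BLOCK_MAGIC = bytes.fromhex("314159265359")

compressed = bz2.compress(b"The quick brown fox jumps over the lazy dog.\n")

# Stream header check.
assert compressed[:3] == b"BZh"

# The first block's magic is byte-aligned, immediately after the header.
offset = compressed.find(BLOCK_MAGIC)
print(offset)  # 4
```

Later blocks are packed at bit granularity, so a real splitting reader (as
proposed here) has to search for the magic at every bit offset after the split
start, then hand each discovered block to exactly one mapper.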
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.