[
https://issues.apache.org/jira/browse/HADOOP-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729967#action_12729967
]
Abdul Qadeer commented on HADOOP-4012:
--------------------------------------
I totally agree with Owen that LineRecordReader (LRR) is heavily used code in
Hadoop and that changes to such code should be made very carefully. But that
doesn't mean the LRR code is closed to any improvements and feature
enhancements. LRR deals with text data, and Hadoop uses it as a default,
probably because text is the most frequently used kind of input. I think the
same is true for BZip2 compressed files: most, if not all, of them are
compressed text data. This patch provides the following feature to the end user.
The user puts his BZip2 compressed text files in an input directory and submits
the Hadoop job. Soon he gets the result in the output directory. That is it!
He did not have to write or specify:
- any input format
- any record reader
And all this happens using the full CPU power of the cluster (thanks to BZip2
splitting). Also, our algorithm doesn't demand any specific kind of splits from
Hadoop; it works with whatever splits are provided to it.
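Roughly, the convention that lets a line reader work with arbitrary splits is: skip the first (possibly partial) line unless the split starts at byte 0, and keep reading lines until the position where a line starts has moved past the split end. Here is a minimal, self-contained sketch of that convention in plain Java (the class and method names are made up for illustration; this is not Hadoop's actual API):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of split-agnostic line reading: a line is emitted by
// exactly one split, no matter where the split boundaries fall.
class SplitAgnosticLineReader {

    // Returns the lines belonging to the split [start, end) of 'data'.
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        // Unless we are at the very beginning of the file, skip the
        // (possibly partial) line we landed in; the previous split's
        // reader reads past its own end and emits that line instead.
        if (start != 0) {
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step past the newline itself
        }
        // Emit whole lines while the line's starting position is <= end.
        // The "<=" mirrors the unconditional skip above: a line starting
        // exactly at 'end' belongs to this split, not the next one.
        while (pos < data.length && pos <= end) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart));
            pos++; // skip the newline (this may run past 'end')
        }
        return lines;
    }
}
```

With this convention, any partition of the file's byte range yields every line exactly once, which is why the reader can accept whatever splits Hadoop hands it.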
.............................................................................
Now comes the correctness of the LRR code.
The testFormat() test case in org.apache.hadoop.mapred.TestTextInputFormat is a
very stringent test case for ensuring the correctness of LRR. And since this
test case is run frequently, whenever someone submits and tests a patch, I
don't think any correctness problem in LRR can escape.
Now for things like HADOOP-3114. That was my mistake, as Chris also mentioned
in his comments. I have corrected it and will upload the new patch soon, after
running the tests locally. I think these are the places where the vision and
knowledge of the Hadoop committers come in handy. If you see that, other than
HADOOP-3114, I have missed something else, please tell me and I will fix it.
Additionally, Hadoop has a very active user community, and any latent bug in
the code (especially code that is used heavily) cannot stay hidden for long.
Also, the serious businesses using Hadoop usually use only the stable version
(e.g., I think Yahoo! still uses 0.18? while 0.20 is out).
So I see a safe transition to the changed LRR code.
........................................................
Now comes the point of making a new BZip2-specific input format / record reader.
If we make a separate reader for BZip2, e.g. a BZip2LineRecordReader, most of
the code in that class will be a replica of LRR. By the same reasoning, GZip
support should also come out of LRR into a GZipLineRecordReader, again with
code that is mostly a replica of LRR. Whenever a new codec arrives in Hadoop,
we would make a new reader whose code is 99% a replica of LRR. So in my view it
makes more sense to handle line-oriented text (be it plain or compressed) in LRR.
When we were adding splitting support for BZip2, we felt that there might be
situations where codecs want to change/manipulate the split start or end, so we
added support for that. Similarly, instead of LRR counting the bytes it has
read itself, it now asks the stream for its position. This feature makes it
possible for a codec to indirectly control how long a reader keeps reading. We
think these features can help other block-based codecs implement splitting
easily, without further changes to LRR.
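The position-based contract described above can be sketched as follows (a hypothetical, self-contained model; these interface and class names are illustrations, not Hadoop's actual compression API). The reader never counts bytes itself; it keeps pulling lines while the stream's reported position has not passed the split end, so the codec behind the stream decides when the split is finished:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of "ask the stream for its position" reading.
interface PositionedLineStream {
    String readLine();  // next line, or null at end of stream
    long getPos();      // position as the stream/codec chooses to report it
}

class PositionDrivenReader {
    // Pull lines until the stream's reported position moves past 'end'.
    // The reader does no byte counting of its own; the codec behind the
    // stream controls, via getPos(), how long reading continues.
    static List<String> readSplit(PositionedLineStream in, long end) {
        List<String> lines = new ArrayList<>();
        String line;
        while (in.getPos() <= end && (line = in.readLine()) != null) {
            lines.add(line);
        }
        return lines;
    }
}
```

A block codec's stream could, for example, keep reporting the offset of the compressed block it is currently decompressing: the reader then naturally finishes the whole block even when the raw split boundary falls in its middle, and stops before a block that belongs to the next split.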
..............................................................
So, in summary, this patch might need relatively stringent reviewing by the
committers due to the heavy usage of the code it touches, but it does add
useful functionality in a seamless way for the end user.
> Providing splitting support for bzip2 compressed files
> ------------------------------------------------------
>
> Key: HADOOP-4012
> URL: https://issues.apache.org/jira/browse/HADOOP-4012
> Project: Hadoop Common
> Issue Type: New Feature
> Components: io
> Affects Versions: 0.21.0
> Reporter: Abdul Qadeer
> Assignee: Abdul Qadeer
> Fix For: 0.21.0
>
> Attachments: Hadoop-4012-version1.patch, Hadoop-4012-version2.patch,
> Hadoop-4012-version3.patch, Hadoop-4012-version4.patch,
> Hadoop-4012-version5.patch, Hadoop-4012-version6.patch,
> Hadoop-4012-version7.patch, Hadoop-4012-version8.patch,
> Hadoop-4012-version9.patch
>
>
> Hadoop assumes that if the input data is compressed, it cannot be split
> (mainly due to the limitation of many codecs that need the whole input
> stream to decompress successfully). So in such a case, Hadoop prepares only
> one split per compressed file, where the lower split limit is 0 and the
> upper limit is the end of the file. The consequence of this decision is
> that one compressed file goes to a single mapper. Although this circumvents
> the limitation of the codecs (as mentioned above), it substantially reduces
> the parallelism that would otherwise be possible with splitting.
> BZip2 is a compression / decompression algorithm that compresses blocks of
> data, and these compressed blocks can later be decompressed independently of
> each other. This is an opportunity: instead of one BZip2 compressed file
> going to one mapper, we can process chunks of the file in parallel. The
> correctness criterion for such processing is that, for a bzip2 compressed
> file, each compressed block should be processed by exactly one mapper, and
> ultimately all the blocks of the file should be processed. (By processing we
> mean the actual utilization of the uncompressed data, coming out of the
> codec, in a mapper.)
> We are writing the code to implement this suggested functionality. Although
> we have used bzip2 as an example, we have tried to extend Hadoop's
> compression interfaces so that any other codec with the same capability as
> bzip2 could easily use the splitting support. The details of these changes
> will be posted when we submit the code.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.