[ https://issues.apache.org/jira/browse/HADOOP-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723016#action_12723016 ]
Abdul Qadeer commented on HADOOP-4012:
--------------------------------------

{quote}
The InputStreamCreationResultant struct seems unnecessary, but perhaps I'm not following the logic there. Of the three auxiliary params returned, end is unmodified from the actual and areLimitsChanged only avoids a redundant assignment to start. Since CompressionInputStream now implements Seekable, can this be replaced with start = in.getPos()?
{quote}

{color:green}
(1) I cannot use start = getPos() in the LineRecordReader's constructor. The reason is that, for BZip2 compressed data, getPos() does not return the actual stream position. The value is manipulated in BZip2Codecs in such a way that this hack works with LineRecordReader's start / end / pos bookkeeping. On the other hand, I need to do things (e.g. throwing away a line if this is not the first split) that require an accurate value of start. So two pieces of information were required from the method createInputStream(...):

a) the new value of start
b) the input stream

The parameter "end" is also passed, just in case some future codec has the flexibility to change start, end, or both.

So I made InputStreamCreationResultant because I wanted to return more than one thing. To avoid making a new class, I could use an array that holds the input stream at index 0 and the parameter start after it, but that would not be type safe. Another option might be to add a method to SplitEnabledCompressionCodec that returns the changed start value. I would then call createInputStream(...) and afterwards call the new method, e.g. getStart(). But the semantics of such a call sequence are not neat and clear. Do we have any other option to avoid this "InputStreamCreationResultant"?

(2) I have fixed the rest of the comments. Once this point is clear, I will prepare the final patch.
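To make the trade-off in point (1) concrete, here is a minimal sketch of the "small result type" option. It is not the code from the patch: StreamAndLimits, SplitStreamSketch, the BLOCK constant and the rounding rule are all hypothetical stand-ins, chosen only to show a createInputStream-style call handing back both the stream and an adjusted start in a type-safe way.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

// Hypothetical holder, loosely modelled on the InputStreamCreationResultant
// discussed above. The name and fields are assumptions for illustration.
final class StreamAndLimits {
    final InputStream in;  // stream positioned at the adjusted start
    final long start;      // start as adjusted by the codec
    final long end;        // end, returned unchanged in this sketch

    StreamAndLimits(InputStream in, long start, long end) {
        this.in = in;
        this.start = start;
        this.end = end;
    }
}

final class SplitStreamSketch {
    // Assumed fixed block size, purely for illustration. Real compressed
    // blocks are located by their markers and are not a fixed size.
    static final long BLOCK = 100;

    // Stand-in for a codec's createInputStream(...). It rounds the requested
    // start up to the next multiple of BLOCK to mimic a codec that can only
    // begin reading at a block boundary. Requires 0 <= start <= end <= data.length.
    static StreamAndLimits createInputStream(byte[] data, long start, long end) {
        long aligned = ((start + BLOCK - 1) / BLOCK) * BLOCK;
        long adjustedStart = Math.min(aligned, end);
        InputStream in = new ByteArrayInputStream(
                data, (int) adjustedStart, (int) (end - adjustedStart));
        return new StreamAndLimits(in, adjustedStart, end);
    }

    public static void main(String[] args) {
        byte[] data = new byte[1000];
        StreamAndLimits r = createInputStream(data, 250, 600);
        // The caller gets the stream and the adjusted start from one call.
        System.out.println("start=" + r.start + " end=" + r.end);
    }
}
```

Compared with an Object[] return, the caller needs no casts and cannot confuse the slots. Compared with a separate getStart() call, the adjusted start cannot be read before the stream that produced it exists.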
{color}

> Providing splitting support for bzip2 compressed files
> ------------------------------------------------------
>
>                 Key: HADOOP-4012
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4012
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>    Affects Versions: 0.21.0
>            Reporter: Abdul Qadeer
>            Assignee: Abdul Qadeer
>             Fix For: 0.21.0
>
>         Attachments: Hadoop-4012-version1.patch, Hadoop-4012-version2.patch,
> Hadoop-4012-version3.patch, Hadoop-4012-version4.patch,
> Hadoop-4012-version5.patch, Hadoop-4012-version6.patch,
> Hadoop-4012-version7.patch, Hadoop-4012-version8.patch,
> Hadoop-4012-version9.patch
>
>
> Hadoop assumes that if the input data is compressed, it cannot be split
> (mainly because many codecs need the whole input stream to decompress
> successfully). So in such a case, Hadoop prepares only one split per
> compressed file, where the lower split limit is 0 and the upper limit is
> the end of the file. The consequence of this decision is that one
> compressed file goes to a single mapper. Although this circumvents the
> codec limitation mentioned above, it substantially reduces the parallelism
> that splitting would otherwise make possible.
> BZip2 is a compression / decompression algorithm that compresses data in
> blocks, and these compressed blocks can later be decompressed independently
> of each other. This is an opportunity: instead of one BZip2 compressed file
> going to one mapper, we can process chunks of the file in parallel. The
> correctness criterion for such processing is that, for a bzip2 compressed
> file, each compressed block should be processed by only one mapper and
> ultimately all the blocks of the file should be processed. (By processing
> we mean the actual use, in a mapper, of the uncompressed data coming out of
> the codec.)
> We are writing the code to implement this suggested functionality. Although
> we have used bzip2 as an example, we have tried to extend Hadoop's
> compression interfaces so that any other codec with the same capability as
> bzip2 could easily use the splitting support. The details of these changes
> will be posted when we submit the code.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.