[ https://issues.apache.org/jira/browse/HADOOP-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720436#action_12720436 ]
Chris Douglas commented on HADOOP-4012:
---------------------------------------

* If an IOException is ignored:
{noformat}
+    if (!(in instanceof Seekable) || !(in instanceof PositionedReadable)) {
+      try {
+        this.maxAvailableData = in.available();
+      } catch (IOException e) {
+      }
+    }
{noformat}
Please include a comment explaining why it should be ignored. Won't this fail in the first call to getPos() if it doesn't throw in the constructor?
* This change to TestCodec looks suspect:
{noformat}
-    SequenceFile.Reader reader = new SequenceFile.Reader(fs, filePath, conf);
-
+    SequenceFile.Reader reader = null;
+    try{
+      reader = new SequenceFile.Reader(fs, filePath, conf);
+    }
+    catch(Exception exp){}
{noformat}
* Instead of the ternary operator in FSInputChecker, {{Math.max(0L, count - pos)}} is more readable.
* With changes such as the following, will the case resolved in HADOOP-3144 continue to work in {{LineRecordReader}}?
{noformat}
-      int newSize = in.readLine(value, maxLineLength,
-                                Math.max((int)Math.min(Integer.MAX_VALUE, end-pos),
-                                         maxLineLength));
+      int newSize = lineReader.readLine(value, maxLineLength, Integer.MAX_VALUE);
{noformat}
* Specifying the constant as {{1L}} should avoid this:
{noformat}
-    return (bsBuffShadow >> (bsLiveShadow - n)) & ((1 << n) - 1);
+    final long one = 1;
+    return (bsBuffShadow >> (bsLiveShadow - n)) & ((one << n) - 1);
{noformat}
* The {{InputStreamCreationResultant}} struct seems unnecessary, but perhaps I'm not following the logic there. Of the three auxiliary params returned, {{end}} is unmodified from the actual and {{areLimitsChanged}} only avoids a redundant assignment to {{start}}. Since {{CompressionInputStream}} now implements {{Seekable}}, can this be replaced with {{start = in.getPos()}}?
* CBZip2InputStream::read() shouldn't allocate a new {{byte[]}} for every read, but should reuse an instance variable.
({{0xFF}} also works, btw)

> Providing splitting support for bzip2 compressed files
> ------------------------------------------------------
>
>                 Key: HADOOP-4012
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4012
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>    Affects Versions: 0.21.0
>            Reporter: Abdul Qadeer
>            Assignee: Abdul Qadeer
>             Fix For: 0.21.0
>
>         Attachments: Hadoop-4012-version1.patch, Hadoop-4012-version2.patch,
> Hadoop-4012-version3.patch, Hadoop-4012-version4.patch,
> Hadoop-4012-version5.patch, Hadoop-4012-version6.patch,
> Hadoop-4012-version7.patch, Hadoop-4012-version8.patch,
> Hadoop-4012-version9.patch
>
>
> Hadoop assumes that if the input data is compressed, it cannot be split
> (mainly because many codecs need the whole input stream to decompress
> successfully). So in such a case, Hadoop prepares only one split per
> compressed file, where the lower split limit is at 0 while the upper limit
> is the end of the file. The consequence of this decision is that one
> compressed file goes to a single mapper. Although this circumvents the
> codec limitation mentioned above, it substantially reduces the parallelism
> that splitting would otherwise make possible.
> BZip2 is a compression / decompression algorithm that compresses blocks of
> data, and these compressed blocks can later be decompressed independently
> of each other. This is an opportunity: instead of one BZip2 compressed file
> going to one mapper, we can process chunks of the file in parallel. The
> correctness criterion for such processing is that, for a bzip2 compressed
> file, each compressed block should be processed by only one mapper and
> ultimately all the blocks of the file should be processed. (By processing
> we mean the actual use of the uncompressed data coming out of the codec in
> a mapper.)
> We are writing the code to implement this suggested functionality.
> Although we have used bzip2 as an example, we have tried to extend Hadoop's
> compression interfaces so that any other codec with the same capability as
> bzip2 could easily use the splitting support. The details of these changes
> will be posted when we submit the code.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
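
On the {{1L}} point in the review comment above: the following is a minimal, standalone Java sketch, not code from any of the attached patches. The class and method names (ShiftMaskSketch, intMask, localLongMask, literalLongMask) are made up for illustration. It only shows that {{(1L << n) - 1}} computes the same mask as the {{final long one = 1}} workaround, and that both differ from the {{int}} form once n reaches 32, because Java uses only the low five bits of the shift distance when the left operand is an {{int}}.

```java
public class ShiftMaskSketch {

    // Hypothetical helper: mask built from an int constant, as in the removed line.
    // The shift is done in int arithmetic and only then widened to long.
    static long intMask(int n) {
        return (1 << n) - 1;
    }

    // Hypothetical helper: mask built via a long local, as in the added lines.
    static long localLongMask(int n) {
        final long one = 1;
        return (one << n) - 1;
    }

    // Hypothetical helper: mask built from a long literal, as the review suggests.
    static long literalLongMask(int n) {
        return (1L << n) - 1;
    }

    public static void main(String[] args) {
        // For small n all three forms agree.
        System.out.println(Long.toHexString(intMask(8)));          // ff
        System.out.println(Long.toHexString(literalLongMask(8)));  // ff

        // For n = 32 the int shift distance wraps to 0, so the mask collapses to 0.
        System.out.println(Long.toHexString(intMask(32)));         // 0
        System.out.println(Long.toHexString(localLongMask(32)));   // ffffffff
        System.out.println(Long.toHexString(literalLongMask(32))); // ffffffff
    }
}
```

The literal form needs no extra local variable, which is presumably the readability gain the comment is after. Whether n can actually reach 32 in the patched method is not something this sketch establishes.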