[
https://issues.apache.org/jira/browse/HADOOP-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720436#action_12720436
]
Chris Douglas commented on HADOOP-4012:
---------------------------------------
* If an IOException is ignored:
{noformat}
+ if (!(in instanceof Seekable) || !(in instanceof PositionedReadable)) {
+ try {
+ this.maxAvailableData = in.available();
+ } catch (IOException e) {
+ }
+ }
{noformat}
Please include a comment explaining why it should be ignored. Won't this fail
in the first call to getPos() if it doesn't throw in the constructor?
* This change to TestCodec looks suspect:
{noformat}
- SequenceFile.Reader reader = new SequenceFile.Reader(fs, filePath, conf);
-
+ SequenceFile.Reader reader = null;
+ try{
+ reader = new SequenceFile.Reader(fs, filePath, conf);
+ }
+ catch(Exception exp){}
{noformat}
* Instead of the ternary operator in FSInputChecker,
{{Math.max(0L, count - pos)}} is more readable.
* With changes such as the following, will the case resolved in HADOOP-3144
continue to work in {{LineRecordReader}}?
{noformat}
- int newSize = in.readLine(value, maxLineLength,
- Math.max((int)Math.min(Integer.MAX_VALUE, end-pos),
- maxLineLength));
+ int newSize = lineReader.readLine(value, maxLineLength,
+ Integer.MAX_VALUE);
{noformat}
* Specifying the constant as {{1L}} should avoid this:
{noformat}
- return (bsBuffShadow >> (bsLiveShadow - n)) & ((1 << n) - 1);
+ final long one = 1;
+ return (bsBuffShadow >> (bsLiveShadow - n)) & ((one << n) - 1);
{noformat}
* The {{InputStreamCreationResultant}} struct seems unnecessary, but perhaps
I'm not following the logic there. Of the three auxiliary params returned,
{{end}} is unmodified from the actual and {{areLimitsChanged}} only avoids a
redundant assignment to {{start}}. Since {{CompressionInputStream}} now
implements {{Seekable}}, can this be replaced with {{start = in.getPos()}}?
* CBZip2InputStream::read() shouldn't allocate a new {{byte[]}} for every read,
but reuse an instance var. ({{0xFF}} also works, btw)
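To illustrate the {{1L}} point above, here is a minimal standalone sketch. The class and method names are hypothetical and not taken from the patch. In Java, an {{int}} shift uses only the low five bits of the shift distance, so {{1 << 32}} evaluates to {{1}} and the resulting mask is zero, whereas the {{long}} constant yields the expected 32-bit mask.

```java
public class BitMaskDemo {
    // Mask built from an int constant: the shift distance is taken modulo 32,
    // so n == 32 yields (1 << 0) - 1 == 0 before widening to long.
    static long intMask(int n) {
        return (1 << n) - 1;
    }

    // Mask built from a long constant: the shift distance is taken modulo 64,
    // so every n in [0, 63] yields a mask with the n low bits set.
    static long longMask(int n) {
        return (1L << n) - 1;
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(intMask(32)));  // prints 0
        System.out.println(Long.toHexString(longMask(32))); // prints ffffffff
    }
}
```

Writing the literal as {{1L}} has the same effect as the {{final long one = 1}} local in the patch, without the extra variable.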
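The buffer-reuse suggestion for the single-byte {{read()}} could look roughly like the following sketch. It is a generic {{InputStream}} wrapper written for illustration, not the actual {{CBZip2InputStream}} code; the class name, the {{oneByte}} field, and the {{readAll}} helper are all assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class SingleByteReadDemo extends InputStream {
    private final InputStream in;
    // Reused by every read() call instead of allocating a new byte[1] each time.
    // Like most stream state, this makes read() unsafe for concurrent callers.
    private final byte[] oneByte = new byte[1];

    public SingleByteReadDemo(InputStream in) {
        this.in = in;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return in.read(b, off, len);
    }

    @Override
    public int read() throws IOException {
        int n = read(oneByte, 0, 1);
        // Masking with 0xFF maps bytes 0x80..0xFF to 128..255 rather than
        // negative values, which would be confused with the -1 EOF marker.
        return n == 1 ? (oneByte[0] & 0xFF) : -1;
    }

    // Hypothetical helper: reads data.length + 1 values so the last one is EOF.
    static int[] readAll(byte[] data) {
        try (SingleByteReadDemo s = new SingleByteReadDemo(new ByteArrayInputStream(data))) {
            int[] out = new int[data.length + 1];
            for (int i = 0; i < out.length; i++) {
                out[i] = s.read();
            }
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, {{readAll(new byte[] {(byte) 0x80, 0x01})}} returns 128, 1, and then -1 for end of stream.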
> Providing splitting support for bzip2 compressed files
> ------------------------------------------------------
>
> Key: HADOOP-4012
> URL: https://issues.apache.org/jira/browse/HADOOP-4012
> Project: Hadoop Core
> Issue Type: New Feature
> Components: io
> Affects Versions: 0.21.0
> Reporter: Abdul Qadeer
> Assignee: Abdul Qadeer
> Fix For: 0.21.0
>
> Attachments: Hadoop-4012-version1.patch, Hadoop-4012-version2.patch,
> Hadoop-4012-version3.patch, Hadoop-4012-version4.patch,
> Hadoop-4012-version5.patch, Hadoop-4012-version6.patch,
> Hadoop-4012-version7.patch, Hadoop-4012-version8.patch,
> Hadoop-4012-version9.patch
>
>
> Hadoop assumes that compressed input data cannot be split (mainly because
> many codecs need the whole input stream to decompress successfully). So in
> such a case, Hadoop prepares only one split per compressed file, where the
> lower split limit is at 0 and the upper limit is the end of the file. The
> consequence of this decision is that one compressed file goes to a single
> mapper. Although this circumvents the codec limitation mentioned above, it
> substantially reduces the parallelism that splitting would otherwise allow.
> BZip2 is a compression/decompression algorithm that compresses data in
> blocks, and these compressed blocks can later be decompressed independently
> of each other. This is an opportunity: instead of one BZip2-compressed file
> going to one mapper, we can process chunks of the file in parallel. The
> correctness criterion for such processing is that, for a bzip2-compressed
> file, each compressed block should be processed by only one mapper and
> ultimately all the blocks of the file should be processed. (By processing we
> mean the actual use of the uncompressed data coming out of the codec in a
> mapper.)
> We are writing the code to implement this suggested functionality. Although
> we have used bzip2 as an example, we have tried to extend Hadoop's
> compression interfaces so that any other codec with the same capability as
> bzip2 could easily use the splitting support. The details of these
> changes will be posted when we submit the code.