[ https://issues.apache.org/jira/browse/HADOOP-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720436#action_12720436 ]

Chris Douglas commented on HADOOP-4012:
---------------------------------------

* If an IOException is ignored:
{noformat}
+    if (!(in instanceof Seekable) || !(in instanceof PositionedReadable)) {
+      try {
+        this.maxAvailableData = in.available();
+      } catch (IOException e) {
+      }
+    }
{noformat}
Please include a comment explaining why it should be ignored. Won't this fail 
in the first call to getPos() if it doesn't throw in the cstr?
* This change to TestCodec looks suspect:
{noformat}
-    SequenceFile.Reader reader = new SequenceFile.Reader(fs, filePath, conf);
-    
+    SequenceFile.Reader reader = null;
+    try{
+    reader = new SequenceFile.Reader(fs, filePath, conf);
+    }
+    catch(Exception exp){}
{noformat}
* Instead of the ternary operator in FSInputChecker, {{Math.max(0L, count - pos)}} is more readable.
* With changes such as the following, will the case resolved in HADOOP-3144 
continue to work in {{LineRecordReader}}?
{noformat}
-      int newSize = in.readLine(value, maxLineLength,
-                                Math.max((int)Math.min(Integer.MAX_VALUE, end-pos),
-                                         maxLineLength));
+      int newSize = lineReader.readLine(value, maxLineLength, Integer.MAX_VALUE);
{noformat}
* Specifying the constant as {{1L}} should avoid this:
{noformat}
-    return (bsBuffShadow >> (bsLiveShadow - n)) & ((1 << n) - 1);
+    final long one = 1;
+    return (bsBuffShadow >> (bsLiveShadow - n)) & ((one << n) - 1);
{noformat}
* The {{InputStreamCreationResultant}} struct seems unnecessary, but perhaps 
I'm not following the logic there. Of the three auxiliary params returned, 
{{end}} is unmodified from the actual and {{areLimitsChanged}} only avoids a 
redundant assignment to {{start}}. Since {{CompressionInputStream}} now 
implements {{Seekable}}, can this be replaced with {{start = in.getPos()}}?
* CBZip2InputStream::read() shouldn't allocate a new {{byte[]}} for every read, 
but reuse an instance var. ({{0xFF}} also works, btw)
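Regarding the ignored IOException in the first point: one way to make the intent explicit is to keep a sentinel value and document why the catch is safe, so a later {{getPos()}} can detect the unknown case instead of failing on stale state. A minimal self-contained sketch, not the patch itself -- the stand-in {{Seekable}}/{{PositionedReadable}} interfaces and the {{init}}/{{getMaxAvailableData}} names are illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class AvailableSketch {
    // Stand-ins for Hadoop's Seekable/PositionedReadable; only the
    // instanceof checks matter for this sketch.
    interface Seekable {}
    interface PositionedReadable {}

    private long maxAvailableData = -1; // -1 == "unknown"

    void init(InputStream in) {
        if (!(in instanceof Seekable) || !(in instanceof PositionedReadable)) {
            try {
                this.maxAvailableData = in.available();
            } catch (IOException e) {
                // available() is only advisory: if the underlying stream
                // cannot report it, keep the sentinel so callers such as
                // getPos() can degrade gracefully instead of failing later.
                this.maxAvailableData = -1;
            }
        }
    }

    long getMaxAvailableData() {
        return maxAvailableData;
    }
}
```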
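On the FSInputChecker point, the suggested {{Math.max(0L, count - pos)}} behaves the same as the ternary (assuming {{count - pos}} does not overflow) while reading more clearly. A trivial sketch; the {{remaining}} name is illustrative:

```java
public class MaxSketch {
    // Equivalent to: (count - pos > 0) ? count - pos : 0,
    // but the intent -- clamp at zero -- is visible at a glance.
    static long remaining(long count, long pos) {
        return Math.max(0L, count - pos);
    }
}
```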
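On the shift-mask point, the {{1L}} literal matters because Java takes an int shift distance mod 32 (JLS 15.19), so an int-typed mask silently breaks for n >= 32, while a long shift distance is taken mod 64. A sketch demonstrating why the literal alone suffices, with no temporary {{one}} variable (class/method names are illustrative):

```java
public class ShiftSketch {
    // With an int 1, (1 << 32) == 1 because the shift count is reduced
    // mod 32, so ((1 << 32) - 1) == 0 -- the wrong mask. With 1L the
    // shift count is reduced mod 64, so masks up to 63 bits are correct.
    static long mask(int n) {
        return (1L << n) - 1;
    }
}
```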
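And for the last point, a sketch of a {{read()}} that reuses a one-byte instance buffer instead of allocating a fresh {{byte[]}} per call, masking with {{0xFF}} to avoid sign extension. The class, fields, and backing-data constructor here are illustrative stand-ins, not CBZip2InputStream itself:

```java
import java.io.IOException;
import java.io.InputStream;

public class ReuseBufferSketch extends InputStream {
    private final byte[] oneByte = new byte[1]; // reused across read() calls
    private final byte[] data;                  // stand-in for decompressed output
    private int pos = 0;

    ReuseBufferSketch(byte[] data) {
        this.data = data;
    }

    @Override
    public int read(byte[] b, int off, int len) {
        if (pos >= data.length) {
            return -1;
        }
        b[off] = data[pos++];
        return 1;
    }

    @Override
    public int read() throws IOException {
        // No per-call allocation: delegate to the bulk read via the
        // instance buffer; 0xFF keeps the result in 0..255.
        int n = read(oneByte, 0, 1);
        return (n <= 0) ? -1 : (oneByte[0] & 0xFF);
    }
}
```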

> Providing splitting support for bzip2 compressed files
> ------------------------------------------------------
>
>                 Key: HADOOP-4012
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4012
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>    Affects Versions: 0.21.0
>            Reporter: Abdul Qadeer
>            Assignee: Abdul Qadeer
>             Fix For: 0.21.0
>
>         Attachments: Hadoop-4012-version1.patch, Hadoop-4012-version2.patch, 
> Hadoop-4012-version3.patch, Hadoop-4012-version4.patch, 
> Hadoop-4012-version5.patch, Hadoop-4012-version6.patch, 
> Hadoop-4012-version7.patch, Hadoop-4012-version8.patch, 
> Hadoop-4012-version9.patch
>
>
> Hadoop assumes that if the input data is compressed, it cannot be split 
> (mainly because many codecs need the whole input stream to decompress 
> successfully).  In such a case, Hadoop prepares only one split per 
> compressed file, with the lower split limit at 0 and the upper limit at 
> the end of the file.  The consequence of this decision is that one 
> compressed file goes to a single mapper.  Although this circumvents the 
> codec limitation mentioned above, it substantially reduces the parallelism 
> that splitting would otherwise make possible.
> BZip2 is a compression/decompression algorithm that compresses data in 
> blocks, and these compressed blocks can later be decompressed 
> independently of each other.  This is an opportunity: instead of one 
> bzip2-compressed file going to one mapper, we can process chunks of the 
> file in parallel.  The correctness criterion for such processing is that, 
> for a bzip2 compressed file, each compressed block should be processed by 
> exactly one mapper, and ultimately all the blocks of the file should be 
> processed.  (By processing we mean the actual utilization of the 
> un-compressed data coming out of the codec in a mapper.)
> We are writing the code to implement this suggested functionality.  
> Although we have used bzip2 as an example, we have tried to extend 
> Hadoop's compression interfaces so that any other codec with the same 
> capability as bzip2 could easily use the splitting support.  The details 
> of these changes will be posted when we submit the code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.