[jira] Updated: (HADOOP-4012) Providing splitting support for bzip2 compressed files

Chris Douglas (JIRA) Wed, 03 Dec 2008 20:24:07 -0800

     [ https://issues.apache.org/jira/browse/HADOOP-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Chris Douglas updated HADOOP-4012:
----------------------------------

    Fix Version/s:     (was: 0.20.0)
           Status: Open  (was: Patch Available)

{quote}
The following change was done in this new patch. Before this change, getPos()
was returning values one less than what it should be. Similarly available()
method was returning -1 because the value of count becomes -1 at the end of
the chunk.
{quote}
Should this change be part of a separate issue, then? I'm not sure what you 
mean by "two of the 4 bugs", but bug fixes shouldn't be part of large, new 
features if the fix is unaffected by the feature.

* This modifies TestMultipleCacheFiles to append a newline at the end of the 
file. Why is this necessary? Is this the same problem as HADOOP-4182?
* Pushing the READ_MODE abstraction (and the new createInputStream) into the 
CompressionCodec interface, particularly when only bzip2 supports it, is 
inappropriate. If it's applicable to codecs other than bzip2, it should be a 
separate interface (extending CompressionCodec?). This would also let 
instanceof replace canDecompressSplitInput and move seekBackwards to the new 
interface. Can you describe what it means for a codec to implement this 
superset of functions?
* This patch incorporates HADOOP-4010:
{noformat}
-    while (pos < end) {
+    // We always read one extra line, which lies outside the upper
+    // split limit i.e. (end - 1)
+    pos = this.getPos();
+    
+    while (pos <= end) {
{noformat}
{noformat}
+    // If this is not the first split, we always throw away first record
+    // because we always (except the last split) read one extra line in
+    // next() method.
{noformat}
Shouldn't this remain with the original JIRA? Are the issues raised there 
addressed in this patch?
* Does this add the Seekable interface to CompressionInputStream only to 
support getPos() for LineRecordReader?
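
To make the separate-interface suggestion above concrete, here is a minimal
sketch. It is an assumption, not part of the patch or of Hadoop:
CompressionCodec below is a one-method stand-in for the real interface, and
SplittableCodec, its range-based createInputStream, the two toy codecs, and
SplitSupport are all hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Simplified stand-in for Hadoop's CompressionCodec (only one method shown).
interface CompressionCodec {
    InputStream createInputStream(InputStream in) throws IOException;
}

// Hypothetical sub-interface: only codecs that can start decompressing
// partway through a file implement it.
interface SplittableCodec extends CompressionCodec {
    // Hypothetical method: open a stream over the byte range [start, end).
    InputStream createInputStream(InputStream in, long start, long end)
            throws IOException;
}

// Toy codec without split support; passes bytes through unchanged.
class WholeStreamCodec implements CompressionCodec {
    public InputStream createInputStream(InputStream in) {
        return in;
    }
}

// Toy codec with split support; a real codec would align on block boundaries.
class BlockCodec implements SplittableCodec {
    public InputStream createInputStream(InputStream in) {
        return in;
    }

    public InputStream createInputStream(InputStream in, long start, long end)
            throws IOException {
        // Skip to the start of the range.
        long toSkip = start;
        while (toSkip > 0) {
            long skipped = in.skip(toSkip);
            if (skipped <= 0) {
                break;
            }
            toSkip -= skipped;
        }
        // Copy at most (end - start) bytes into a bounded stream.
        byte[] buf = new byte[(int) (end - start)];
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                break;
            }
            off += n;
        }
        return new ByteArrayInputStream(buf, 0, off);
    }
}

class SplitSupport {
    // The type check replaces a boolean capability method on every codec.
    static boolean canSplit(CompressionCodec codec) {
        return codec instanceof SplittableCodec;
    }
}
```

With this shape, callers that compute splits would test the codec's type once,
and codecs that cannot split would never see the range-based method.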

This affects too many core components to make the feature freeze for 0.20 (Fri).

> Providing splitting support for bzip2 compressed files
> ------------------------------------------------------
>
>                 Key: HADOOP-4012
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4012
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Abdul Qadeer
>            Assignee: Abdul Qadeer
>         Attachments: Hadoop-4012-version1.patch, Hadoop-4012-version2.patch, 
> Hadoop-4012-version3.patch, Hadoop-4012-version4.patch
>
>
> Hadoop assumes that compressed input cannot be split, mainly because many 
> codecs need the whole input stream to decompress successfully.  In that case 
> Hadoop prepares only one split per compressed file, with the lower split 
> limit at 0 and the upper limit at the end of the file.  As a consequence, 
> each compressed file goes to a single mapper.  This works around the codec 
> limitation but substantially reduces the parallelism that splitting would 
> otherwise allow.
> BZip2 is a compression/decompression algorithm that compresses data in 
> blocks, and these compressed blocks can later be decompressed independently 
> of each other.  This creates an opportunity: instead of one BZip2 compressed 
> file going to one mapper, chunks of the file can be processed in parallel.  
> The correctness criterion for such processing is that, for a bzip2 
> compressed file, each compressed block is processed by only one mapper and 
> ultimately all the blocks of the file are processed.  (By processing we mean 
> the actual use, in a mapper, of the uncompressed data coming out of the 
> codec.)
> We are writing the code to implement this functionality.  Although we have 
> used bzip2 as an example, we have tried to extend Hadoop's compression 
> interfaces so that any other codec with the same capability as bzip2 could 
> easily use the splitting support.  The details of these changes will be 
> posted when we submit the code.
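
The correctness criterion in the description (every unit consumed by only one
mapper, and none skipped) is the same invariant that the line-reading
convention quoted in the review comment relies on. The sketch below
illustrates that convention on plain, uncompressed bytes. It is an assumption
for illustration only, not the patch's LineRecordReader; SplitLineReader,
readSplit, and readAll are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SplitLineReader {
    // Returns the lines owned by the split [start, end) of data.
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Not the first split: throw away everything through the first
            // newline, because the previous split reads one extra line.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        List<String> lines = new ArrayList<>();
        // "<=" rather than "<": a line that starts exactly at end is read
        // here, since the next split will discard it.
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            lines.add(new String(data, lineStart, pos - lineStart,
                    StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return lines;
    }

    // Reads data as consecutive splits of splitSize bytes and concatenates
    // the lines each split reports.
    static List<String> readAll(byte[] data, int splitSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            all.addAll(readSplit(data, start, end));
        }
        return all;
    }
}
```

Whatever split size is chosen, readAll should return each line once and in
order, which is the property a splittable codec would need to preserve for
compressed blocks.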

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
