[ 
https://issues.apache.org/jira/browse/HADOOP-8003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210118#comment-13210118
 ] 

Tim Broberg commented on HADOOP-8003:
-------------------------------------

Actually, once I got away from the computer, I saw why seek() is the wrong tool 
for the job and option #3 doesn't work. A user of the stream expects seek() to 
address offsets in the *decompressed* byte stream whereas LineRecordReader 
wants to split the *compressed* stream at block boundaries.

So, it's either the status quo or a switch to returning an interface, done in 
the most compatible way we can come up with.

...or get really radical and split into fixed-size decompressed chunks rather 
than compressed ones, but that messes up locality, and we don't want that.
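
For the interface option, here is a minimal sketch of the shape it could take.
It uses plain java.io stand-ins rather than the real Hadoop classes, and every
name in it (SplitAwareStream, BaseDecompressorStream, SplittableStream) is
hypothetical. The accessor names are assumptions modeled on the thin functions
the abstract class adds. The point is only that a stream can inherit its read
logic from a shared base and still expose split boundaries:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class SplitStreamSketch {

    /** Hypothetical interface form: only the thin split-boundary accessors. */
    public interface SplitAwareStream {
        long getAdjustedStart();
        long getAdjustedEnd();
    }

    /**
     * Stand-in for a reusable base such as DecompressorStream.
     * A real base would decompress; this one just passes bytes through.
     */
    public static class BaseDecompressorStream extends FilterInputStream {
        public BaseDecompressorStream(InputStream in) {
            super(in);
        }
    }

    /** Reuses the shared base for reading and adds split awareness on top. */
    public static class SplittableStream extends BaseDecompressorStream
            implements SplitAwareStream {
        private final long adjustedStart;
        private final long adjustedEnd;

        public SplittableStream(InputStream in, long adjustedStart, long adjustedEnd) {
            super(in);
            this.adjustedStart = adjustedStart;
            this.adjustedEnd = adjustedEnd;
        }

        @Override
        public long getAdjustedStart() {
            return adjustedStart;
        }

        @Override
        public long getAdjustedEnd() {
            return adjustedEnd;
        }
    }

    public static void main(String[] args) throws IOException {
        try (SplittableStream s = new SplittableStream(
                new ByteArrayInputStream(new byte[] {10, 20, 30}), 200L, 500L)) {
            System.out.println(s.read());             // read() inherited from the base
            System.out.println(s.getAdjustedStart()); // accessor from the interface
            System.out.println(s instanceof SplitAwareStream);
        }
    }
}
```

With an abstract class in place of the interface, SplittableStream could not
extend BaseDecompressorStream at all, which is the code-reuse problem the
issue describes.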
                
> Make SplitCompressionInputStream an interface instead of an abstract class
> --------------------------------------------------------------------------
>
>                 Key: HADOOP-8003
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8003
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>    Affects Versions: 0.21.0, 0.22.0, 0.23.0, 1.0.0
>            Reporter: Tim Broberg
>
> To be splittable, a codec must implement SplittableCompressionCodec, which 
> has a function returning a SplitCompressionInputStream.
> SplitCompressionInputStream is an abstract class which extends 
> CompressionInputStream, the lowest level compression stream class.
> So, no codec that wants to be splittable can reuse any code from 
> DecompressorStream or BlockDecompressorStream.
> You either have to duplicate that code, or not be splittable.
> SplitCompressionInputStream adds just a few very thin functions. Can we make 
> this an interface rather than an abstract class to allow splittable 
> decompression streams to extend DecompressorStream, BlockDecompressorStream, 
> or whatever else we should scheme up in the future?
> To my knowledge, this would impact only the BZip2 codec. None of the others 
> implement this form of splittability yet.
> LineRecordReader looks only at whether the codec is an instance of 
> SplittableCompressionCodec, and then calls the appropriate version of 
> createInputStream. This would not change, so the application code should not 
> have to change, just BZip and SplitCompressionInputStream.


        
