[ 
https://issues.apache.org/jira/browse/HADOOP-8003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209898#comment-13209898
 ] 

Tim Broberg commented on HADOOP-8003:
-------------------------------------

Ok, I'm ready to address this issue in earnest now.

I see three basic approaches here:

1 - Status Quo: For any splittable compression input stream, just extend 
CompressionInputStream and write any pieces you need from DecompressorStream, 
BlockDecompressorStream, or what have you from scratch. This isn't pretty, but 
it works with 1.0.0 and trunk. Anybody that wants to extend your new class is 
also out of luck.

2 - Compromise: Make SplitCompressionInputStream an interface. (As in my 
previous suggestion, but eliminate #3 and #5. I realize now you can just return 
an interface and treat this as a class.) Applications are unchanged, though we 
might have to tweak bzip a bit. Alternatively, Tom's idea might work out better 
here, but I don't think I grokked it fully from the description.

3 - Ideal case: Dump the whole splittable codec structure and use the 
previously existing seekable interface of CompressionInputStream. In 
LineRecordReader (and TestCodec), try to seek (and/or skip?) to the offset you 
need, which splittable codecs would handle. Non-splittable codecs 
continue to throw unsupported, in which case LineRecordReader would revert to 
decoding sequentially as it does now. (I'm unclear on the state of skip in 
CompressionInputStream. Does InputStream.skip() just work?) This actually backs 
out two classes (three if you count HADOOP-7076), simplifying the interface, 
but would require modifications to LineRecordReader, TestCodec, lzop, and 
bzip2. This would make CompressionInputStreams conform to the general purpose 
Seekable interface, which would open up new usage possibilities, and seems much 
cleaner than the other options. For one thing, there's no messy business of 
asking for offsets start through end and getting something else entirely - you 
seek to start and read until you reach end and the underlying 
CompressionInputStream takes care of discarding the uninteresting bits.
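
To make option 3 a little more concrete, here is a rough sketch of the "try to seek, fall back to sequential decoding" pattern. Everything in it is a hypothetical stand-in: SketchCompressionInputStream, SequentialOnlyStream, SeekableSketchStream, and readRange are names made up for illustration, not Hadoop classes, and the "codec" is just an identity pass over a byte array. It also assumes that a non-splittable stream signals "unsupported" with UnsupportedOperationException, and the fallback here simply reads and discards bytes up to start, which is a simplification of what LineRecordReader does today.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a decompressed stream with a seek method; NOT Hadoop's class.
abstract class SketchCompressionInputStream extends InputStream {
    // Assumed default for non-splittable codecs: seeking is unsupported.
    void seek(long pos) throws IOException {
        throw new UnsupportedOperationException("seek not supported");
    }
}

// A non-splittable stream: inherits the throwing seek, can only be read in order.
class SequentialOnlyStream extends SketchCompressionInputStream {
    private final ByteArrayInputStream data;
    SequentialOnlyStream(byte[] bytes) { data = new ByteArrayInputStream(bytes); }
    @Override public int read() { return data.read(); }
}

// A splittable stream: seek positions the stream directly at the requested offset.
class SeekableSketchStream extends SketchCompressionInputStream {
    private final byte[] bytes;
    private int pos;
    SeekableSketchStream(byte[] bytes) { this.bytes = bytes; }
    @Override void seek(long target) throws IOException {
        if (target < 0 || target > bytes.length) {
            throw new IOException("seek out of range: " + target);
        }
        pos = (int) target;
    }
    @Override public int read() { return pos < bytes.length ? (bytes[pos++] & 0xff) : -1; }
}

class SeekOrFallbackSketch {
    // Reads the bytes in [start, end): seek if the stream allows it, otherwise skip by reading.
    static byte[] readRange(SketchCompressionInputStream in, long start, long end)
            throws IOException {
        try {
            in.seek(start);
        } catch (UnsupportedOperationException e) {
            // Fallback: decode sequentially and throw away everything before start.
            for (long skipped = 0; skipped < start; skipped++) {
                if (in.read() < 0) break;
            }
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (long p = start; p < end; p++) {
            int b = in.read();
            if (b < 0) break; // end of stream reached before end offset
            out.write(b);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "0123456789".getBytes(StandardCharsets.US_ASCII);
        byte[] viaSeek = readRange(new SeekableSketchStream(data), 3, 7);
        byte[] viaFallback = readRange(new SequentialOnlyStream(data), 3, 7);
        System.out.println(new String(viaSeek, StandardCharsets.US_ASCII));
        System.out.println(new String(viaFallback, StandardCharsets.US_ASCII));
    }
}
```

The caller asks for start through end in both cases and never needs to know which kind of codec it got, which is the property that makes this option look cleaner than the others.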

In my own case, I need to be able to provide code to customers in a timely 
fashion as a plugin, support versions back to 1.0.0, and incorporate it into 
core when appropriate.

To meet these goals, status quo (#1) is looking pretty tolerable to me now. 
There are about 8 stubby methods from DecompressorStream I will have to 
duplicate, but if the community would prefer to pursue one of the tidier 
options, I'd be happy to contribute.

Comments?

Anybody feel like talking me out of being lazy?

    - Tim.

                
> Make SplitCompressionInputStream an interface instead of an abstract class
> --------------------------------------------------------------------------
>
>                 Key: HADOOP-8003
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8003
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>    Affects Versions: 0.21.0, 0.22.0, 0.23.0, 1.0.0
>            Reporter: Tim Broberg
>
> To be splittable, a codec must extend SplittableCompressionCodec which has a 
> function returning a SplitCompressionInputStream.
> SplitCompressionInputStream is an abstract class which extends 
> CompressionInputStream, the lowest level compression stream class.
> So, no codec that wants to be splittable can reuse any code from 
> DecompressorStream or BlockDecompressorStream.
> You either have to duplicate that code, or not be splittable.
> SplitCompressionInputStream adds just a few very thin functions. Can we make 
> this an interface rather than an abstract class to allow splittable 
> decompression streams to extend DecompressorStream, BlockDecompressorStream, 
> or whatever else we should scheme up in the future?
> To my knowledge, this would impact only the BZip2 codec. None of the others 
> implement this form of splittability yet.
> LineRecordReader looks only at whether the codec is an instance of 
> SplittableCompressionCodec, and then calls the appropriate version of 
> createInputStream. This would not change, so the application code should not 
> have to change, just BZip and SplitCompressionInputStream.
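
As an illustration of the change requested in the description above, the following is a minimal sketch of how an interface would let a splittable stream reuse an existing base class. All names are hypothetical stand-ins (SketchDecompressorStream plays the role of DecompressorStream, SketchSplitStream plays the role of an interface version of SplitCompressionInputStream), and the two accessor names are an assumption about which thin methods such an interface would carry. The "decompression" is an identity pass-through.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Stand-in for a reusable base such as DecompressorStream (hypothetical, simplified).
class SketchDecompressorStream extends InputStream {
    private final InputStream in;
    SketchDecompressorStream(InputStream in) { this.in = in; }
    // Identity "decompression": the shared logic a subclass wants to inherit.
    @Override public int read() throws IOException { return in.read(); }
}

// Hypothetical interface carrying only the thin split-related methods.
interface SketchSplitStream {
    long getAdjustedStart();
    long getAdjustedEnd();
}

// With an interface, one class can extend the base AND be splittable.
// With an abstract class in place of the interface, this would not compile,
// because Java allows only one superclass.
class SketchSplittableStream extends SketchDecompressorStream implements SketchSplitStream {
    private final long start;
    private final long end;
    SketchSplittableStream(InputStream in, long start, long end) {
        super(in);
        this.start = start;
        this.end = end;
    }
    @Override public long getAdjustedStart() { return start; }
    @Override public long getAdjustedEnd() { return end; }
}

class InterfaceSketch {
    public static void main(String[] args) throws IOException {
        SketchSplittableStream s =
            new SketchSplittableStream(new ByteArrayInputStream(new byte[] {42}), 0, 1);
        System.out.println(s.read());                  // inherited read path
        SketchSplitStream split = s;                   // same object seen through the interface
        System.out.println(split.getAdjustedStart() + ".." + split.getAdjustedEnd());
    }
}
```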

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

