[ https://issues.apache.org/jira/browse/HADOOP-8148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401710#comment-13401710 ]

Owen O'Malley commented on HADOOP-8148:
---------------------------------------

Sorry for coming into this late.

I've been working with the compression codecs recently and I have several 
related observations:
1. No one seems to use the compressors/decompressors directly. They always use 
the streams.
2. The current interface is difficult to implement efficiently. To avoid 
copies, I always end up implementing the streams directly rather than using a 
Compressor (see the sketch after this list).
3. As with most code of this kind, the pure Java versions are much less 
hassle and perform better than the JNI versions.
4. There aren't that many users out there, but the users include all of the 
important file formats (SequenceFile, TFile, HFile, and RCFile) and the 
MapReduce framework. (That isn't to say that we can delete the old interfaces, 
but they aren't user-facing to the same degree as FileSystem, Mapper, and 
Reducer.)
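
To make observation 2 concrete, here is a minimal sketch (mine, not from any 
patch; feedInput is a made-up helper) of the staging copy the byte[]-only 
Decompressor API forces when the compressed bytes already sit in a direct 
buffer:

    import java.nio.ByteBuffer;
    import org.apache.hadoop.io.compress.Decompressor;

    class CopyIllustration {
      // The input is already in a DirectByteBuffer (e.g. straight off a
      // SocketChannel), but setInput() only accepts byte[], so we have to
      // allocate and fill a staging array first.
      static void feedInput(Decompressor decompressor, ByteBuffer direct) {
        byte[] staging = new byte[direct.remaining()]; // extra allocation
        direct.get(staging);                           // the copy we want gone
        decompressor.setInput(staging, 0, staging.length);
      }
    }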

My inclination is that extending Compressor/Decompressor is a mistake. On the 
other hand, making a sub-class of Codec seems like a good idea so that we can 
make Codecs that implement both the new and old interfaces.
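
Something like the following is the shape I mean (a sketch only; these names 
are made up, not the patch's interfaces). Old call sites keep seeing plain 
CompressionCodec, while new ones can check for the extra capability:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import org.apache.hadoop.io.compress.CompressionCodec;

    // A codec that also offers a ByteBuffer-based decompression path.
    interface DirectDecompressionCodec extends CompressionCodec {
      DirectDecompressor createDirectDecompressor();
    }

    interface DirectDecompressor {
      // Decompress the remaining bytes of src into dst, advancing both
      // buffers' positions; with direct buffers there is no byte[] staging.
      void decompress(ByteBuffer src, ByteBuffer dst) throws IOException;
    }

A caller would test "codec instanceof DirectDecompressionCodec" and fall back 
to the existing stream path otherwise, so nothing breaks for old codecs.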

Thoughts?
                
> Zero-copy ByteBuffer-based compressor / decompressor API
> --------------------------------------------------------
>
>                 Key: HADOOP-8148
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8148
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io, performance
>            Reporter: Tim Broberg
>            Assignee: Tim Broberg
>         Attachments: hadoop8148.patch
>
>
> Per Todd Lipcon's comment in HDFS-2834, "
>   Whenever a native decompression codec is being used, ... we generally have 
> the following copies:
>   1) Socket -> DirectByteBuffer (in SocketChannel implementation)
>   2) DirectByteBuffer -> byte[] (in SocketInputStream)
>   3) byte[] -> Native buffer (set up for decompression)
>   4*) decompression to a different native buffer (not really a copy - 
> decompression necessarily rewrites)
>   5) native buffer -> byte[]
>   with the proposed improvement we can hopefully eliminate #2 and #3 for all 
> applications, and #2, #3, and #5 for libhdfs.
> "
> The interfaces in the attached patch attempt to address:
>  A - Compression and decompression based on ByteBuffers (HDFS-2834)
>  B - Zero-copy compression and decompression (HDFS-3051)
>  C - Provide the caller a way to know the maximum space required to hold the 
> compressed output.
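> As an illustration of the rough shape this could take (a sketch only; 
> ByteBufferCompressor is a made-up name here, and the actual interfaces are 
> in the attached patch):
>
>     import java.io.IOException;
>     import java.nio.ByteBuffer;
>
>     public interface ByteBufferCompressor {
>       // (A, B) Compress the remaining bytes of src into dst, advancing
>       // both buffers' positions; direct buffers make the path zero-copy.
>       void compress(ByteBuffer src, ByteBuffer dst) throws IOException;
>
>       // (C) Worst-case compressed size of uncompressedLen input bytes,
>       // so the caller can size dst before compressing.
>       int maxCompressedLength(int uncompressedLen);
>     }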
