[ https://issues.apache.org/jira/browse/HADOOP-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13810522#comment-13810522 ]

Gopal V commented on HADOOP-10047:
----------------------------------

I figured out why the patch isn't applying - branch-2 is missing a patch from 
trunk. I had assumed they were not meant to diverge, but the snappy native code 
has been patched only in trunk.

HADOOP-8151 (fixing exceptions) isn't merged to branch-2, but is checked into 
trunk. I think that needs a merge-from-trunk.

And to answer your second question, you can indeed pass two different src 
buffers to a stream compression algorithm, and that works for zlib/gzip. But it 
does not work that way for SNAPPY, for instance, so I did not want to add that 
to the base interface contract.

But the real reason the src buffer is supplied to the decompress call has to do 
with the GC, and that is why it departs from the regular decompress API.

The mapped direct buffers in HDFS are not subject to GC pressure and will not 
be collected until the regular heap overflows. Memory-mapping a few hundred 
10MB blocks costs only a few KB of heap space - but YARN does kill tasks which 
overflow the vmem checks. So to ensure tasks don't get killed by the vmem 
checks in YARN, good code has to end up calling 
FSDataInputStream::releaseBuffer(ByteBuffer buffer) once the buffer has been 
consumed. This means not leaving dangling references within the compressor - 
the old API can rely on GC collection, but the new API can't, because it does 
not generate enough heap pressure to trigger a GC.
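
To make that concrete, the calling pattern looks roughly like this - just a 
sketch, with the path, the buffer pool choice and the 10MB read size made up 
for illustration:

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.ReadOption;
import org.apache.hadoop.io.ElasticByteBufferPool;

public class ZeroCopyReleaseExample {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    ElasticByteBufferPool pool = new ElasticByteBufferPool();

    FSDataInputStream in = fs.open(new Path("/data/part-00000"));
    try {
      // Zero-copy read: may hand back an mmap'd region of the block file.
      ByteBuffer buf = in.read(pool, 10 * 1024 * 1024,
          EnumSet.of(ReadOption.SKIP_CHECKSUMS));
      if (buf != null) {
        try {
          process(buf);            // e.g. hand it to a direct decompressor
        } finally {
          in.releaseBuffer(buf);   // unmap promptly; don't wait for a GC
        }
      }
    } finally {
      in.close();
    }
  }

  private static void process(ByteBuffer buf) {
    // consume the mapped bytes without copying them into a byte[]
  }
}
{code}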

So the API avoids creating a new reference to that buffer, so that we don't end 
up accidentally unmapping addresses that are still referred to. This is a 
slight burden on the user of the API, but it makes them aware of this 
possibility while writing code, rather than only when debugging.
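
In interface terms, the contract under discussion is roughly the following - a 
sketch based on the attached proposal, not a final signature:

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;

/**
 * Sketch of the call contract: the caller supplies the src buffer on every
 * call, and the implementation must not keep a reference to it after
 * decompress() returns, so the caller is free to releaseBuffer() the mapped
 * region as soon as it has been consumed.
 */
public interface DirectDecompressor {
  void decompress(ByteBuffer src, ByteBuffer dst) throws IOException;
}
{code}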

And as for using different src buffers, it would have been nice if it worked 
for all algorithms, because mapped direct buffers cannot be consolidated. So if 
the compressed stream crosses block boundaries, it makes sense to call 
decompress multiple times with new src buffers.
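
For zlib/gzip that would look something like the loop below, reusing the 
DirectDecompressor sketch above; the decompressor construction, the list of 
mapped blocks and the dst handling are simplified for illustration:

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.List;

public class SpanningStreamExample {
  /**
   * Feed each mapped block of one compressed stream to the same decompressor
   * instance. zlib/gzip carry their stream state across calls, so a new src
   * buffer per call is fine; Snappy does not, as noted above.
   */
  static void decompressAcrossBlocks(DirectDecompressor decompressor,
                                     List<ByteBuffer> mappedBlocks,
                                     ByteBuffer dst) throws IOException {
    for (ByteBuffer src : mappedBlocks) {
      decompressor.decompress(src, dst);
      // once src is fully consumed, the caller can releaseBuffer(src) here
    }
  }
}
{code}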

> Add a directbuffer Decompressor API to hadoop
> ---------------------------------------------
>
>                 Key: HADOOP-10047
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10047
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>    Affects Versions: 2.3.0
>            Reporter: Gopal V
>            Assignee: Gopal V
>              Labels: compression
>             Fix For: 2.3.0
>
>         Attachments: DirectCompressor.html, DirectDecompressor.html, 
> HADOOP-10047-WIP.patch, HADOOP-10047-final.patch, 
> HADOOP-10047-redo-WIP.patch, HADOOP-10047-with-tests.patch
>
>
> With the Zero-Copy reads in HDFS (HDFS-5260), it becomes important to perform 
> all I/O operations without copying data into byte[] buffers or other buffers 
> which wrap over them.
> This is a proposal for adding a DirectDecompressor interface to the 
> io.compress, to indicate codecs which want to surface the direct buffer layer 
> upwards.
> The implementation should work with direct heap/mmap buffers and cannot 
> assume .array() availability.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
