[ https://issues.apache.org/jira/browse/HADOOP-10591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998011#comment-13998011 ]

Colin Patrick McCabe commented on HADOOP-10591:
-----------------------------------------------

Hmm.  The JIRA talks about "direct buffers allocated by compression codecs like 
Gzip (which allocates 2 direct buffers per instance)."
I assume this is a reference to {{ZlibDecompressor#compressedDirectBuf}} and 
{{ZlibDecompressor#uncompressedDirectBuf}}.  Those are buffers inside 
{{Decompressor}} objects, not buffers inside {{Codec}} objects.
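For reference, the allocation pattern in question looks roughly like this (a 
simplified sketch: the field names match the real {{ZlibDecompressor}}, but 
the class around them is abbreviated):
{code}
import java.nio.ByteBuffer;

// Sketch of the per-instance allocation in ZlibDecompressor.
class ZlibDecompressorSketch {
  private final ByteBuffer compressedDirectBuf;
  private final ByteBuffer uncompressedDirectBuf;

  ZlibDecompressorSketch(int directBufferSize) {
    // Two direct buffers per instance.  They live outside the Java heap,
    // so they are reclaimed only when this object is garbage collected.
    compressedDirectBuf = ByteBuffer.allocateDirect(directBufferSize);
    uncompressedDirectBuf = ByteBuffer.allocateDirect(directBufferSize);
  }
}
{code}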

*However*... {{CodecPool}} has a cache for {{Compressor}} and {{Decompressor}} 
objects, but using it appears to be optional, not mandatory.  For example, this 
code in {{SequenceFile}} is careful to get its {{Decompressor}} from the pool:
{code}
keyLenDecompressor = CodecPool.getDecompressor(codec);
keyLenInFilter = codec.createInputStream(keyLenBuffer,
                                         keyLenDecompressor);
{code}
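({{SequenceFile}} also hands these back via {{CodecPool#returnDecompressor}} 
when the reader is closed.)  The complete caller-side pattern looks something 
like this (the helper method here is mine; the {{CodecPool}} calls are the 
real API):
{code}
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.Decompressor;

class PooledDecompressExample {
  // Reads a compressed stream with a pooled Decompressor, so its two
  // direct buffers are reused instead of re-allocated per stream.
  static long countUncompressedBytes(CompressionCodec codec, InputStream rawIn)
      throws IOException {
    Decompressor decompressor = CodecPool.getDecompressor(codec);
    try {
      CompressionInputStream in = codec.createInputStream(rawIn, decompressor);
      byte[] buf = new byte[4096];
      long total = 0;
      for (int n; (n = in.read(buf)) != -1; ) {
        total += n;
      }
      return total;
    } finally {
      // Return the Decompressor (and its direct buffers) to the pool.
      CodecPool.returnDecompressor(decompressor);
    }
  }
}
{code}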

On the other hand, there are also one-argument versions of the 
{{createInputStream}} functions that always create a new {{Decompressor}} (and 
similar one-argument versions for {{createOutputStream}}).
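That is, code like the following bypasses the pool entirely; each call 
allocates a fresh {{Decompressor}} and its two direct buffers (a contrived 
example):
{code}
// No Decompressor supplied: the codec creates a brand-new one internally,
// along with two new direct buffers that no pool will ever reclaim.
CompressionInputStream in = codec.createInputStream(rawIn);
{code}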

What's the right resolution here?  Is it just to mark the one-argument versions 
as deprecated and audit HDFS and Hadoop client programs to remove usages?  That 
certainly seems like a good idea, if we want to cache these {{ByteBuffer}}s 
without adding more caching mechanisms.
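If we go that route, the change on {{CompressionCodec}} would look something 
like this (just a sketch, not a patch):
{code}
/**
 * @deprecated Use {@link #createInputStream(InputStream, Decompressor)}
 * with a Decompressor obtained from {@link CodecPool} instead, so that
 * direct buffers are reused.
 */
@Deprecated
CompressionInputStream createInputStream(InputStream in) throws IOException;
{code}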

> Compression codecs must use pooled direct buffers or deallocate direct 
> buffers when stream is closed
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-10591
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10591
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Hari Shreedharan
>            Assignee: Colin Patrick McCabe
>
> Currently, direct buffers allocated by compression codecs like Gzip (which 
> allocates 2 direct buffers per instance) are not deallocated when the stream 
> is closed. For long-running processes that create a huge number of files, 
> these direct buffers are left hanging until a full GC, which may or may not 
> happen in a reasonable amount of time - especially if the process does not 
> use much heap.
> Either these buffers should be pooled, or they should be deallocated when 
> the stream is closed.


