[
https://issues.apache.org/jira/browse/HADOOP-10591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998011#comment-13998011
]
Colin Patrick McCabe commented on HADOOP-10591:
-----------------------------------------------
Hmm. The JIRA talks about "direct buffers allocated by compression codecs like
Gzip (which allocates 2 direct buffers per instance)."
I assume this is a reference to {{ZlibDecompressor#compressedDirectBuf}} and
{{ZlibDecompressor#uncompressedDirectBuf}}. Those are buffers inside
{{Decompressor}} objects, not buffers inside {{Codec}} objects.
*However*... {{CodecPool}} has a cache for {{Compressor}} and {{Decompressor}}
objects, but using it seems to be optional. For example, this code in
{{SequenceFile}} is careful to use the {{Decompressor}} cache:
{code}
keyLenDecompressor = CodecPool.getDecompressor(codec);
keyLenInFilter = codec.createInputStream(keyLenBuffer,
    keyLenDecompressor);
{code}
On the other hand, there are also one-argument versions of the
{{createInputStream}} methods that always create a new {{Decompressor}} (and
similar one-argument versions of {{createOutputStream}}).
What's the right resolution here? Is it just to mark the one-argument versions
as deprecated and audit HDFS and Hadoop client programs to remove usages? That
certainly seems like a good idea, if we want to cache these {{ByteBuffers}}
without adding more caching mechanisms.
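To make the difference concrete, here is a rough, JDK-only sketch of the borrow-and-return pattern. It does not use Hadoop's classes: {{FakeDecompressor}}, {{Pool}}, {{borrow}}, {{giveBack}} and {{openManyStreams}} are hypothetical names made up for illustration, and the pool is only a simplified stand-in for what {{CodecPool}} does. The point is that when every caller borrows and returns, the two direct buffers are allocated once and reused, rather than once per stream.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: a simplified model of pooling objects that own direct buffers.
final class PooledDecompressorSketch {

    // Hypothetical stand-in for a decompressor holding two direct buffers.
    static final class FakeDecompressor {
        final ByteBuffer compressedDirectBuf;
        final ByteBuffer uncompressedDirectBuf;

        FakeDecompressor(int size) {
            compressedDirectBuf = ByteBuffer.allocateDirect(size);
            uncompressedDirectBuf = ByteBuffer.allocateDirect(size);
        }

        // Make the buffers reusable by the next borrower.
        void reset() {
            compressedDirectBuf.clear();
            uncompressedDirectBuf.clear();
        }
    }

    // Hypothetical pool: hands out idle instances, allocating only when empty.
    static final class Pool {
        private final Deque<FakeDecompressor> idle = new ArrayDeque<>();
        private final int bufferSize;
        private int created = 0;

        Pool(int bufferSize) {
            this.bufferSize = bufferSize;
        }

        synchronized FakeDecompressor borrow() {
            FakeDecompressor d = idle.pollFirst();
            if (d == null) {
                d = new FakeDecompressor(bufferSize);
                created++;
            }
            return d;
        }

        synchronized void giveBack(FakeDecompressor d) {
            d.reset();
            idle.addFirst(d);
        }

        synchronized int createdCount() {
            return created;
        }
    }

    // Simulates opening and closing many streams one after another.
    static int openManyStreams(Pool pool, int streams) {
        for (int i = 0; i < streams; i++) {
            FakeDecompressor d = pool.borrow();
            try {
                d.compressedDirectBuf.put((byte) i); // pretend to do some work
            } finally {
                pool.giveBack(d); // always return, even on failure
            }
        }
        return pool.createdCount();
    }

    public static void main(String[] args) {
        Pool pool = new Pool(64 * 1024);
        System.out.println("instances created: " + openManyStreams(pool, 1000));
    }
}
```

In this sketch, 1000 sequential open/close cycles create a single instance. A code path that skips the pool would instead allocate two fresh direct buffers per stream and leave them for the garbage collector, which is the behaviour the issue describes.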
> Compression codecs must use pooled direct buffers or deallocate direct
> buffers when stream is closed
> -----------------------------------------------------------------------------------------------------
>
> Key: HADOOP-10591
> URL: https://issues.apache.org/jira/browse/HADOOP-10591
> Project: Hadoop Common
> Issue Type: Bug
> Reporter: Hari Shreedharan
> Assignee: Colin Patrick McCabe
>
> Currently, direct buffers allocated by compression codecs like Gzip (which
> allocates 2 direct buffers per instance) are not deallocated when the stream
> is closed. In long-running processes that create a huge number of files,
> these direct buffers linger until a full GC, which may or may not happen in a
> reasonable amount of time, especially if the process does not use much heap.
> Either these buffers should be pooled or they should be deallocated when the
> stream is closed.