[
https://issues.apache.org/jira/browse/HADOOP-14376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Lowe updated HADOOP-14376:
--------------------------------
Summary: Memory leak when reading a compressed file using the native
library (was: Memory leak when reading a bzip2-compressed file using the
native library)
Thanks for the report, [~eliac]! This problem isn't specific to bzip2, as I
was able to reproduce the problem with both the gzip and zstandard codecs. I
updated the summary accordingly.
This looks like an accidental oversight introduced by HADOOP-10591. Before
that change the DecompressorStream close method was a superset of what
CompressionInputStream did, so skipping the parent's close was harmless.
It looks like LineRecordReader and some other users of codecs aren't
susceptible to this because they explicitly get the decompressor from the codec
pool, create the input stream, then explicitly return the decompressor to the
pool afterwards. I believe it's safe to try to return the same decompressor to
the pool multiple times, so we should be able to safely update the
DecompressorStream to call super.close() rather than in.close(). It should also
be straightforward to write a unit test that uses
CodecPool.getLeasedDecompressorsCount to verify that the decompressor is not
returned to the pool before the change and is returned afterwards.
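
To make the proposed fix and test concrete, here is a minimal, self-contained
sketch. It does not use the Hadoop classes: FakePool, FakeDecompressor,
ParentStream and ChildStream are hypothetical stand-ins for CodecPool, a
decompressor, CompressionInputStream and DecompressorStream, and the pool's
behavior (returning the same object twice is a no-op) is an assumption that
mirrors the belief stated above. The real patch and test would go against the
actual classes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

public class DecompressorLeakSketch {

    /** Hypothetical stand-in for a (native) decompressor. */
    static class FakeDecompressor {}

    /** Hypothetical stand-in for the codec pool; tracks leased objects. */
    static class FakePool {
        private final Set<FakeDecompressor> leased = new HashSet<>();

        FakeDecompressor lease() {
            FakeDecompressor d = new FakeDecompressor();
            leased.add(d);
            return d;
        }

        // Assumption: returning the same decompressor twice is harmless.
        void giveBack(FakeDecompressor d) {
            leased.remove(d);
        }

        int leasedCount() {
            return leased.size();
        }
    }

    /** Stand-in for the parent stream: close() also returns the decompressor. */
    static class ParentStream extends InputStream {
        protected final InputStream in;
        private final FakePool pool;
        private final FakeDecompressor tracked;

        ParentStream(InputStream in, FakePool pool, FakeDecompressor tracked) {
            this.in = in;
            this.pool = pool;
            this.tracked = tracked;
        }

        @Override
        public int read() throws IOException {
            return in.read();
        }

        @Override
        public void close() throws IOException {
            try {
                in.close();
            } finally {
                pool.giveBack(tracked);
            }
        }
    }

    /** Stand-in for the child stream that overrides close(). */
    static class ChildStream extends ParentStream {
        private final boolean callSuper;

        ChildStream(InputStream in, FakePool pool, FakeDecompressor tracked,
                    boolean callSuper) {
            super(in, pool, tracked);
            this.callSuper = callSuper;
        }

        @Override
        public void close() throws IOException {
            if (callSuper) {
                super.close();   // proposed fix: decompressor goes back to the pool
            } else {
                in.close();      // current behavior: decompressor stays leased
            }
        }
    }

    /** Opens and closes one stream, then reports how many leases remain. */
    static int leasedAfterClose(boolean callSuper) {
        FakePool pool = new FakePool();
        FakeDecompressor d = pool.lease();
        try {
            InputStream s = new ChildStream(
                new ByteArrayInputStream(new byte[0]), pool, d, callSuper);
            s.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pool.leasedCount();
    }

    public static void main(String[] args) {
        System.out.println("in.close():    leased = " + leasedAfterClose(false));
        System.out.println("super.close(): leased = " + leasedAfterClose(true));
    }
}
```

In this model the leased count stays at 1 when close() only closes the
underlying stream and drops to 0 when it delegates to the parent, which is the
same before/after check the unit test would make with
CodecPool.getLeasedDecompressorsCount.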
[~eliac] are you interested in taking a crack at the patch? If not then I
should be able to put up something later this week.
> Memory leak when reading a compressed file using the native library
> -------------------------------------------------------------------
>
> Key: HADOOP-14376
> URL: https://issues.apache.org/jira/browse/HADOOP-14376
> Project: Hadoop Common
> Issue Type: Bug
> Components: common, io
> Affects Versions: 2.7.0
> Reporter: Eli Acherkan
> Attachments: Bzip2MemoryTester.java, log4j.properties
>
>
> Opening and closing a large number of bzip2-compressed input streams causes
> the process to be killed on OutOfMemory when using the native bzip2 library.
> Our initial analysis suggests that this can be caused by
> {{DecompressorStream}} overriding the {{close()}} method, and therefore
> skipping the line from its parent:
> {{CodecPool.returnDecompressor(trackedDecompressor)}}. When the decompressor
> object is a {{Bzip2Decompressor}}, its native {{end()}} method is never
> called, and the allocated memory isn't freed.
> If this analysis is correct, the simplest way to fix this bug would be to
> replace {{in.close()}} with {{super.close()}} in {{DecompressorStream}}.