[ https://issues.apache.org/jira/browse/HADOOP-14376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated HADOOP-14376:
--------------------------------
    Summary: Memory leak when reading a compressed file using the native 
library  (was: Memory leak when reading a bzip2-compressed file using the 
native library)

Thanks for the report, [~eliac]!  This problem isn't specific to bzip2; I was 
able to reproduce it with both the gzip and zstandard codecs as well.  I 
updated the summary accordingly.

This looks like it may have been an accidental oversight when HADOOP-10591 was 
added.  Before that change, DecompressorStream's close method was a superset 
of what CompressionInputStream's close method did.
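To illustrate the oversight, here is a minimal, self-contained sketch (the class and method names below are illustrative stand-ins, not the actual Hadoop sources) of why the subclass's close() must delegate to super.close():

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Stand-in for CodecPool: hands out decompressors and takes them back.
class PoolSketch {
    private final Deque<Object> available = new ArrayDeque<>();
    private int leased = 0;

    Object get() {
        leased++;
        return available.isEmpty() ? new Object() : available.pop();
    }

    // A duplicate return of the same instance is a no-op.
    void release(Object d) {
        if (!available.contains(d)) {
            available.push(d);
            leased--;
        }
    }

    int leasedCount() { return leased; }
}

// Stand-in for CompressionInputStream: its close() returns the tracked
// decompressor to the pool (which is where native memory would be freed).
class ParentStream implements AutoCloseable {
    final PoolSketch pool;
    final Object trackedDecompressor;

    ParentStream(PoolSketch pool) {
        this.pool = pool;
        this.trackedDecompressor = pool.get();
    }

    @Override
    public void close() {
        pool.release(trackedDecompressor);
    }
}

// The bug: overriding close() and only closing the wrapped stream skips the
// parent's return-to-pool step, so the decompressor stays leased forever.
class LeakyStream extends ParentStream {
    LeakyStream(PoolSketch pool) { super(pool); }
    @Override public void close() { /* effectively in.close() only */ }
}

// The proposed fix: delegate to super.close() so the decompressor is returned.
class FixedStream extends ParentStream {
    FixedStream(PoolSketch pool) { super(pool); }
    @Override public void close() { super.close(); }
}
```

Closing a LeakyStream leaves the pool's leased count at 1, while closing a FixedStream brings it back to 0, which is the same symptom the reproducer shows at the native level.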

It looks like LineRecordReader and some other users of codecs aren't 
susceptible to this, because they explicitly get the decompressor from the 
codec pool, create the input stream, and then explicitly return the 
decompressor to the pool afterwards.  I believe it's safe to return the same 
decompressor to the pool multiple times, so we should be able to safely update 
DecompressorStream to call super.close() rather than in.close().  It should 
also be straightforward to write a unit test that uses 
CodecPool.getLeasedDecompressorsCount to verify the decompressor is not 
returned to the pool before the change and is afterwards.
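As a sketch of that discipline (again with illustrative names; this mock CodecPool is not Hadoop's), the caller fetches and returns the decompressor itself, and an idempotent pool makes a second return from the stream's close() harmless:

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Mock pool whose returnDecompressor is idempotent, mirroring the assumption
// that returning the same decompressor twice is safe.
class CodecPoolSketch {
    private final Map<Object, Boolean> pooled = new IdentityHashMap<>();
    private int leased = 0;

    Object getDecompressor() {
        leased++;
        return new Object();
    }

    void returnDecompressor(Object d) {
        // Only the first return of a given instance decrements the count.
        if (pooled.putIfAbsent(d, Boolean.TRUE) == null) {
            leased--;
        }
    }

    int getLeasedDecompressorsCount() { return leased; }
}

class PoolDemo {
    public static void main(String[] args) {
        // LineRecordReader-style usage:
        CodecPoolSketch pool = new CodecPoolSketch();
        Object decompressor = pool.getDecompressor();
        // ... create the input stream with the decompressor, read records ...
        pool.returnDecompressor(decompressor);  // explicit return by the caller
        pool.returnDecompressor(decompressor);  // duplicate return from a fixed close(): no-op
        System.out.println(pool.getLeasedDecompressorsCount());  // prints 0, not -1
    }
}
```

A leased-count check before and after close(), in the spirit of the getLeasedDecompressorsCount idea above, is exactly what the proposed unit test would assert.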

[~eliac] are you interested in taking a crack at the patch?  If not then I 
should be able to put up something later this week.

> Memory leak when reading a compressed file using the native library
> -------------------------------------------------------------------
>
>                 Key: HADOOP-14376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14376
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common, io
>    Affects Versions: 2.7.0
>            Reporter: Eli Acherkan
>         Attachments: Bzip2MemoryTester.java, log4j.properties
>
>
> Opening and closing a large number of bzip2-compressed input streams causes 
> the process to be killed with an OutOfMemoryError when using the native bzip2 
> library.
> Our initial analysis suggests that this can be caused by 
> {{DecompressorStream}} overriding the {{close()}} method, and therefore 
> skipping the line from its parent: 
> {{CodecPool.returnDecompressor(trackedDecompressor)}}. When the decompressor 
> object is a {{Bzip2Decompressor}}, its native {{end()}} method is never 
> called, and the allocated memory isn't freed.
> If this analysis is correct, the simplest way to fix this bug would be to 
> replace {{in.close()}} with {{super.close()}} in {{DecompressorStream}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
