[
https://issues.apache.org/jira/browse/HADOOP-14376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eli Acherkan updated HADOOP-14376:
----------------------------------
Attachment: HADOOP-14376.002.patch
Thanks [~jlowe]. Those are excellent points; I completely agree that the patch
introduced subtle behavioral differences when some of the streams throw
exceptions on close(). My previous reasoning was that in that case something
has probably gone horribly and irrevocably wrong anyway. But following your
comments, I prepared
a more defensive patch, in which even if some of the close() or finish()
methods throw exceptions we still try to close/recover what we can. The price
of this is assuming that it's okay to call the close() method of a stream
multiple times.
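To make that multiple-close assumption concrete, here's a minimal standalone sketch of an idempotent close() (IdempotentStream is an illustrative name, not one of the actual Hadoop classes):

```java
import java.io.IOException;
import java.io.OutputStream;

// Sketch: a close() that tolerates being called more than once, which is
// the assumption patch 002 relies on when it closes streams defensively.
class IdempotentStream extends OutputStream {
    private boolean closed = false;
    int closeCount = 0; // exposed only to demonstrate idempotence

    @Override
    public void write(int b) throws IOException {
        if (closed) throw new IOException("stream closed");
    }

    @Override
    public void close() {
        if (closed) return; // second and later calls are no-ops
        closed = true;
        closeCount++;       // the real release work happens only once
    }

    public static void main(String[] args) {
        IdempotentStream s = new IdempotentStream();
        s.close();
        s.close(); // safe: no exception, no double release
        System.out.println(s.closeCount); // prints 1
    }
}
```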
*BZip2CompressionOutputStream*:
bq. Other maintainers will come along and see that BZip2CompressionOutputStream
never calls close() on one of its private member streams which is usually a
bug. Even if the close() ends up being redundant today that doesn't mean it
always will. The root cause for this JIRA is a great example.
In patch 002 I put back the BZip2CompressionOutputStream.close() method with a
call to super.close() and some explanatory documentation. It still seems to me
that calling super.close() should be sufficient, let me try to explain why.
bq. I'm not seeing how the superclass's finish() method ever ends up closing
the out stream. I see it write some final bytes and sets it to null, which in
turn prevents the close() method from trying to call out.close(), so I'm
wondering how the output stream normally gets closed.
My understanding is that the output stream _does_ get closed, thanks to the
out.close() call in CompressionOutputStream.close(). The {{out}} data member of
CBZip2OutputStream is indeed nullified, but the {{out}} data member of
CompressionOutputStream should still reference the actual stream object.
My reasoning was: the only difference between
BZip2CompressionOutputStream.finish() and close() is that
BZip2CompressionOutputStream.finish() calls output.finish(), whereas
BZip2CompressionOutputStream.close() calls output.flush() and output.close().
Changing BZip2CompressionOutputStream.close() to call super.close() means that
we invoke only finish() instead of flush() and close(). Looking at
CBZip2OutputStream (which can be the only class of the {{output}} data member
in the current implementation), it seems to me that it's okay to invoke
finish() instead of flush() + close(), because the only difference between them
is calling out.flush() + out.close(). As I said above, out.close() will be
called anyway by CompressionOutputStream.close(), and I'm assuming that any
reasonable stream calls flush() internally on close().
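To illustrate the delegation argument above with a minimal standalone sketch: even if the inner compressor drops its own reference to the raw stream in finish(), the wrapper still closes it, because the wrapper holds its own reference. InnerCompressor and Wrapper are illustrative stand-ins for CBZip2OutputStream and CompressionOutputStream, not the real Hadoop classes:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

class WrapperCloseDemo {
    // Raw stream that records whether close() was ever called on it.
    static class TrackingStream extends ByteArrayOutputStream {
        boolean rawClosed = false;
        @Override public void close() throws IOException {
            rawClosed = true;
            super.close();
        }
    }

    // Plays the role of CBZip2OutputStream: finish() writes trailer bytes
    // and nulls its reference, so it can no longer reach the raw stream.
    static class InnerCompressor {
        private OutputStream out;
        InnerCompressor(OutputStream out) { this.out = out; }
        void finish() throws IOException {
            out.write('!'); // final "trailer" byte
            out.flush();
            out = null;     // reference dropped; raw stream NOT closed here
        }
    }

    // Plays the role of CompressionOutputStream: keeps its own reference
    // to the raw stream and closes it after finish().
    static class Wrapper {
        private final InnerCompressor inner;
        private final OutputStream out;
        Wrapper(InnerCompressor inner, OutputStream out) {
            this.inner = inner;
            this.out = out;
        }
        void close() throws IOException {
            inner.finish();
            out.close(); // the raw stream is closed here, by the wrapper
        }
    }

    static boolean demo() {
        try {
            TrackingStream raw = new TrackingStream();
            new Wrapper(new InnerCompressor(raw), raw).close();
            return raw.rawClosed;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints true: raw stream got closed
    }
}
```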
*BZip2CompressionInputStream*:
In BZip2CompressionInputStream, patch 002 puts the call to super.close() in a
finally block. This preserves the previous logic (set needsReset to true only
if input.close() didn't throw) while ensuring that super.close() will
unconditionally close the {{in}} stream and return the trackedDecompressor to
the pool.
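A minimal standalone sketch of that shape (illustrative names; inputClose() and superClose() stand in for input.close() and super.close(), and returnedToPool stands in for returning the tracked decompressor):

```java
import java.io.IOException;

// Sketch: needsReset is set only when input.close() succeeds, but the
// super.close() work (closing `in`, returning the decompressor to the
// pool) runs unconditionally via finally.
class FinallyCloseDemo {
    boolean needsReset = false;
    boolean returnedToPool = false;
    private final boolean inputThrows;

    FinallyCloseDemo(boolean inputThrows) { this.inputThrows = inputThrows; }

    private void inputClose() throws IOException {
        if (inputThrows) throw new IOException("input.close() failed");
    }

    private void superClose() {
        returnedToPool = true; // stands in for CompressionInputStream.close()
    }

    void close() throws IOException {
        try {
            inputClose();
            needsReset = true; // only when input.close() succeeded
        } finally {
            superClose();      // always runs, even if inputClose() threw
        }
    }

    public static void main(String[] args) {
        FinallyCloseDemo ok = new FinallyCloseDemo(false);
        try { ok.close(); } catch (IOException ignored) { }
        System.out.println(ok.needsReset + " " + ok.returnedToPool);   // true true

        FinallyCloseDemo bad = new FinallyCloseDemo(true);
        try { bad.close(); } catch (IOException ignored) { }
        System.out.println(bad.needsReset + " " + bad.returnedToPool); // false true
    }
}
```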
*CompressorStream/CompressionInput/OutputStream*:
bq. This is subtly different than the previous code because finish() can throw.
In the old code, finish() could throw and out.close() would still be called but
now we'll skip calling out.close() but still set closed=true so we can't retry
the close. (...) Similarly the CompressionInputStream/CompressionOutputStream
code won't return the codec to the pool if finish() throws or the underlying
stream's close() throws.
In patch 002 I wrapped each of CompressionInput/OutputStream.close()'s internal
steps in try/finally. (For CompressionOutputStream.close() this leaves the
corner case of both finish() _and_ out.close() throwing an exception each, but
I think it's reasonable that only one of them will be propagated since it's a
doomed stream anyway.) This brings the behavior of the patched
CompressorStream.close() back to what it was before my changes: if finish() throws,
out.close() is still called.
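A minimal standalone sketch of that try/finally shape (illustrative names, not the actual patched code): finish() may throw, but out.close() is still attempted, and only one exception propagates to the caller:

```java
import java.io.IOException;

// Sketch: wrapping finish() in try/finally guarantees the out.close()
// step runs even when finish() throws.
class TryFinallyCloseDemo {
    private final boolean finishThrows;
    boolean outClosed = false;

    TryFinallyCloseDemo(boolean finishThrows) { this.finishThrows = finishThrows; }

    private void finish() throws IOException {
        if (finishThrows) throw new IOException("finish failed");
    }

    void close() throws IOException {
        try {
            finish();
        } finally {
            outClosed = true; // stands in for out.close(): always attempted
        }
    }

    public static void main(String[] args) {
        TryFinallyCloseDemo d = new TryFinallyCloseDemo(true);
        try {
            d.close();
        } catch (IOException e) {
            System.out.println("propagated: " + e.getMessage());
        }
        System.out.println("out closed anyway: " + d.outClosed); // true
    }
}
```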
> Memory leak when reading a compressed file using the native library
> -------------------------------------------------------------------
>
> Key: HADOOP-14376
> URL: https://issues.apache.org/jira/browse/HADOOP-14376
> Project: Hadoop Common
> Issue Type: Bug
> Components: common, io
> Affects Versions: 2.7.0
> Reporter: Eli Acherkan
> Assignee: Eli Acherkan
> Attachments: Bzip2MemoryTester.java, HADOOP-14376.001.patch,
> HADOOP-14376.002.patch, log4j.properties
>
>
> Opening and closing a large number of bzip2-compressed input streams causes
> the process to be killed on OutOfMemory when using the native bzip2 library.
> Our initial analysis suggests that this can be caused by
> {{DecompressorStream}} overriding the {{close()}} method, and therefore
> skipping the line from its parent:
> {{CodecPool.returnDecompressor(trackedDecompressor)}}. When the decompressor
> object is a {{Bzip2Decompressor}}, its native {{end()}} method is never
> called, and the allocated memory isn't freed.
> If this analysis is correct, the simplest way to fix this bug would be to
> replace {{in.close()}} with {{super.close()}} in {{DecompressorStream}}.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]