[
https://issues.apache.org/jira/browse/COMPRESS-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134945#comment-15134945
]
Dawid Weiss commented on COMPRESS-333:
--------------------------------------
I couldn't see much difference between the code you currently have and the one
from Hadoop (speed-wise) once I switched buffered input on -- this was the core
of the issue, not the decompression implementation itself. Perhaps (very
likely!) there are still nuances in the code to improve upon, but I won't have
the time to look into it (I'd rather look into implementing random access to 7z
archive entries).
I'm glad I found the buffered/non-buffered issue, though, as the previous
difference in performance was quite shocking. :)
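
To illustrate why buffering can matter this much, here is a minimal sketch using only the JDK. It assumes a decoder that pulls its input one byte at a time, and it counts how many read calls reach the underlying stream with and without a BufferedInputStream in between. The class and method names (BufferedReadDemo, CallCountingStream, underlyingCalls) are made up for this example and are not part of Commons Compress; the 8192-byte buffer size is likewise just an assumption.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class BufferedReadDemo {
    /** Hypothetical helper: counts how many read calls reach the wrapped stream. */
    static final class CallCountingStream extends FilterInputStream {
        long calls;

        CallCountingStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return in.read(b, off, len);
        }
    }

    /** Reads everything one byte at a time (assumed decoder access pattern). */
    static long drainByteByByte(InputStream in) throws IOException {
        long n = 0;
        while (in.read() != -1) {
            n++;
        }
        return n;
    }

    /** Returns the number of read calls that hit the underlying source. */
    static long underlyingCalls(int size, boolean buffered) throws IOException {
        CallCountingStream source =
                new CallCountingStream(new ByteArrayInputStream(new byte[size]));
        InputStream in = buffered ? new BufferedInputStream(source, 8192) : source;
        drainByteByByte(in);
        return source.calls;
    }

    public static void main(String[] args) throws IOException {
        int size = 1 << 20; // 1 MiB of dummy input
        System.out.println("unbuffered calls: " + underlyingCalls(size, false));
        System.out.println("buffered calls:   " + underlyingCalls(size, true));
    }
}
```

In the 7z case each underlying read presumably passes through several wrapper streams, so cutting the number of calls by orders of magnitude would be consistent with the speed-up measured in the issue description.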
> bz2 stream decompressor is 10x slower than it could be
> ------------------------------------------------------
>
> Key: COMPRESS-333
> URL: https://issues.apache.org/jira/browse/COMPRESS-333
> Project: Commons Compress
> Issue Type: Improvement
> Reporter: Dawid Weiss
>
> This is related to COMPRESS-291. In short: decompressing 7z archives was an
> order of magnitude slower in Java than with native tooling.
> My investigation showed that the problematic archive used bz2 streams inside.
> I then did a quick hack-experiment which took bz2 decompressor from the
> Apache Hadoop project (the Java version, not the native one) and replaced the
> default one used for bz2 stream decompression of the 7z archiver in commons.
> I then ran a quick benchmark on this file:
> {code}
> https://archive.org/download/stackexchange/english.stackexchange.com.7z
> {code}
> The decompression speeds are (SSD, the file was essentially fully cached in
> memory, so everything is CPU bound):
> {code}
> native 7za: 13 seconds
> Commons (original): 222 seconds
> Commons (patched w/Hadoop bz2): 30 seconds
> Commons (patched w/BufferedInputStream): 28 seconds
> {code}
> Yes, it's still more than twice as slow as native code, but it's no longer
> glacially slow...
> My patch is a quick and dirty proof of concept (not committable, see [1]),
> but it passes the tests. Some notes:
> - Hadoop's stream isn't suited for handling concatenated bz2 streams, it'd
> have to be either patched in the code or (better) decorated at a level above
> the low-level decoder.
> - I only substituted the decompressor in 7z, but obviously this could benefit
> other places (zip, etc.); essentially, I'd remove
> BZip2CompressorInputStream entirely.
> - while I toyed around with the above idea I noticed a really annoying thing
> -- all streams are required to extend {{CompressorInputStream}}, which only
> adds one method to count the number of consumed bytes. This complicates the
> code and makes plugging in other implementations of InputStreams more
> cumbersome. I could get rid of CompressorInputStream entirely with a few
> minor changes to the code, but obviously this would be backward incompatible
> (see [2]).
> References:
> [1] GitHub fork, {{bzip2}} branch:
> https://github.com/dweiss/commons-compress/tree/bzip2
> [2] Removal and cleanup of CompressorInputStream:
> https://github.com/dweiss/commons-compress/commit/6948ed371e8ed6e6b69b96ee936d1455cbfd6458
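
The third note above suggests that byte counting does not have to live in a mandatory base class. One possible alternative, sketched here with JDK classes only, is a small decorator that can wrap any InputStream. ByteCountingInputStream and getBytesRead are hypothetical names chosen for this example; this is not the code from the linked commit, and it counts bytes handed out by the wrapped stream, which may differ from what CompressorInputStream reports.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical decorator: counts bytes consumed from any wrapped InputStream. */
final class ByteCountingInputStream extends FilterInputStream {
    private long bytesRead;

    ByteCountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b != -1) {
            bytesRead++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = in.skip(n);
        if (skipped > 0) {
            bytesRead += skipped;
        }
        return skipped;
    }

    /** mark/reset would make the count ambiguous, so this sketch disables it. */
    @Override
    public boolean markSupported() {
        return false;
    }

    long getBytesRead() {
        return bytesRead;
    }

    public static void main(String[] args) throws IOException {
        ByteCountingInputStream in =
                new ByteCountingInputStream(new ByteArrayInputStream(new byte[100]));
        in.read(new byte[32]);
        System.out.println("bytes consumed so far: " + in.getBytesRead());
    }
}
```

With a decorator like this, any decoder that is a plain InputStream could be plugged in and wrapped only where a byte count is actually needed.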