[ https://issues.apache.org/jira/browse/COMPRESS-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130402#comment-15130402 ]
ASF GitHub Bot commented on COMPRESS-333:
-----------------------------------------
GitHub user dweiss opened a pull request:
https://github.com/apache/commons-compress/pull/7
COMPRESS-333: adds buffering on top of RandomAccessFile.
Speeds up 7z handling by an order of magnitude as a result.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dweiss/commons-compress COMPRESS-333
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/commons-compress/pull/7.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7
----
commit 369c3165501681ec823cd84957a07b8825321862
Author: Dawid Weiss <[email protected]>
Date: 2016-02-03T13:43:29Z
COMPRESS-333: adds buffering on top of RandomAccessFile. Speeds up 7z
handling by an order of magnitude as a result.
----
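For context, the gist of the change, sketched rather than quoted from the
patch (the adapter class name and buffer size here are made up): expose the
RandomAccessFile as a plain InputStream and put a BufferedInputStream in
front of it, so the decoder's byte-at-a-time reads hit a heap buffer instead
of a native file call each time.
{code}
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;

// Hypothetical adapter: presents a RandomAccessFile as an InputStream so it
// can be wrapped in a BufferedInputStream.
class RandomAccessFileInputStream extends InputStream {
    private final RandomAccessFile file;

    RandomAccessFileInputStream(RandomAccessFile file) {
        this.file = file;
    }

    @Override
    public int read() throws IOException {
        return file.read(); // one native call per byte when used unbuffered
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return file.read(b, off, len);
    }
}

// Usage: byte-at-a-time decoders now read from a 64 KB heap buffer.
// RandomAccessFile raf = new RandomAccessFile("archive.7z", "r");
// InputStream in = new BufferedInputStream(
//         new RandomAccessFileInputStream(raf), 64 * 1024);
{code}
A real version also has to discard buffered bytes whenever the 7z reader
seeks to a new offset; the sketch omits that.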
> bz2 stream decompressor is 10x slower than it could be
> ------------------------------------------------------
>
> Key: COMPRESS-333
> URL: https://issues.apache.org/jira/browse/COMPRESS-333
> Project: Commons Compress
> Issue Type: Improvement
> Reporter: Dawid Weiss
>
> This is related to COMPRESS-291. In short: decompressing 7z archives was an
> order of magnitude slower in Java than with native tooling.
> My investigation showed that the problematic archive used bz2 streams inside.
> I then did a quick hack-experiment which took the bz2 decompressor from the
> Apache Hadoop project (the Java version, not the native one) and replaced the
> default one used for bz2 stream decompression in the 7z archiver in Commons.
> I then ran a quick benchmark on this file:
> {code}
> https://archive.org/download/stackexchange/english.stackexchange.com.7z
> {code}
> The decompression times (SSD; the file was essentially fully cached in
> memory, so everything is CPU-bound):
> {code}
> native 7za:                              13 seconds
> Commons (original):                     222 seconds
> Commons (patched w/Hadoop bz2):          30 seconds
> Commons (patched w/BufferedInputStream): 28 seconds
> {code}
> Yes, it's still 3 times slower than native code, but it's no longer glacially
> slow...
> My patch is a quick and dirty proof of concept (not committable, see [1]),
> but it passes the tests. Some notes:
> - Hadoop's stream isn't suited to handling concatenated bz2 streams; it'd
> have to be either patched in the code or (better) decorated at a level above
> the low-level decoder (see the sketch after this list).
> - I only substituted the decompressor in 7z, but obviously other places
> (zip, etc.) could benefit too; essentially, I'd remove
> BZip2CompressorInputStream entirely.
> - While I toyed around with the above idea I noticed a really annoying
> thing: all streams are required to extend {{CompressorInputStream}}, which
> only adds one method to count the number of consumed bytes. This complicates
> the code and makes plugging in other implementations of InputStream more
> cumbersome. I could get rid of CompressorInputStream entirely with a few
> minor changes to the code, but obviously this would be backward-incompatible
> (see [2] and the counting sketch after the references).
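> A sketch of the decorator idea from the first note above. The class and
> factory names are made up and this isn't Hadoop's actual API; it only shows
> the restart-on-next-stream shape a wrapper above the low-level decoder
> would take:
> {code}
> import java.io.IOException;
> import java.io.InputStream;
>
> // Hypothetical decorator: when the single-stream decoder hits EOF, check
> // the raw input for another stream and open a fresh decoder over it.
> class ConcatenatedStreamInputStream extends InputStream {
>     interface DecoderFactory {
>         InputStream newDecoder(InputStream raw) throws IOException;
>     }
>
>     private final InputStream raw;
>     private final DecoderFactory factory;
>     private InputStream current;
>
>     ConcatenatedStreamInputStream(InputStream raw, DecoderFactory factory)
>             throws IOException {
>         this.raw = raw;
>         this.factory = factory;
>         this.current = factory.newDecoder(raw);
>     }
>
>     @Override
>     public int read() throws IOException {
>         int b = current.read();
>         if (b < 0 && moreDataFollows()) {
>             current = factory.newDecoder(raw); // next concatenated stream
>             b = current.read();
>         }
>         return b;
>     }
>
>     // A real version would peek for the next "BZh" magic; "any raw bytes
>     // left" is only a rough stand-in for the sketch.
>     private boolean moreDataFollows() throws IOException {
>         return raw.available() > 0;
>     }
> }
> {code}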
> References:
> [1] GitHub fork, {{bzip2}} branch:
> https://github.com/dweiss/commons-compress/tree/bzip2
> [2] Removal and cleanup of CompressorInputStream:
> https://github.com/dweiss/commons-compress/commit/6948ed371e8ed6e6b69b96ee936d1455cbfd6458
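> To make [2] concrete: the byte counting that CompressorInputStream exists
> for can be done by decoration instead of inheritance. A minimal sketch (the
> class name is made up), which would let decoders stay plain InputStreams:
> {code}
> import java.io.FilterInputStream;
> import java.io.IOException;
> import java.io.InputStream;
>
> // Hypothetical stand-in for the inheritance hook: count consumed bytes in
> // a wrapper around any InputStream.
> class ByteCountingInputStream extends FilterInputStream {
>     private long bytesRead;
>
>     ByteCountingInputStream(InputStream in) {
>         super(in);
>     }
>
>     @Override
>     public int read() throws IOException {
>         int b = in.read();
>         if (b >= 0) {
>             bytesRead++;
>         }
>         return b;
>     }
>
>     @Override
>     public int read(byte[] b, int off, int len) throws IOException {
>         int n = in.read(b, off, len);
>         if (n > 0) {
>             bytesRead += n;
>         }
>         return n;
>     }
>
>     long getBytesRead() {
>         return bytesRead;
>     }
> }
> {code}
> Commons IO's CountingInputStream is essentially this shape already, which is
> the kind of drop-in decoration the cleanup would allow.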