[ https://issues.apache.org/jira/browse/COMPRESS-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130402#comment-15130402 ]
ASF GitHub Bot commented on COMPRESS-333:
-----------------------------------------
GitHub user dweiss opened a pull request:
https://github.com/apache/commons-compress/pull/7
COMPRESS-333: adds buffering on top of RandomAccessFile.
Speeds up 7z handling by an order of magnitude as a result.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dweiss/commons-compress COMPRESS-333
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/commons-compress/pull/7.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7
----
commit 369c3165501681ec823cd84957a07b8825321862
Author: Dawid Weiss <[email protected]>
Date: 2016-02-03T13:43:29Z
COMPRESS-333: adds buffering on top of RandomAccessFile. Speeds up 7z
handling by an order of magnitude as a result.
----
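For context, the gist of the change, sketched rather than quoted from the
patch (the adapter class name and buffer size here are made up): expose the
RandomAccessFile as a plain InputStream and put a BufferedInputStream in
front of it, so the decoder's byte-at-a-time reads hit a heap buffer instead
of a native file call each time.
{code}
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;

// Hypothetical adapter: presents a RandomAccessFile as an InputStream so it
// can be wrapped in a BufferedInputStream.
class RandomAccessFileInputStream extends InputStream {
    private final RandomAccessFile file;

    RandomAccessFileInputStream(RandomAccessFile file) {
        this.file = file;
    }

    @Override
    public int read() throws IOException {
        return file.read(); // one native call per byte when used unbuffered
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return file.read(b, off, len);
    }
}

// Usage: byte-at-a-time decoders now read from a 64 KB heap buffer.
// RandomAccessFile raf = new RandomAccessFile("archive.7z", "r");
// InputStream in = new BufferedInputStream(
//         new RandomAccessFileInputStream(raf), 64 * 1024);
{code}
A real version also has to discard buffered bytes whenever the 7z reader
seeks to a new offset; the sketch omits that.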
> bz2 stream decompressor is 10x slower than it could be
> ------------------------------------------------------
>
> Key: COMPRESS-333
> URL: https://issues.apache.org/jira/browse/COMPRESS-333
> Project: Commons Compress
> Issue Type: Improvement
> Reporter: Dawid Weiss
>
> This is related to COMPRESS-291. In short: decompressing 7z archives was an
> order of magnitude slower in Java than with native tooling.
> My investigation showed that the problematic archive used bz2 streams inside.
> I then did a quick hack-experiment which took the bz2 decompressor from the
> Apache Hadoop project (the Java version, not the native one) and replaced the
> default one used for bz2 stream decompression in the 7z archiver in Commons.
> I then ran a quick benchmark on this file:
> {code}
> https://archive.org/download/stackexchange/english.stackexchange.com.7z
> {code}
> The decompression times (SSD; the file was essentially fully cached in
> memory, so everything is CPU-bound):
> {code}
> native 7za:                              13 seconds
> Commons (original):                     222 seconds
> Commons (patched w/Hadoop bz2):          30 seconds
> Commons (patched w/BufferedInputStream): 28 seconds
> {code}
> Yes, it's still 3 times slower than native code, but it's no longer glacially
> slow...
> My patch is a quick and dirty proof of concept (not committable, see [1]),
> but it passes the tests. Some notes:
> - Hadoop's stream isn't suited to handling concatenated bz2 streams; it'd
> have to be either patched in the code or (better) decorated at a level above
> the low-level decoder (see the sketch after this list).
> - I only substituted the decompressor in 7z, but obviously other places
> (zip, etc.) could benefit too; essentially, I'd remove
> BZip2CompressorInputStream entirely.
> - While I toyed around with the above idea I noticed a really annoying
> thing: all streams are required to extend {{CompressorInputStream}}, which
> only adds one method to count the number of consumed bytes. This complicates
> the code and makes plugging in other implementations of InputStream more
> cumbersome. I could get rid of CompressorInputStream entirely with a few
> minor changes to the code, but obviously this would be backward-incompatible
> (see [2] and the counting sketch after the references).
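> A sketch of the decorator idea from the first note above. The class and
> factory names are made up and this isn't Hadoop's actual API; it only shows
> the restart-on-next-stream shape a wrapper above the low-level decoder
> would take:
> {code}
> import java.io.IOException;
> import java.io.InputStream;
>
> // Hypothetical decorator: when the single-stream decoder hits EOF, check
> // the raw input for another stream and open a fresh decoder over it.
> class ConcatenatedStreamInputStream extends InputStream {
>     interface DecoderFactory {
>         InputStream newDecoder(InputStream raw) throws IOException;
>     }
>
>     private final InputStream raw;
>     private final DecoderFactory factory;
>     private InputStream current;
>
>     ConcatenatedStreamInputStream(InputStream raw, DecoderFactory factory)
>             throws IOException {
>         this.raw = raw;
>         this.factory = factory;
>         this.current = factory.newDecoder(raw);
>     }
>
>     @Override
>     public int read() throws IOException {
>         int b = current.read();
>         if (b < 0 && moreDataFollows()) {
>             current = factory.newDecoder(raw); // next concatenated stream
>             b = current.read();
>         }
>         return b;
>     }
>
>     // A real version would peek for the next "BZh" magic; "any raw bytes
>     // left" is only a rough stand-in for the sketch.
>     private boolean moreDataFollows() throws IOException {
>         return raw.available() > 0;
>     }
> }
> {code}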
> References:
> [1] GitHub fork, {{bzip2}} branch:
> https://github.com/dweiss/commons-compress/tree/bzip2
> [2] Removal and cleanup of CompressorInputStream:
> https://github.com/dweiss/commons-compress/commit/6948ed371e8ed6e6b69b96ee936d1455cbfd6458
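> To make [2] concrete: the byte counting that CompressorInputStream exists
> for can be done by decoration instead of inheritance. A minimal sketch (the
> class name is made up), which would let decoders stay plain InputStreams:
> {code}
> import java.io.FilterInputStream;
> import java.io.IOException;
> import java.io.InputStream;
>
> // Hypothetical stand-in for the inheritance hook: count consumed bytes in
> // a wrapper around any InputStream.
> class ByteCountingInputStream extends FilterInputStream {
>     private long bytesRead;
>
>     ByteCountingInputStream(InputStream in) {
>         super(in);
>     }
>
>     @Override
>     public int read() throws IOException {
>         int b = in.read();
>         if (b >= 0) {
>             bytesRead++;
>         }
>         return b;
>     }
>
>     @Override
>     public int read(byte[] b, int off, int len) throws IOException {
>         int n = in.read(b, off, len);
>         if (n > 0) {
>             bytesRead += n;
>         }
>         return n;
>     }
>
>     long getBytesRead() {
>         return bytesRead;
>     }
> }
> {code}
> Commons IO's CountingInputStream is essentially this shape already, which is
> the kind of drop-in decoration the cleanup would allow.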