[jira] [Commented] (COMPRESS-333) bz2 stream decompressor is 10x slower than it could be

Dawid Weiss (JIRA) Fri, 05 Feb 2016 12:44:57 -0800

    [ 
https://issues.apache.org/jira/browse/COMPRESS-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134945#comment-15134945
 ]


Dawid Weiss commented on COMPRESS-333:
--------------------------------------

Once I switched buffered input on, I couldn't see much difference in speed 
between the code you currently have and the one from Hadoop -- the missing 
buffering was the core of the issue, not the decompression implementation 
itself. There are very likely still nuances in the code that could be improved, 
but I won't have the time to look into them (I'd rather look into implementing 
random access to 7z archive entries).

I'm glad I found the buffered/non-buffered issue, though, as the previous 
difference in performance was quite shocking. :)
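
To illustrate why buffering matters this much, here is a minimal sketch using 
only java.io (not the actual Commons Compress code). It assumes a decoder that 
pulls its input one byte at a time and counts how many read calls reach the 
underlying stream with and without a BufferedInputStream in between. The names 
BufferingSketch, CallCountingStream, drain and underlyingCalls are hypothetical, 
and the 8192-byte buffer size is an arbitrary choice.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical demo class, not part of Commons Compress.
class BufferingSketch {

    /** Counts every read call that reaches the wrapped stream. */
    static final class CallCountingStream extends FilterInputStream {
        long calls;

        CallCountingStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    /** Consumes the stream one byte at a time, as a bit-level decoder might. */
    static long drain(InputStream in) throws IOException {
        long n = 0;
        while (in.read() != -1) {
            n++;
        }
        return n;
    }

    /** Returns how many read calls hit the source, with or without buffering. */
    static long underlyingCalls(byte[] data, boolean buffered) throws IOException {
        CallCountingStream source =
                new CallCountingStream(new ByteArrayInputStream(data));
        InputStream in = buffered ? new BufferedInputStream(source, 8192) : source;
        drain(in);
        return source.calls;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1 << 20];
        System.out.println("unbuffered calls: " + underlyingCalls(data, false));
        System.out.println("buffered calls:   " + underlyingCalls(data, true));
    }
}
```

With an in-memory source the per-call cost is tiny, so this only shows the 
difference in call counts; when each call goes to a file or to another 
decoding layer, that difference is where the time goes.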

> bz2 stream decompressor is 10x slower than it could be
> ------------------------------------------------------
>
>                 Key: COMPRESS-333
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-333
>             Project: Commons Compress
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>
> This is related to COMPRESS-291. In short: decompressing 7z archives was an 
> order of magnitude slower in Java than with native tooling.
> My investigation showed that the problematic archive used bz2 streams inside. 
> I then did a quick hack-experiment which took bz2 decompressor from the 
> Apache Hadoop project (the Java version, not the native one) and replaced the 
> default one used for bz2 stream decompression of the 7z archiver in commons.
> I then ran a quick benchmark on this file:
> {code}
> https://archive.org/download/stackexchange/english.stackexchange.com.7z
> {code}
> The decompression speeds are (SSD, the file was essentially fully cached in 
> memory, so everything is CPU bound):
> {code}
> native 7za: 13 seconds
> Commons (original): 222 seconds
> Commons (patched w/Hadoop bz2): 30 seconds
> Commons (patched w/BufferedInputStream): 28 seconds
> {code}
> Yes, it's still more than twice as slow as native code, but it's no longer 
> glacially slow...
> My patch is a quick and dirty proof of concept (not committable, see [1]), 
> but it passes the tests. Some notes:
> - Hadoop's stream isn't suited for handling concatenated bz2 streams; it'd 
> have to be either patched in the code or (better) decorated at a level above 
> the low-level decoder.
> - I only substituted the decompressor in 7z, but obviously other places (zip, 
> etc.) could benefit as well; essentially, I'd remove 
> BZip2CompressorInputStream entirely.
> - While I toyed around with the above idea I noticed a really annoying thing 
> -- all streams are required to extend {{CompressorInputStream}}, which only 
> adds one method to count the number of consumed bytes. This complicates the 
> code and makes plugging in other implementations of InputStreams more 
> cumbersome. I could get rid of CompressorInputStream entirely with a few 
> minor changes to the code, but obviously this would be backward incompatible 
> (see [2]).
> References:
> [1] GitHub fork, {{bzip2}} branch: 
> https://github.com/dweiss/commons-compress/tree/bzip2
> [2] Removal and cleanup of CompressorInputStream: 
> https://github.com/dweiss/commons-compress/commit/6948ed371e8ed6e6b69b96ee936d1455cbfd6458
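
The third note in the description above (byte counting tied to a required base 
class) could instead be handled by a decorator around any InputStream. The 
following is only a sketch of that idea using java.io; it is not the code from 
[2], and the class name CountingDecorator is hypothetical. It counts bytes 
returned by read calls and, as a simplification, ignores skip().

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical decorator, not part of Commons Compress.
class CountingDecorator extends FilterInputStream {
    private long bytesRead;

    CountingDecorator(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            bytesRead++;            // one byte was delivered
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            bytesRead += n;         // n bytes were delivered
        }
        return n;
    }

    /** Number of bytes handed to the caller so far. */
    long getBytesRead() {
        return bytesRead;
    }

    public static void main(String[] args) throws IOException {
        // Any InputStream can be wrapped; a byte array stands in for a decoder.
        CountingDecorator in =
                new CountingDecorator(new ByteArrayInputStream(new byte[1000]));
        byte[] buf = new byte[256];
        while (in.read(buf) != -1) {
            // discard the data; only the count matters here
        }
        System.out.println("bytes read: " + in.getBytesRead());
    }
}
```

With this shape the decoder itself can stay a plain InputStream, and counting 
is added only where a caller needs it.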



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
