Dawid Weiss created COMPRESS-333:
------------------------------------

             Summary: bz2 stream decompressor is 10x slower than it could be
                 Key: COMPRESS-333
                 URL: https://issues.apache.org/jira/browse/COMPRESS-333
             Project: Commons Compress
          Issue Type: Improvement
            Reporter: Dawid Weiss


This is related to COMPRESS-291. In short: decompressing 7z archives was an 
order of magnitude slower in Java than with native tooling.

My investigation showed that the problematic archive used bz2 streams internally. As 
a quick experiment, I took the bz2 decompressor from the Apache Hadoop project (the 
Java version, not the native one) and used it in place of the default bz2 
decompressor in the 7z archiver in Commons Compress.

I then ran a quick benchmark on this file:
{code}
https://archive.org/download/stackexchange/english.stackexchange.com.7z
{code}

The decompression times are (on an SSD, with the file essentially fully cached in 
memory, so everything is CPU-bound):
{code}
native 7za: 13 seconds
Commons (original): 222 seconds
Commons (patched w/Hadoop bz2): 30 seconds
{code}
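
A comparison of this kind can be approximated by draining each decoder through the same read loop and timing it. The sketch below is only an illustration: {{DrainTimer}} and {{drain}} are hypothetical names, the input is a stand-in byte array rather than a real bz2 or 7z decoder, and a single wall-clock pass with no JIT warm-up gives rough numbers at best.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical benchmark helper; not part of Commons Compress or Hadoop. */
final class DrainTimer {

    /** Reads the stream to its end and returns {bytesRead, elapsedNanos}. */
    static long[] drain(InputStream in) throws IOException {
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        long start = System.nanoTime();
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
        }
        return new long[] { total, System.nanoTime() - start };
    }

    public static void main(String[] args) throws IOException {
        // Stand-in input; in a real comparison this would be the decoder under test.
        InputStream in = new ByteArrayInputStream(new byte[1 << 20]);
        long[] result = drain(in);
        System.out.printf("%d bytes in %.3f ms%n", result[0], result[1] / 1e6);
    }
}
```

Passing each decompressing stream to the same {{drain}} call keeps the I/O pattern identical, so the difference in elapsed time mostly reflects the decoder itself.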

Yes, it's still more than twice as slow as native code (30 vs. 13 seconds), but 
it's no longer glacially slow.

My patch is a quick and dirty proof of concept (not committable, see [1]), but 
it passes the tests. Some notes:

- Hadoop's stream isn't suited to handling concatenated bz2 streams; it would have 
to be either patched in the code or (better) decorated at a level above the 
low-level decoder.
- I only substituted the decompressor in 7z, but other places (zip, etc.) would 
obviously benefit too; essentially, I'd remove {{BZip2CompressorInputStream}} 
entirely.
- While toying with the above idea I noticed a really annoying thing: all 
streams are required to extend {{CompressorInputStream}}, which only adds one 
method to count the number of consumed bytes. This complicates the code and 
makes plugging in other {{InputStream}} implementations more cumbersome. I 
could get rid of {{CompressorInputStream}} entirely with a few minor changes to 
the code, but obviously this would be backward incompatible (see [2]).
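
To illustrate the last point, byte counting can be layered on top of any {{InputStream}} with a plain decorator instead of a mandatory base class. This is a minimal sketch using only the JDK; {{ByteCountingInputStream}} is a hypothetical name and is not the code in [2]. What gets counted depends on where the wrapper sits: around the decoder it counts decompressed bytes, around the raw input it counts compressed bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical decorator: counts bytes read through any InputStream. */
final class ByteCountingInputStream extends FilterInputStream {
    private long bytesRead;

    ByteCountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            bytesRead++;
        }
        return b;
    }

    // read(byte[]) in FilterInputStream delegates here, so bytes are counted once.
    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        bytesRead += skipped;
        return skipped;
    }

    // mark/reset would make the count ambiguous, so this sketch disables it.
    @Override
    public boolean markSupported() {
        return false;
    }

    long getBytesRead() {
        return bytesRead;
    }

    public static void main(String[] args) throws IOException {
        ByteCountingInputStream in =
                new ByteCountingInputStream(new ByteArrayInputStream(new byte[10]));
        in.read(new byte[4]);
        System.out.println(in.getBytesRead());
    }
}
```

With a wrapper like this, a third-party decoder could be plugged in as an ordinary {{InputStream}} and still report a byte count to callers that need one.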

References:
[1] GitHub fork, {{bzip2}} branch: 
https://github.com/dweiss/commons-compress/tree/bzip2
[2] Removal and cleanup of CompressorInputStream: 
https://github.com/dweiss/commons-compress/commit/6948ed371e8ed6e6b69b96ee936d1455cbfd6458



