[jira] [Updated] (COMPRESS-438) ZipFile should create a buffered input stream for decoders inside getInputStream(ZipArchiveEntry)

Dawid Weiss (JIRA) Tue, 09 Jan 2018 13:25:30 -0800

     [ 
https://issues.apache.org/jira/browse/COMPRESS-438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dawid Weiss updated COMPRESS-438:
---------------------------------
    Description: 
When decoders read from a raw underlying stream (such as a file channel), the 
performance can degrade an order of magnitude compared to the case when there's 
a simple buffer in between the physical data source and the codec. 

See COMPRESS-380 for an example of this.

The API of {{ZipFile}} is straightforward and tempting enough that blocks of 
code such as:
{code}
    try (ZipFile zfile = new ZipFile("/path/to/zip.zip")) { 
      Enumeration<ZipArchiveEntry> entries = zfile.getEntries();
      while (entries.hasMoreElements()) {
        ZipArchiveEntry e = entries.nextElement();

        try (InputStream is = zfile.getInputStream(e)) {
          // do something with is
        }
      }
    }
{code}

seem perfectly justified. The above code suffers from severe performance 
degradation compared to prebuffered input. Severe means *severe*. Here are some 
stats from running a snippet of code similar to the above to "just decompress" 
the same input (~80mb) compressed with different methods.

{code}
# Input compressed with 7za.
7za a -mm=deflate   archive-deflate.zip   input
7za a -mm=deflate64 archive-deflate64.zip input
7za a -mm=bzip2     archive-bzip2.zip     input
{code}

Current master branch:
{code}
# Direct BoundedInputStream
archive-deflate.zip 0.39 sec., 37,893,785 archived => 81,502,941 decompressed
archive-deflate64.zip 26.98 sec., 37,856,848 archived => 81,502,941 decompressed
archive-bzip2.zip 49.16 sec., 38,166,089 archived => 81,502,941 decompressed
{code}

And a simple patch wrapping BoundedInputStream with a BufferedInputStream 
(deflate64 and bzip2 only, deflate uses java's internal inflater and it 
prebuffers stuff internally).
{code}
# wrapped with BufferedInputStream, bufferSize = 512
archive-deflate.zip 0.38 sec., 37,893,785 archived => 81,502,941 decompressed
archive-deflate64.zip 0.95 sec., 37,856,848 archived => 81,504,783 decompressed
archive-bzip2.zip 3.00 sec., 38,166,089 archived => 81,502,941 decompressed

# wrapped with BufferedInputStream, bufferSize = 1024
archive-deflate.zip 0.41 sec., 37,893,785 archived => 81,502,941 decompressed
archive-deflate64.zip 0.97 sec., 37,856,848 archived => 81,504,783 decompressed
archive-bzip2.zip 2.95 sec., 38,166,089 archived => 81,502,941 decompressed

# wrapped with BufferedInputStream, bufferSize = 4096
archive-deflate.zip 0.42 sec., 37,893,785 archived => 81,502,941 decompressed
archive-deflate64.zip 0.89 sec., 37,856,848 archived => 81,504,783 decompressed
archive-bzip2.zip 2.97 sec., 38,166,089 archived => 81,502,941 decompressed
{code}

The difference should be evident, even with a tiny buffer of 512 bytes. To put 
this into perspective on a larger archive:

{code}
archive-deflate.zip 7.68 sec., 1,235,209,977 archived => 1,339,916,520 
decompressed
archive-deflate64.zip 784.87 sec., 1,233,470,780 archived => 1,339,916,520 
decompressed
{code}

deflate64 improves by ~4900%...

{code}
archive-deflate.zip 8.12 sec., 1,235,209,977 archived => 1,339,916,520 
decompressed
archive-deflate64.zip 16.24 sec., 1,233,470,780 archived => 1,339,981,595 
decompressed
{code}

I also see that {{ExplodingInputStream}} is already wrapping {{bis}} in a 
buffered input stream, so I don't see any reason why this shouldn't be done for 
other compressor streams. An even better patch (to me) would be to modify the 
constructors of {{Deflate64CompressorInputStream}} and 
{{BZip2CompressorInputStream}} and add a boolean parameter {{unbuffered}}: then 
people would know what they're doing when they pass some input stream and 
{{true}} to such a constructor. The default, single-argument constructor would 
simply delegate to {{constructor(inputStream, false)}} to ensure an input 
buffer in between the decoder and the raw stream.

The patch is trivial, so I don't attach it?

  was:
When decoders read from a raw underlying stream (such as a file channel), the 
performance can degrade an order of magnitude compared to the case when there's 
a simple buffer in between the physical data source and the codec. 

See COMPRESS-380 for an example of this.

The API of {{ZipFile}} is straightforward and tempting enough that blocks of 
code such as:
{code}
    try (ZipFile zfile = new ZipFile("/path/to/zip.zip")) { 
      Enumeration<ZipArchiveEntry> entries = zfile.getEntries();
      while (entries.hasMoreElements()) {
        ZipArchiveEntry e = entries.nextElement();

        try (InputStream is = zfile.getInputStream(e)) {
          // do something with is
        }
      }
    }
{code}

seem perfectly justified. The above code suffers from severe performance 
degradation compared to prebuffered input. Severe means *severe*. Here are some 
stats from running a snippet of code similar to the above to "just decompress" 
the same input (~80mb) compressed with different methods.

{code}
# Input compressed with 7za.
7za a -mm=deflate   archive-deflate.zip   input
7za a -mm=deflate64 archive-deflate64.zip input
7za a -mm=bzip2     archive-bzip2.zip     input
{code}

Current master branch:
{code}
# Direct BoundedInputStream
archive-deflate.zip 0.39 sec., 37,893,785 archived => 81,502,941 decompressed
archive-deflate64.zip 26.98 sec., 37,856,848 archived => 81,502,941 decompressed
archive-bzip2.zip 49.16 sec., 38,166,089 archived => 81,502,941 decompressed
{code}

And a simple patch wrapping BoundedInputStream with a BufferedInputStream 
(deflate64 and bzip2 only, deflate uses java's internal inflater and it 
prebuffers stuff internally).
{code}
# wrapped with BufferedInputStream, bufferSize = 512
archive-deflate.zip 0.38 sec., 37,893,785 archived => 81,502,941 decompressed
archive-deflate64.zip 0.95 sec., 37,856,848 archived => 81,504,783 decompressed
archive-bzip2.zip 3.00 sec., 38,166,089 archived => 81,502,941 decompressed

# wrapped with BufferedInputStream, bufferSize = 1024
archive-deflate.zip 0.41 sec., 37,893,785 archived => 81,502,941 decompressed
archive-deflate64.zip 0.97 sec., 37,856,848 archived => 81,504,783 decompressed
archive-bzip2.zip 2.95 sec., 38,166,089 archived => 81,502,941 decompressed

# wrapped with BufferedInputStream, bufferSize = 4096
archive-deflate.zip 0.42 sec., 37,893,785 archived => 81,502,941 decompressed
archive-deflate64.zip 0.89 sec., 37,856,848 archived => 81,504,783 decompressed
archive-bzip2.zip 2.97 sec., 38,166,089 archived => 81,502,941 decompressed
{code}

The difference should be evident, even with a tiny buffer of 512 bytes. To put 
this into perspective on a larger archive:

{code}
archive-deflate.zip 7.68 sec., 1,235,209,977 archived => 1,339,916,520 
decompressed
archive-deflate64.zip 784.87 sec., 1,233,470,780 archived => 1,339,916,520 
decompressed
{code}

deflate64 improves by ~4900%...

{code}
archive-deflate.zip 8.12 sec., 1,235,209,977 archived => 1,339,916,520 
decompressed
archive-deflate64.zip 16.24 sec., 1,233,470,780 archived => 1,339,981,595 
decompressed
{code}

I also see that {{ExplodingInputStream}} is already wrapping {{bis}} in a 
buffered input stream, so I don't see any reason why this shouldn't be done for 
other compressor streams. An even better patch (to me) would be to modify the 
constructors of {{Deflate64CompressorInputStream}} and 
{{BZip2CompressorInputStream}} and add a boolean parameter {{unbuffered}}: then 
people would know what they're doing when they pass some input stream and 
{{true}} to such a constructor. The default, single-argument constructor would 
simply delegate to {{constructor(inputStream, true)}} to ensure an input buffer 
in between the decoder and the raw stream.

The patch is trivial, so I don't attach it?


> ZipFile should create a buffered input stream for decoders inside 
> getInputStream(ZipArchiveEntry)
> -------------------------------------------------------------------------------------------------
>
>                 Key: COMPRESS-438
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-438
>             Project: Commons Compress
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Priority: Minor
>         Attachments: Check.java
>
>
> When decoders read from a raw underlying stream (such as a file channel), the 
> performance can degrade an order of magnitude compared to the case when 
> there's a simple buffer in between the physical data source and the codec. 
> See COMPRESS-380 for an example of this.
> The API of {{ZipFile}} is straightforward and tempting enough that blocks of 
> code such as:
> {code}
>     try (ZipFile zfile = new ZipFile("/path/to/zip.zip")) { 
>       Enumeration<ZipArchiveEntry> entries = zfile.getEntries();
>       while (entries.hasMoreElements()) {
>         ZipArchiveEntry e = entries.nextElement();
>         try (InputStream is = zfile.getInputStream(e)) {
>           // do something with is
>         }
>       }
>     }
> {code}
> seem perfectly justified. The above code suffers from severe performance 
> degradation compared to prebuffered input. Severe means *severe*. Here are 
> some stats from running a snippet of code similar to the above to "just 
> decompress" the same input (~80mb) compressed with different methods.
> {code}
> # Input compressed with 7za.
> 7za a -mm=deflate   archive-deflate.zip   input
> 7za a -mm=deflate64 archive-deflate64.zip input
> 7za a -mm=bzip2     archive-bzip2.zip     input
> {code}
> Current master branch:
> {code}
> # Direct BoundedInputStream
> archive-deflate.zip 0.39 sec., 37,893,785 archived => 81,502,941 decompressed
> archive-deflate64.zip 26.98 sec., 37,856,848 archived => 81,502,941 
> decompressed
> archive-bzip2.zip 49.16 sec., 38,166,089 archived => 81,502,941 decompressed
> {code}
> And a simple patch wrapping BoundedInputStream with a BufferedInputStream 
> (deflate64 and bzip2 only, deflate uses java's internal inflater and it 
> prebuffers stuff internally).
> {code}
> # wrapped with BufferedInputStream, bufferSize = 512
> archive-deflate.zip 0.38 sec., 37,893,785 archived => 81,502,941 decompressed
> archive-deflate64.zip 0.95 sec., 37,856,848 archived => 81,504,783 
> decompressed
> archive-bzip2.zip 3.00 sec., 38,166,089 archived => 81,502,941 decompressed
> # wrapped with BufferedInputStream, bufferSize = 1024
> archive-deflate.zip 0.41 sec., 37,893,785 archived => 81,502,941 decompressed
> archive-deflate64.zip 0.97 sec., 37,856,848 archived => 81,504,783 
> decompressed
> archive-bzip2.zip 2.95 sec., 38,166,089 archived => 81,502,941 decompressed
> # wrapped with BufferedInputStream, bufferSize = 4096
> archive-deflate.zip 0.42 sec., 37,893,785 archived => 81,502,941 decompressed
> archive-deflate64.zip 0.89 sec., 37,856,848 archived => 81,504,783 
> decompressed
> archive-bzip2.zip 2.97 sec., 38,166,089 archived => 81,502,941 decompressed
> {code}
> The difference should be evident, even with a tiny buffer of 512 bytes. To 
> put this into perspective on a larger archive:
> {code}
> archive-deflate.zip 7.68 sec., 1,235,209,977 archived => 1,339,916,520 
> decompressed
> archive-deflate64.zip 784.87 sec., 1,233,470,780 archived => 1,339,916,520 
> decompressed
> {code}
> deflate64 improves by ~4900%...
> {code}
> archive-deflate.zip 8.12 sec., 1,235,209,977 archived => 1,339,916,520 
> decompressed
> archive-deflate64.zip 16.24 sec., 1,233,470,780 archived => 1,339,981,595 
> decompressed
> {code}
> I also see that {{ExplodingInputStream}} is already wrapping {{bis}} in a 
> buffered input stream, so I don't see any reason why this shouldn't be done 
> for other compressor streams. An even better patch (to me) would be to modify 
> the constructors of {{Deflate64CompressorInputStream}} and 
> {{BZip2CompressorInputStream}} and add a boolean parameter {{unbuffered}}: 
> then people would know what they're doing when they pass some input stream 
> and {{true}} to such a constructor. The default, single-argument constructor 
> would simply delegate to {{constructor(inputStream, false)}} to ensure an 
> input buffer in between the decoder and the raw stream.
> The patch is trivial, so I don't attach it?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Updated] (COMPRESS-438) ZipFile should create a buffered input stream for decoders inside getInputStream(ZipArchiveEntry)

Reply via email to