[jira] [Work logged] (COMPRESS-623) make ZipFile's getRawInputStream usable when local headers are not read

ASF GitHub Bot (Jira) Sat, 13 Aug 2022 00:22:04 -0700


     [ 
https://issues.apache.org/jira/browse/COMPRESS-623?focusedWorklogId=800412&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-800412
 ]


ASF GitHub Bot logged work on COMPRESS-623:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 13/Aug/22 07:21
            Start Date: 13/Aug/22 07:21
    Worklog Time Spent: 10m 
      Work Description: dweiss commented on code in PR #306:
URL: https://github.com/apache/commons-compress/pull/306#discussion_r944327197


##########
src/main/java/org/apache/commons/compress/archivers/zip/ZipFile.java:
##########
@@ -638,13 +646,11 @@ public InputStream getInputStream(final ZipArchiveEntry ze)
         }
         // cast validity is checked just above
         ZipUtil.checkRequestedFeatures(ze);
-        final long start = getDataOffset(ze);
 
         // doesn't get closed if the method is not supported - which
         // should never happen because of the checkRequestedFeatures
         // call above
-        final InputStream is =
-            new BufferedInputStream(createBoundedInputStream(start, ze.getCompressedSize())); //NOSONAR
+        final InputStream is = new BufferedInputStream(getRawInputStream(ze)); //NOSONAR
         switch (ZipMethod.getMethodByCode(ze.getMethod())) {

Review Comment:
   This replaces the duplicated code with a request for the raw compressed 
stream, which should always be available.
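
   The delegation described in this comment can be sketched in isolation. 
The following is a toy model, not the Commons Compress implementation: the 
class LazyOffsetSketch, its nested Entry type, the localHeaderReads counter 
and the drain helper are all hypothetical names introduced for illustration, 
and the archive is assumed to fit in a byte array. It only assumes the fixed 
30-byte ZIP local file header layout, with the name length at byte 26 and 
the extra-field length at byte 28.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Toy model of lazy data-offset resolution; all names here are hypothetical. */
final class LazyOffsetSketch {

    /** Stand-in for an entry known only from the central directory. */
    static final class Entry {
        final long localHeaderOffset;
        final long compressedSize;
        long dataOffset = -1; // unknown until the local header is read

        Entry(long localHeaderOffset, long compressedSize) {
            this.localHeaderOffset = localHeaderOffset;
            this.compressedSize = compressedSize;
        }
    }

    private final byte[] archive;
    int localHeaderReads = 0; // counts how often a local header was parsed

    LazyOffsetSketch(byte[] archive) {
        this.archive = archive;
    }

    /** Resolves the data offset on first use and caches it on the entry. */
    long getDataOffset(Entry e) {
        if (e.dataOffset < 0) {
            localHeaderReads++;
            ByteBuffer buf = ByteBuffer.wrap(archive).order(ByteOrder.LITTLE_ENDIAN);
            int base = (int) e.localHeaderOffset;
            // Signature validation is omitted to keep the sketch short.
            int nameLen = buf.getShort(base + 26) & 0xFFFF;
            int extraLen = buf.getShort(base + 28) & 0xFFFF;
            e.dataOffset = e.localHeaderOffset + 30 + nameLen + extraLen;
        }
        return e.dataOffset;
    }

    /** Raw (still compressed) bytes of the entry, bounded by its compressed size. */
    InputStream getRawInputStream(Entry e) {
        long start = getDataOffset(e);
        return new ByteArrayInputStream(archive, (int) start, (int) e.compressedSize);
    }

    /** Delegates to the raw stream; a real reader would wrap it in a decompressor. */
    InputStream getInputStream(Entry e) {
        return getRawInputStream(e);
    }

    /** Reads a stream fully, converting the checked exception for brevity. */
    static byte[] drain(InputStream in) {
        try {
            return in.readAllBytes();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

   In this model nothing is read from a local header when the entry is 
created; the header is parsed once, on the first request for either stream, 
and the cached offset is reused afterwards.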





Issue Time Tracking
-------------------

    Worklog Id:     (was: 800412)
    Time Spent: 1h 20m  (was: 1h 10m)

> make ZipFile's getRawInputStream usable when local headers are not read
> -----------------------------------------------------------------------
>
>                 Key: COMPRESS-623
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-623
>             Project: Commons Compress
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Priority: Minor
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> I have a somewhat odd use case with gigabytes of ZIP files, each with 
> thousands of documents (on comparatively slow network drives). We need to 
> restructure these ZIPs without recompressing the files.
> This works almost perfectly with the raw-data copying that ZipFile 
> offers, but empirical tests showed a major slowdown in the initial opening 
> of ZIP files, linked to multiple reads/seeks for local file headers. If an 
> option is passed to ignore those headers, raw streams are inaccessible.
> I've taken a look at the code, and getRawInputStream could basically do 
> the same thing that getInputStream does: lazily load the missing offset 
> via getDataOffset(ZipEntry). In fact, getInputStream could just call 
> getRawInputStream directly, which avoids some code duplication. 
> I see speedups for opening and copying random raw streams on the order of 
> 3-4x, and all the current tests pass. I filed a PR on GitHub and am happy 
> to discuss it there.
> [https://github.com/apache/commons-compress/pull/306]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
