[
https://issues.apache.org/jira/browse/COMPRESS-623?focusedWorklogId=800411&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-800411
]
ASF GitHub Bot logged work on COMPRESS-623:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 13/Aug/22 07:15
Start Date: 13/Aug/22 07:15
Worklog Time Spent: 10m
Work Description: dweiss commented on PR #306:
URL: https://github.com/apache/commons-compress/pull/306#issuecomment-1213893701
I was a bit lazy and wanted to piggyback too many changes, sorry about that.
I've removed the linked-list changes for now and left only the changes related
to getRawInputStream - I hope that makes them easier to review. I ran `mvn test
-Prun-zipit` and everything passed. I'm not sure why getRawInputStream wasn't
used inside getInputStream itself, but if you take a look at the patch,
delegating to it seems like the better design to me.
The only significant backward-incompatible change is the additional throws
IOException added to getRawInputStream. I think this is ultimately more
transparent than hiding the exception inside an unchecked exception (or
swallowing it). Most API users probably won't even notice the change, because
they already catch/handle IOException for other ZipFile methods.
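The capability discussed above can be exercised at the API level. The sketch below is an assumption-laden illustration, not part of the patch: it assumes a Commons Compress build that includes the change from PR #306 (without it, getRawInputStream returns null when local file headers are skipped). The file name and entry name are made up for the demo.

```java
import java.io.File;
import java.io.InputStream;

import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream;
import org.apache.commons.compress.archivers.zip.ZipFile;

public class RawAfterSkippedHeaders {
    public static void main(String[] args) throws Exception {
        // Build a tiny throwaway archive to read back.
        File zip = File.createTempFile("demo", ".zip");
        try (ZipArchiveOutputStream out = new ZipArchiveOutputStream(zip)) {
            ZipArchiveEntry e = new ZipArchiveEntry("a.txt");
            out.putArchiveEntry(e);
            out.write(new byte[] { 'a' });
            out.closeArchiveEntry();
        }

        // ignoreLocalFileHeader = true skips the per-entry seeks on open;
        // with this change, the data offset is resolved lazily on first use.
        try (ZipFile zf = new ZipFile(zip, "UTF-8", true, true)) {
            ZipArchiveEntry e = zf.getEntry("a.txt");
            try (InputStream raw = zf.getRawInputStream(e)) {
                System.out.println(raw != null);
            }
        }
    }
}
```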
Issue Time Tracking
-------------------
Worklog Id: (was: 800411)
Time Spent: 1h 10m (was: 1h)
> make ZipFile's getRawInputStream usable when local headers are not read
> -----------------------------------------------------------------------
>
> Key: COMPRESS-623
> URL: https://issues.apache.org/jira/browse/COMPRESS-623
> Project: Commons Compress
> Issue Type: Improvement
> Reporter: Dawid Weiss
> Priority: Minor
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> I have a somewhat odd use case with gigabytes of ZIP files, each with
> thousands of documents (on comparatively slow, network drives). We need to
> restructure these ZIPs without the need to recompress files.
> Raw-data copying with ZipFile turns out to work almost perfectly for this,
> but empirical tests showed a major slowdown in the initial opening of zip
> files, caused by the multiple reads/seeks for local file headers. If the
> option to skip those headers is passed, raw streams become inaccessible.
> I've taken a look at the code and the code in getRawInputStream could
> basically do the same thing that getInputStream does - lazily load the
> missing offset via getDataOffset(ZipEntry). In fact, getInputStream could
> just call getRawInputStream directly, which avoids some code duplication.
> I also restructured the code that maintains a LinkedList for duplicated
> names a bit. I think this code could be removed altogether and replaced with
> a pointer chain directly on Entry class instances (the map would then point
> at the first entry; duplicated entries are an oddity and rare in the wild).
> I see speedups on the order of 3-4x for opening and copying random raw
> streams, and all the current tests pass. I've filed a PR on GitHub - happy
> to discuss it there.
> https://github.com/apache/commons-compress/pull/306
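The restructuring-without-recompression use case described in the issue can be sketched with the existing public API (addRawArchiveEntry plus getRawInputStream). This is a minimal illustration with made-up file and entry names, not code from the PR:

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;

import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream;
import org.apache.commons.compress.archivers.zip.ZipFile;

public class RawCopy {
    public static void main(String[] args) throws Exception {
        File src = File.createTempFile("src", ".zip");
        File dst = File.createTempFile("dst", ".zip");

        // Build a small source archive to copy from.
        try (ZipArchiveOutputStream out = new ZipArchiveOutputStream(src)) {
            ZipArchiveEntry e = new ZipArchiveEntry("hello.txt");
            out.putArchiveEntry(e);
            out.write("hello".getBytes(StandardCharsets.UTF_8));
            out.closeArchiveEntry();
        }

        // Restructure without recompressing: stream the raw (still-compressed)
        // entry bytes straight into the new archive.
        try (ZipFile in = new ZipFile(src);
             ZipArchiveOutputStream out = new ZipArchiveOutputStream(dst)) {
            Enumeration<ZipArchiveEntry> entries = in.getEntries();
            while (entries.hasMoreElements()) {
                ZipArchiveEntry entry = entries.nextElement();
                out.addRawArchiveEntry(entry, in.getRawInputStream(entry));
            }
        }

        try (ZipFile copy = new ZipFile(dst)) {
            System.out.println(copy.getEntry("hello.txt") != null);
        }
    }
}
```

Because the entry metadata (method, CRC, sizes) travels with the raw stream, the copy is byte-identical per entry while the archive itself can be reordered or split.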
--
This message was sent by Atlassian Jira
(v8.20.10#820010)