[ 
https://issues.apache.org/jira/browse/COMPRESS-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated COMPRESS-623:
---------------------------------
    Summary: make ZipFile's getRawInputStream usable when local headers are not 
read  (was: Speed up ZipFile's raw stream copying by not seeking for local 
headers until they're needed)

> make ZipFile's getRawInputStream usable when local headers are not read
> -----------------------------------------------------------------------
>
>                 Key: COMPRESS-623
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-623
>             Project: Commons Compress
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Priority: Minor
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> I have a somewhat unusual use case: gigabytes of ZIP files, each holding 
> thousands of documents, stored on comparatively slow network drives. We need 
> to restructure these ZIPs without recompressing the files.
> The raw-data copying that ZipFile offers handles this almost perfectly, but 
> empirical tests showed a major slowdown when initially opening the ZIP 
> files, caused by the many reads/seeks for local file headers. If the option 
> to ignore those headers is passed, raw streams become inaccessible.
> I've looked at the code, and getRawInputStream could do essentially what 
> getInputStream already does: lazily load the missing offset via 
> getDataOffset(ZipEntry). In fact, getInputStream could simply call 
> getRawInputStream directly, which avoids some code duplication.
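
The lazy offset lookup described above can be illustrated with a small,
JDK-only sketch. It is not the Commons Compress implementation: SketchEntry,
LazyOffsetSketch and headerReads are hypothetical names, the archive is
assumed to fit in a byte array, and ZIP64 and encryption are ignored. The
only format facts it relies on are that a local file header has a 30-byte
fixed part, with the file name length at byte 26 and the extra field length
at byte 28 (both little-endian, 16 bits).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical stand-in for a ZIP entry; not the Commons Compress class. */
class SketchEntry {
    static final long UNKNOWN = -1L;
    final long localHeaderOffset; // known from the central directory
    final long compressedSize;    // known from the central directory
    long dataOffset = UNKNOWN;    // resolved only when first needed

    SketchEntry(long localHeaderOffset, long compressedSize) {
        this.localHeaderOffset = localHeaderOffset;
        this.compressedSize = compressedSize;
    }
}

/** Illustrative in-memory archive that resolves data offsets on demand. */
class LazyOffsetSketch {
    static final int LFH_FIXED_SIZE = 30;    // fixed part of a local file header
    static final int LFH_NAME_LEN_POS = 26;  // 2-byte file name length
    static final int LFH_EXTRA_LEN_POS = 28; // 2-byte extra field length

    private final byte[] bytes;
    private final ByteBuffer archive;
    int headerReads = 0; // counts local header lookups, for demonstration only

    LazyOffsetSketch(byte[] bytes) {
        this.bytes = bytes;
        this.archive = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    }

    /** Reads the local header once per entry and caches the result. */
    long getDataOffset(SketchEntry e) {
        if (e.dataOffset == SketchEntry.UNKNOWN) {
            int base = (int) e.localHeaderOffset;
            int nameLen = archive.getShort(base + LFH_NAME_LEN_POS) & 0xFFFF;
            int extraLen = archive.getShort(base + LFH_EXTRA_LEN_POS) & 0xFFFF;
            e.dataOffset = e.localHeaderOffset + LFH_FIXED_SIZE + nameLen + extraLen;
            headerReads++;
        }
        return e.dataOffset;
    }

    /** Returns the still-compressed bytes; works even if no header was read up front. */
    byte[] getRawBytes(SketchEntry e) {
        int start = (int) getDataOffset(e);
        byte[] raw = new byte[(int) e.compressedSize];
        System.arraycopy(bytes, start, raw, 0, raw.length);
        return raw;
    }
}
```

In this sketch, opening the archive touches no local headers at all; each
entry pays for one header lookup the first time its raw bytes are requested,
and nothing afterwards.
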
> I also restructured the code that maintains the LinkedList for duplicated 
> names a bit. I think this part could be removed altogether and replaced with 
> a pointer-chain list directly on Entry class instances (the map would then 
> point at the first entry; duplicated entries are an oddity and infrequent in 
> the wild).
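
A minimal sketch of the pointer-chain idea, assuming nothing about the actual
PR: ChainedEntry, nextSameName and NameIndexSketch are hypothetical names
chosen for illustration. The map holds only the first entry for each name,
and the rare duplicates hang off that entry through a link field, so the
common case of unique names allocates no per-name list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical entry type with an intrusive link to the next same-named entry. */
class ChainedEntry {
    final String name;
    final long localHeaderOffset;
    ChainedEntry nextSameName; // stays null when the name is unique

    ChainedEntry(String name, long localHeaderOffset) {
        this.name = name;
        this.localHeaderOffset = localHeaderOffset;
    }
}

/** Illustrative name index: the map points at the first entry of each chain. */
class NameIndexSketch {
    private final Map<String, ChainedEntry> firstByName = new HashMap<>();

    void add(ChainedEntry e) {
        ChainedEntry head = firstByName.putIfAbsent(e.name, e);
        if (head != null) {
            // Duplicate name: append at the tail to keep archive order.
            ChainedEntry tail = head;
            while (tail.nextSameName != null) {
                tail = tail.nextSameName;
            }
            tail.nextSameName = e;
        }
    }

    /** First entry with the given name, or null if there is none. */
    ChainedEntry getEntry(String name) {
        return firstByName.get(name);
    }

    /** All entries with the given name, in insertion order. */
    List<ChainedEntry> getEntries(String name) {
        List<ChainedEntry> result = new ArrayList<>();
        for (ChainedEntry e = firstByName.get(name); e != null; e = e.nextSameName) {
            result.add(e);
        }
        return result;
    }
}
```

Appending walks the chain, which is linear in the number of duplicates of one
name; that seems an acceptable trade-off only because such duplicates are
described above as infrequent.
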
> I see speedups on the order of 3-4x for opening and copying random raw 
> streams, and all the current tests pass. I filed a PR on GitHub and am happy 
> to discuss it there.
> https://github.com/apache/commons-compress/pull/306



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
