[ 
https://issues.apache.org/jira/browse/COMPRESS-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113466#comment-17113466
 ] 

A Kelday edited comment on COMPRESS-514 at 5/21/20, 6:54 PM:
-------------------------------------------------------------

After digging in a bit more this takes me back to the same CRC problem as 
before, but with some new info after looking at the 7zip source.

It looks like 7zip does nearly the same as the current Commons Compress: it 
reads the whole header buffer into RAM and checks the CRC before parsing. The 
difference is that the size there is an unsigned int, so the maximum is 4GiB 
(anything above that is unsupported). Indeed, 7zip uses over 5GiB of RAM simply 
to show the file list of this 1.2TB archive.
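
To make the size limits concrete, here is a small sketch of the arithmetic. The 
unpack size is the one from the exception in the issue description; the class 
and method names are made up for illustration and are not part of Commons 
Compress.

```java
// Hypothetical helper, only to illustrate the limits discussed above.
final class HeaderSizeLimits {
    // Unpack size reported in the "Cannot handle unpackSize..." exception.
    static final long UNPACK_SIZE = 2_416_988_886L;
    // Largest value an unsigned 32-bit integer can hold (4GiB - 1).
    static final long UNSIGNED_INT_MAX = 0xFFFF_FFFFL;

    // True if the value can index a single Java array or ByteBuffer.
    static boolean fitsSignedInt(long value) {
        return value >= 0 && value <= Integer.MAX_VALUE;
    }

    // True if the value fits the unsigned int limit described for 7zip.
    static boolean fitsUnsignedInt(long value) {
        return value >= 0 && value <= UNSIGNED_INT_MAX;
    }

    public static void main(String[] args) {
        System.out.println(fitsSignedInt(UNPACK_SIZE));   // false
        System.out.println(fitsUnsignedInt(UNPACK_SIZE)); // true
    }
}
```

So this header is too big for a single Java byte array, but still under the 
4GiB limit on the 7zip side.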

That leads to at least three options:
 # 7zip method: read everything into RAM (with multiple buffers, up to 4GiB) 
for CRC and parse
 # Read the header twice if necessary: once streamed for CRC, then again using 
a small buffer to parse. If the header fits in our small buffer entirely, no 
extra read is required.
 # Read/parse the header and compute the CRC at the same time (bad because you 
don't find out the data is wrong until after it has been parsed)
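
As a rough sketch of the first pass in option 2, assuming the decoded header is 
available as a blocking ReadableByteChannel and that the checksum is the 
standard CRC-32 from java.util.zip: the class and method names here are 
hypothetical, not existing Commons Compress API.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.util.zip.CRC32;

// Hypothetical helper: first pass of "read the header twice".
final class StreamedHeaderCrc {
    // Computes the CRC-32 of the next `length` bytes of `channel` using a
    // fixed-size buffer, so memory use does not grow with the header size.
    // Assumes a blocking channel (read() returns at least one byte or EOF).
    static long crc32Of(ReadableByteChannel channel, long length, int bufferSize)
            throws IOException {
        CRC32 crc = new CRC32();
        ByteBuffer buf = ByteBuffer.allocate(bufferSize);
        long remaining = length;
        while (remaining > 0) {
            buf.clear();
            if (remaining < buf.capacity()) {
                // Never read past the end of the header.
                buf.limit((int) remaining);
            }
            int n = channel.read(buf);
            if (n < 0) {
                throw new EOFException("header truncated, " + remaining + " bytes missing");
            }
            buf.flip();
            crc.update(buf);
            remaining -= n;
        }
        return crc.getValue();
    }
}
```

The caller would compare the result with the stored CRC and only then rewind 
and parse, again through a small buffer. The cost is the second read (and 
second decode) of the header whenever it doesn't fit in the buffer.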

It would be great to have some opinions here, because this is more than I'd 
hoped it would take to fix. There's always the choice to just not support 
headers over 2GiB...


> SevenZFile fails with encoded header over 2GiB
> ----------------------------------------------
>
>                 Key: COMPRESS-514
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-514
>             Project: Commons Compress
>          Issue Type: Bug
>          Components: Archivers
>    Affects Versions: 1.20
>            Reporter: A Kelday
>            Priority: Minor
>         Attachments: HeaderChannelBuffer.java
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When reading what some may call a large encrypted 7zip file (1.2TB with 22 
> million files), the read fails at the header stage with the trace below. Is 
> this within the spec? I've written some code to handle it, because I did 
> actually need to extract the file in Java. If that's of any use I can provide 
> it (it's a naive wrapper that just pages in one buffer at a time).
>  
> {code:java}
> Exception in thread "main" java.io.IOException: Cannot handle unpackSize2416988886
> at org.apache.commons.compress.archivers.sevenz.SevenZFile.assertFitsIntoInt(SevenZFile.java:1523)
> at org.apache.commons.compress.archivers.sevenz.SevenZFile.readEncodedHeader(SevenZFile.java:622)
> at org.apache.commons.compress.archivers.sevenz.SevenZFile.initializeArchive(SevenZFile.java:532)
> at org.apache.commons.compress.archivers.sevenz.SevenZFile.readHeaders(SevenZFile.java:468)
> at org.apache.commons.compress.archivers.sevenz.SevenZFile.<init>(SevenZFile.java:337)
> at org.apache.commons.compress.archivers.sevenz.SevenZFile.<init>(SevenZFile.java:129)
> at org.apache.commons.compress.archivers.sevenz.SevenZFile.<init>(SevenZFile.java:116)
> {code}
> 7zip itself can open it (and display, extract, etc.). Here are the stats:
>  
> {code:java}
> Size: 2 489 903 580 875
> Packed Size: 1 349 110 308 832
> Folders: 40 005
> Files: 22 073 957
> CRC: E26F6A96
> {code}


