[
https://issues.apache.org/jira/browse/COMPRESS-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322748#comment-17322748
]
Gaël Lalire commented on COMPRESS-574:
--------------------------------------
I will try to be more concrete.
Say you have a ZIP Z containing 3 files (A, B, C), and you want to let the user
choose which ones they want.
If a user requests download?zip_name=myzip&file_names=A,C
then your server has to download Z, uncompress only A and C, and create a new
ZIP containing only A and C.
Uncompressing and compressing again costs disk space and CPU time.
With my solution you don't store Z at all; instead you store A, B and C and
possibly A_DEFLATED, B_DEFLATED and C_DEFLATED (if it is worth it).
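
One possible way to decide whether a deflated copy is "worth it" is to deflate
the file once at upload time and keep the result only if it is smaller. The
sketch below is an assumption about how that check could look, not the code
attached to the issue; the class and method names are made up. It uses raw
deflate (no zlib header), which is the form ZIP entries use.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

final class DeflateWorthCheck {

    /** Compresses input as a raw deflate stream (nowrap = true, as used inside ZIP). */
    static byte[] rawDeflate(byte[] input) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native memory
        }
    }

    /** Hypothetical policy: keep X_DEFLATED only when it is smaller than X. */
    static boolean worthStoringDeflated(byte[] input) {
        return rawDeflate(input).length < input.length;
    }
}
```

A real policy might require a minimum saving (say a few percent) rather than
any saving at all; that threshold is a design choice.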
So if a user requests download?zip_name=myzip&file_names=A,C
then your server streams the ZIP, so only A (or A_DEFLATED) and C (or
C_DEFLATED) are fetched, not the content of B, and the fetched content does not
need any buffer or disk space because it is streamed.
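
To illustrate the streaming part only, here is a minimal sketch using the JDK's
java.util.zip.ZipOutputStream with STORED entries. It is not the attached
DynamicZip class. The StoredFile type and its content supplier are hypothetical
stand-ins for the metadata row and the S3 fetch. Because the CRC32 and size are
known in advance, each entry can be written straight through without buffering.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;
import java.util.function.Supplier;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class FilteredZipStreamer {

    /** Hypothetical metadata row: name, CRC32 and size are known before streaming. */
    static final class StoredFile {
        final String name;
        final long crc32;
        final long size;
        final Supplier<InputStream> content; // opens the content lazily, e.g. a remote GET

        StoredFile(String name, long crc32, long size, Supplier<InputStream> content) {
            this.name = name;
            this.crc32 = crc32;
            this.size = size;
            this.content = content;
        }
    }

    /** Writes a ZIP with only the selected files; unselected files are never opened. */
    static void stream(List<StoredFile> selected, OutputStream out) throws IOException {
        ZipOutputStream zip = new ZipOutputStream(out);
        for (StoredFile f : selected) {
            ZipEntry entry = new ZipEntry(f.name);
            entry.setMethod(ZipEntry.STORED);  // no compression, so sizes are known up front
            entry.setSize(f.size);
            entry.setCompressedSize(f.size);
            entry.setCrc(f.crc32);             // must match the bytes written, or closeEntry fails
            zip.putNextEntry(entry);
            try (InputStream in = f.content.get()) {
                in.transferTo(zip);            // copy through without holding the whole file
            }
            zip.closeEntry();
        }
        zip.finish(); // writes the central directory but leaves 'out' open for the caller
    }
}
```

This sketch cannot reuse a pre-deflated copy such as A_DEFLATED, because the
JDK class always compresses DEFLATED entries itself. It also cannot start in
the middle of the archive, which is the gap the byte range support is meant to
fill.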
The byte range part covers the case where the user does a
curl -r 200-500 'download?zip_name=myzip&file_names=A,C'
If bytes 200-500 (inclusive, so 301 bytes) are the last 10 bytes of the local
header of C and the first 291 bytes of C, then only the content of C is fetched;
the content of A is not needed.
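
The range lookup can be sketched as follows. This is an illustration under
simplifying assumptions, not the attached implementation: no extra fields, no
comments, no data descriptors and no Zip64, so a local header is 30 bytes plus
the file name, a central directory header is 46 bytes plus the file name, and
the end-of-central-directory record is 22 bytes. ZipRangePlanner and Entry are
hypothetical names; dataSize is the size as written in the ZIP (the deflated
size for a deflated entry).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class ZipRangePlanner {
    static final int LOCAL_HEADER_FIXED = 30;   // fixed part of a local file header
    static final int CENTRAL_HEADER_FIXED = 46; // fixed part of a central directory header
    static final int END_OF_CENTRAL_DIR = 22;   // end-of-central-directory record, no comment

    static final class Entry {
        final String name;
        final long dataSize; // bytes of entry data as they appear inside the ZIP

        Entry(String name, long dataSize) {
            this.name = name;
            this.dataSize = dataSize;
        }

        int nameLength() {
            return name.getBytes(StandardCharsets.UTF_8).length;
        }
    }

    /** Total archive size, computable from metadata alone (useful for Content-Length). */
    static long totalSize(List<Entry> entries) {
        long total = END_OF_CENTRAL_DIR;
        for (Entry e : entries) {
            total += LOCAL_HEADER_FIXED + e.nameLength() + e.dataSize
                    + CENTRAL_HEADER_FIXED + e.nameLength();
        }
        return total;
    }

    /** Names of entries whose data overlaps the inclusive byte range [first, last]. */
    static List<String> entriesToFetch(List<Entry> entries, long first, long last) {
        List<String> needed = new ArrayList<>();
        long offset = 0;
        for (Entry e : entries) {
            long dataStart = offset + LOCAL_HEADER_FIXED + e.nameLength();
            long dataEnd = dataStart + e.dataSize; // exclusive
            if (e.dataSize > 0 && first < dataEnd && last >= dataStart) {
                needed.add(e.name);
            }
            offset = dataEnd;
        }
        return needed;
    }
}
```

With A holding 148 bytes of data, A occupies bytes 0-178 and the local header
of C occupies bytes 179-209, so a request for bytes 200-500 touches 10 header
bytes and the first 291 data bytes of C; entriesToFetch then returns only C.
Headers and the central directory are generated from metadata, so they never
require fetching file content.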
In my case A, B and C are stored in Amazon S3 and I store the file metadata
(name, CRC32, size, deflated size) in an Oracle DB.
Here I used a file_names filter, but the filter can be more complex: you can
have a user permission check, a type filter, ...
The problems were too much I/O spent creating the filtered ZIP and sometimes
not enough disk space when there were too many simultaneous users.
> Byte range support in archive creation
> --------------------------------------
>
> Key: COMPRESS-574
> URL: https://issues.apache.org/jira/browse/COMPRESS-574
> Project: Commons Compress
> Issue Type: Improvement
> Components: Archivers
> Reporter: Gaël Lalire
> Priority: Minor
> Attachments: DynamicZip.java, DynamicZipTest.java
>
>
> When you have a ZIP which contains _N_ components and you want to let the
> user choose which components they need, you need to create _2^N - 1_ ZIPs
> (one per non-empty subset of components).
> So the idea is to store each component once (or twice if you want both a
> deflated and a stored version), and create the ZIP on the fly.
> For the moment you can stream with a ZipOutputStream, but if you need an
> InputStream things get a lot harder. I guess programs write the ZIP to a
> file system and read it back afterwards, so it is not really streaming
> anymore.
> Also, ZipOutputStream will never allow you to resume from a byte range; you
> need to generate all the previous data.
> So I made a class to do that. I think such functionality has its place in
> Commons Compress.
> You can see my code attached and adapt it for better integration / other
> archive type support, or simply use it for inspiration.
>