Hi,

Entries in a zip file are compressed individually, so the dictionary is not 
kept or reused. See also https://en.wikipedia.org/wiki/Zip_(file_format) 
"Because the files in a .ZIP archive are compressed individually…"

By the way, for Deflate, the sliding window size (dictionary size) is 32 KB. 
That's quite small compared to, for example, LZMA (7z files), where it can be 
up to 4 GB (I think 64 MB by default nowadays). So even reusing the dictionary 
wouldn't help all that much for large files.
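
Just to illustrate the window limit (the zip container has no standard way to 
ship a preset dictionary, so this is Deflater used directly, not something 
FileVault could use as-is): you can hand Deflater a preset dictionary, but 
only its last 32 KB can ever be referenced.

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class PresetDictionarySketch {

    // Compresses 'data' using a preset dictionary. Because the Deflate window
    // is 32 KB, only the trailing 32 KB of 'dictionary' is actually usable.
    static byte[] deflateWithDictionary(byte[] data, byte[] dictionary) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        deflater.setDictionary(dictionary);
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }
}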

> if you have a lot of small text files, interleaved with binaries, then the 
> text files are probably not compressed

I would assume it's the reverse: text files are compressed, even if the 
binaries can't be compressed.

Regards,
Thomas



From: Tobias Bocanegra <tri...@apache.org>
Reply-To: "dev@jackrabbit.apache.org" <dev@jackrabbit.apache.org>
Date: Thursday, 9 March 2017 at 14:49
To: "dev@jackrabbit.apache.org" <dev@jackrabbit.apache.org>
Subject: Re: [FileVault][discuss] performance improvement proposal

Hi,

one issue to remember is that you can only change the compression level per 
zip entry. I didn't test too much, but the javadoc says:

public void setLevel(int level)
Sets the compression level for subsequent entries which are DEFLATED. The 
default setting is DEFAULT_COMPRESSION.

I'm not exactly sure whether zip retains the dictionary if you switch compression 
levels, but I would assume not. I.e. if you have a lot of small text files 
interleaved with binaries, then the text files are probably not compressed, 
which might not be a problem, though.
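
for what it's worth, switching the level per entry is straightforward with 
ZipOutputStream. untested sketch; looksCompressible() is just a hypothetical 
placeholder for whatever detection we end up using:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class InterleavedZipSketch {

    // Hypothetical helper: decide per file whether compressing is worthwhile.
    static boolean looksCompressible(Path file) {
        String name = file.getFileName().toString();
        return name.endsWith(".xml") || name.endsWith(".json") || name.endsWith(".txt");
    }

    static void pack(Iterable<Path> files, Path zipFile) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(zipFile))) {
            for (Path file : files) {
                // setLevel() only affects entries started after the call,
                // so choose the level before putNextEntry().
                zos.setLevel(looksCompressible(file)
                        ? Deflater.DEFAULT_COMPRESSION
                        : Deflater.NO_COMPRESSION);
                // Flat entry names, just to keep the sketch short.
                zos.putNextEntry(new ZipEntry(file.getFileName().toString()));
                Files.copy(file, zos);
                zos.closeEntry();
            }
        }
    }
}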

it would be interesting to see some tests that take a typical asset content 
package that has many text files (.content.xml) and few compressed binaries 
(jpegs).

- what is the size difference of the final binary with no compression at all?
- what is the size difference of the final binary with interleaved compression?
- what are the performance characteristics to unpack/pack the zips?
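
a tiny harness along these lines could answer the size and timing questions; 
the two Packer implementations (no compression at all vs. interleaved 
compression) are hypothetical and would have to be plugged in:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PackComparisonSketch {

    @FunctionalInterface
    interface Packer {
        void pack(List<Path> files, Path zipFile) throws IOException;
    }

    // Packs the same file set with each strategy and prints archive size and time.
    static void compare(Path packageRoot, Packer noCompression, Packer interleaved)
            throws IOException {
        List<Path> files;
        try (Stream<Path> stream = Files.walk(packageRoot)) {
            files = stream.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        measure("no compression", noCompression, files);
        measure("interleaved", interleaved, files);
    }

    static void measure(String label, Packer packer, List<Path> files) throws IOException {
        Path zip = Files.createTempFile("pack-test", ".zip");
        long start = System.nanoTime();
        packer.pack(files, zip);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(label + ": " + Files.size(zip) + " bytes, " + millis + " ms");
    }
}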

regards, toby






On Thu, Mar 9, 2017 at 8:10 PM, Thomas Mueller <muel...@adobe.com> wrote:
Hi,

> I think your help is mandatory, given the level of voodoo in the five lines 
> you propose :-)

Sure, I can help.

> I did some preliminary tests with the "partial entropy" method … and it seems 
> the algorithm works but it does not get as fast as the content type detection 
> method.

Note that you only need to test about 256 bytes, not the whole binary. Sure, the 
more data you test, the better.
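
It doesn't have to be exactly what you implemented, but as a sketch: a Shannon 
entropy estimate over the first 256 bytes is cheap to compute. The 7.0 bits per 
byte threshold below is only a hypothetical starting point that would need tuning:

public class EntropySketch {

    // Estimates the Shannon entropy (bits per byte) of the first 'limit' bytes.
    // Values close to 8 suggest already-compressed or encrypted data.
    static double entropy(byte[] data, int limit) {
        int n = Math.min(data.length, limit);
        if (n == 0) {
            return 0;
        }
        int[] counts = new int[256];
        for (int i = 0; i < n; i++) {
            counts[data[i] & 0xff]++;
        }
        double bits = 0;
        for (int count : counts) {
            if (count > 0) {
                double p = (double) count / n;
                bits -= p * (Math.log(p) / Math.log(2));
            }
        }
        return bits;
    }

    // Hypothetical threshold: skip compression when the sample looks close to random.
    static boolean looksIncompressible(byte[] sample) {
        return entropy(sample, 256) > 7.0;
    }
}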

> Maybe ultimately we could keep both heuristics.

I agree. But not to speed things up: to avoid false positives / negatives. 
Auto-detection is far from perfect.

> Start with the content type detection that would match against MIME types we 
> know for sure are compressed (expected to be a reasonably fixed and short 
> list of MIME types).

I would probably use the following logic:

* a list of mime types that should be compressed (text/plain and so on)
* a list of mime types that should not be compressed (application/zip, 
application/java-archive, and so on)

For the remainder, and if you don't know the mime type, I would use 
auto-detection.
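
As a sketch of that decision logic (the two mime type sets are just hypothetical 
examples, and the fallback reuses the entropy sketch from above):

import java.util.Set;

public class CompressionDecisionSketch {

    // Hypothetical example lists; the real ones would need to be agreed on.
    private static final Set<String> COMPRESS = Set.of(
            "text/plain", "text/html", "text/xml", "application/json");
    private static final Set<String> DO_NOT_COMPRESS = Set.of(
            "application/zip", "application/java-archive",
            "image/jpeg", "image/png", "video/mp4");

    // mimeType may be null if unknown; 'sample' is the first few hundred bytes.
    static boolean shouldCompress(String mimeType, byte[] sample) {
        if (mimeType != null) {
            if (COMPRESS.contains(mimeType)) {
                return true;
            }
            if (DO_NOT_COMPRESS.contains(mimeType)) {
                return false;
            }
        }
        // Fall back to auto-detection, e.g. the entropy estimate sketched earlier.
        return !EntropySketch.looksIncompressible(sample);
    }
}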

Regards,
Thomas

