http://issues.apache.org/bugzilla/show_bug.cgi?id=31930
Zip & Unzip tasks major slowdown

------- Additional Comments From [EMAIL PROTECTED] 2004-11-08 14:29 -------

We use the zip code too (only the expanding/reading part, so I can only speak for that), and we have also noticed how slow it is. You might also want to change your test script so that it reports two times separately:

- The time used to create the ZipFile instance.
- The time used to iterate over the files and directories and expand them.

If you do that, you will see that Ant's ZipFile implementation needs a long time just to open a zip. Why will become clear below.

I am currently reviewing and changing the source code (unfortunately for Ant, this is going to be Java 1.4 code), but I can outline some of the flaws here:

- Zip files use Intel byte order, also known as little endian. The sources do the conversion correctly but speak of "big endian" throughout. This is a cosmetic bug, but wrong documentation leads to wrong derived work.

- The ZipShort/ZipLong classes should have static helper methods that return primitive values. Instantiating an object just for the sake of calling getValue() doesn't help performance. (See the first sketch after this comment.)

- When a zip is opened, the central directory is read and parsed entry by entry. That should be fine; it might be beneficial to read the whole directory into memory in one go, but modern filesystems and caching make the gain small (it depends on the directory size). The only thing skipped is the extra information, and I have no idea why. Wouldn't it be better to parse the extra data, or at least keep the raw bytes for later on-demand parsing? (No real flaws here ;)

- After that, local header information is gathered for each entry: the starting offset of the compressed data, plus the extra information that was skipped before. Now, (a) this is only necessary if decompression or the extra data is actually requested, (b) it causes a lot of stress for the filesystem because large-scale seeking throughout the zip file is required, and (c) the entries are iterated NOT in order of increasing file offset but effectively at random, because the method iterates over the values collection of the hashtable. That really bogs down performance for uncached files.

My ideas are:

- Add static methods to the zip datatype classes.
- Parse extra data only when requested.
- Read the local header only when necessary, i.e. for extra data or decompression. (See the second sketch below.)
- If the headers must be read eagerly, then at least read them in increasing offset order. (See the third sketch below.)

I implemented lazy header reading, and it sped up opening a file over a "slow" network connection from 18 seconds to 2. Ordered header reading alone brought it down to 12 seconds. Decompressing to /dev/null isn't all that much slower than the java.util.zip implementation, whether or not lazy local headers are used.

I also didn't grasp the idea behind the two ZipFile tables, "entries" and "dataOffsets"; I simply store both offsets in the ZipEntry instance (the lazy-entry sketch below does the same).
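To illustrate the static-helper idea from the first bullet, here is a minimal sketch in Java. The class and method names (ZipUtil, getShortValue, getLongValue) are mine for illustration, not Ant's actual API; the point is just that little-endian fields can be decoded straight from a byte array without allocating a ZipShort or ZipLong per field read.

// Hypothetical helper (not Ant's actual API): decode little-endian
// fields straight from a byte array, with no object allocation
// per field read.
final class ZipUtil {
    private ZipUtil() {}

    // Unsigned 16-bit little-endian value.
    static int getShortValue(byte[] b, int off) {
        return (b[off] & 0xFF) | ((b[off + 1] & 0xFF) << 8);
    }

    // Unsigned 32-bit little-endian value.
    static long getLongValue(byte[] b, int off) {
        return (b[off] & 0xFFL)
             | ((b[off + 1] & 0xFFL) << 8)
             | ((b[off + 2] & 0xFFL) << 16)
             | ((b[off + 3] & 0xFFL) << 24);
    }
}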
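For the lazy local-header idea, a rough sketch of what I mean, again with hypothetical names (LazyEntry, plus ZipUtil from the sketch above). The central directory already gives us the local header offset; the local header itself, and therefore the data offset and the local extra field, is only read on first access. This also shows what I mean by storing the offsets in the entry itself rather than in separate tables:

// Hypothetical sketch: an entry that remembers where its local
// header lives and parses it lazily, on first use.
class LazyEntry {
    private final long localHeaderOffset; // from the central directory
    private long dataOffset = -1;         // -1 = local header not read yet
    private byte[] localExtra;            // kept for on-demand parsing

    LazyEntry(long localHeaderOffset) {
        this.localHeaderOffset = localHeaderOffset;
    }

    long getLocalHeaderOffset() {
        return localHeaderOffset;
    }

    // Parses the local file header on first use only.
    long getDataOffset(java.io.RandomAccessFile archive)
            throws java.io.IOException {
        if (dataOffset < 0) {
            readLocalHeader(archive);
        }
        return dataOffset;
    }

    private void readLocalHeader(java.io.RandomAccessFile archive)
            throws java.io.IOException {
        // The fixed-size part of the local file header is 30 bytes;
        // the name length sits at offset 26, the extra length at 28.
        byte[] header = new byte[30];
        archive.seek(localHeaderOffset);
        archive.readFully(header);
        int nameLen  = ZipUtil.getShortValue(header, 26);
        int extraLen = ZipUtil.getShortValue(header, 28);
        archive.seek(localHeaderOffset + 30 + nameLen);
        localExtra = new byte[extraLen];
        archive.readFully(localExtra);
        dataOffset = localHeaderOffset + 30 + nameLen + extraLen;
    }
}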
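And for ordered header reading, a method sketch that iterates in increasing offset order instead of the hashtable's value order. It assumes an "entries" Hashtable holding the LazyEntry objects above and an open RandomAccessFile; Java 1.4 style, hence the raw collections and casts:

// Sketch: read every local header eagerly, but in increasing file
// offset order so the seeks only ever run forward through the file.
static void readAllLocalHeaders(java.util.Hashtable entries,
                                java.io.RandomAccessFile archive)
        throws java.io.IOException {
    java.util.List ordered = new java.util.ArrayList(entries.values());
    java.util.Collections.sort(ordered, new java.util.Comparator() {
        public int compare(Object a, Object b) {
            long d = ((LazyEntry) a).getLocalHeaderOffset()
                   - ((LazyEntry) b).getLocalHeaderOffset();
            return d < 0 ? -1 : (d > 0 ? 1 : 0);
        }
    });
    for (java.util.Iterator it = ordered.iterator(); it.hasNext();) {
        ((LazyEntry) it.next()).getDataOffset(archive);
    }
}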