I have a question that may just be one of nomenclature, ...
-- replying below to --
From: Peter Kelly [mailto:[email protected]]
Sent: Thursday, January 1, 2015 02:58
To: [email protected]
Subject: DFStorage
I realise that I haven’t done a very good job of documenting the code in
Corinthia, as you’ve probably noticed :) I’ve been meaning to get around to
this for a while now.
[ ... ]
Now the current implementation, which is a very simplistic one, simply
reads the whole zip file into memory. This is largely due to a limitation in
the minizip API, which enforces sequential access to the entries in a file. It
would be conceivable to have the zip DFStorage implementation first read a
directory listing, and then for each file that’s requested, do a linear scan
through the entries until it finds the requested file, and then read
that. This would be an O(n) operation per request, but would be unlikely to be
a major problem since most zip packages we’re dealing with will only have a
fairly small number of entries.
Minizip does not provide any way to cache the location in the zip file of a
particular entry, even though this information would be possible to obtain in
theory (just not through minizip’s API). If I were writing a zip implementation
from scratch (and maybe this is something we could consider), I would have it
read a list of all entries and remember their locations in a hash table, so
that when a particular named entry is requested, we can go directly to that
point in the file without having to do a linear scan.
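
As a rough illustration of the hash-table idea described above, here is a minimal sketch in Python rather than Corinthia's C. It is not how DFStorage or minizip works. The function names build_index and read_entry are hypothetical. It assumes a seekable, unencrypted archive whose entries are either stored or deflated, and it leans on Python's standard zipfile module to read the directory once.

```python
import io
import struct
import zipfile
import zlib

LOCAL_HEADER_SIG = 0x04034B50  # "PK\x03\x04"


def build_index(fp):
    """Read the directory once and map each entry name to its ZipInfo."""
    with zipfile.ZipFile(fp) as zf:
        return {info.filename: info for info in zf.infolist()}


def read_entry(fp, index, name):
    """Fetch one entry by seeking straight to its recorded offset."""
    info = index[name]              # hash lookup, no scan over other entries
    fp.seek(info.header_offset)     # offset of this entry's local header
    fields = struct.unpack("<IHHHHHIIIHH", fp.read(30))
    sig, method, name_len, extra_len = fields[0], fields[3], fields[9], fields[10]
    if sig != LOCAL_HEADER_SIG:
        raise ValueError("no local file header at recorded offset")
    fp.seek(name_len + extra_len, 1)        # skip the variable-length fields
    data = fp.read(info.compress_size)      # size taken from the directory
    if method == 0:                         # stored
        return data
    if method == 8:                         # deflate (raw stream, no zlib header)
        return zlib.decompress(data, -15)
    raise NotImplementedError("unsupported compression method %d" % method)


# Build a small archive in memory so the sketch is self-contained.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("mimetype", "application/example")
    zf.writestr("word/document.xml", "<doc>hello</doc>")

index = build_index(buf)
print(read_entry(buf, index, "word/document.xml"))
```

Each lookup costs one dictionary access and one seek, however many entries the package holds.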
[ ... ]
<orcmid>
@Peter, I want to verify that we have the same understanding of the Zip file.
The Zip file itself has a global directory to all of the component files at
the end of the file. The global directory provides offsets to where each
component file begins in the Zip stream and also provides other pertinent
information.
To produce a Zip file, minizip would need to remember all of this to append
to the stream once all of the part files are written out.
The global directory could certainly be cached and, if necessary, indexed from
a hash table on the names of the component parts.
Without looking at minizip, I would assume that there has to be some
internal representation of the global directory even if it is not exposed.
Would it be useful to exploit that somehow in providing a better API?
So long as the Zip stream can be read via random access, it is normal to
access the global directory first and then access the parts based on the global
directory, even if access is in sequential order of those parts in the stream.
That helps detect apparent corruption of the Zip and it is essential when the
header for a component file does not specify the length of the file data.
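
To make the layout described above concrete, here is a hedged Python sketch that finds the directory from the end of the stream and walks it, collecting each part's offset and sizes. It says nothing about minizip's internals. The name read_directory is hypothetical. It assumes a single-disk archive without Zip64 extensions and with no data prepended to the archive.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"        # end-of-directory record signature
CDIR_SIG = 0x02014B50           # per-entry directory header signature


def read_directory(fp):
    """Return {name: (local_header_offset, compressed_size, size)}."""
    fp.seek(0, 2)
    total = fp.tell()
    # The end record is 22 bytes plus an optional comment of up to 65535 bytes.
    tail_len = min(total, 22 + 65535)
    fp.seek(total - tail_len)
    tail = fp.read(tail_len)
    pos = tail.rfind(EOCD_SIG)
    if pos < 0:
        raise ValueError("end-of-directory record not found")
    eocd = struct.unpack("<IHHHHIIH", tail[pos:pos + 22])
    entry_count, dir_offset = eocd[4], eocd[6]

    entries = {}
    fp.seek(dir_offset)
    for _ in range(entry_count):
        f = struct.unpack("<IHHHHHHIIIHHHHHII", fp.read(46))
        if f[0] != CDIR_SIG:
            raise ValueError("corrupt directory entry")
        flags, csize, usize = f[3], f[8], f[9]
        name_len, extra_len, comment_len, offset = f[10], f[11], f[12], f[16]
        raw_name = fp.read(name_len)
        # Bit 11 of the flags marks a UTF-8 name; otherwise assume CP437.
        name = raw_name.decode("utf-8" if flags & 0x800 else "cp437")
        fp.seek(extra_len + comment_len, 1)
        entries[name] = (offset, csize, usize)
    return entries


# Self-contained demonstration on an in-memory archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("mimetype", "application/example")
    zf.writestr("content.xml", "<doc/>")

for name, (offset, csize, usize) in read_directory(buf).items():
    print(name, offset, csize, usize)
```

Because the sizes come from the directory, a reader built this way does not depend on the length fields in each part's own header.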
Does this square with your understanding of what is involved in minizip
operation?
</orcmid>