I realise that I haven’t done a very good job of documenting the code in
Corinthia, as you’ve probably noticed :) I’ve been meaning to get around to
this for a while now.
There’s an “abstract class” (or, more accurately, interface) called DFStorage
which abstracts over different ways of storing “files” (byte arrays/byte
streams) - the concrete implementations are 1) memory 2) a directory in the
filesystem and 3) a zip file.
The idea with the zip implementation of DFStorage is that you create such an
object, and when you read from or write to it, it works directly from the zip
file. For example:
DFStorage *st = DFStorageOpenZip("filename.docx",&error);
if (st != NULL) {
    DFBuffer *foo = DFStorageRead(st,…);
    DFStorageWrite(st,…);
    // etc.
    DFStorageSave(st);
    DFBufferRelease(foo);
    DFStorageRelease(st);
}
This way, filter code doesn’t need to care how the data is actually stored.
The current implementation is a very simplistic one: it simply reads the whole
zip file into memory. This is largely due to a limitation in the minizip API,
which enforces sequential access to the entries in a file. It would be
conceivable to have the zip DFStorage implementation first read a directory
listing, and then, for each file that's requested, do a linear scan through
the entries until it finds the requested one, and read it from there. This
would be an O(n) operation, but would be unlikely to be a major problem, since
most zip packages we're dealing with will only have a fairly small number of
entries.
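To make that concrete, here's a rough sketch of what such a lookup could look
like using minizip's sequential API (illustrative only, not actual Corinthia
code; the helper name readZipEntry and the fixed-size name buffer are made up
for this example):

#include <stdlib.h>
#include <string.h>
#include "unzip.h" // minizip

// Linear scan: walk the entries in order until we hit the requested name,
// then decompress that entry into a freshly allocated buffer.
static void *readZipEntry(unzFile zip, const char *wanted, size_t *outLen)
{
    if (unzGoToFirstFile(zip) != UNZ_OK)
        return NULL;
    do {
        unz_file_info info;
        char name[4096];
        if (unzGetCurrentFileInfo(zip,&info,name,sizeof(name),NULL,0,NULL,0) != UNZ_OK)
            return NULL;
        if (strcmp(name,wanted) != 0)
            continue;
        // Found it - O(n) to get here, since minizip only moves forwards
        char *buf = malloc(info.uncompressed_size+1);
        if ((buf == NULL) || (unzOpenCurrentFile(zip) != UNZ_OK)) {
            free(buf);
            return NULL;
        }
        int r = unzReadCurrentFile(zip,buf,(unsigned)info.uncompressed_size);
        unzCloseCurrentFile(zip);
        if (r < 0) {
            free(buf);
            return NULL;
        }
        *outLen = (size_t)r;
        return buf;
    } while (unzGoToNextFile(zip) == UNZ_OK);
    return NULL; // no entry with that name
}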
Minizip does not provide any way to cache the location in the zip file of a
particular entry, even though this information would be possible to obtain in
theory (just not through minizip's API). If I were writing a zip implementation
from scratch (and maybe this is something we could consider), I would have it
read a list of all entries and remember their locations in a hash table, so
that when a particular named entry is requested, we can go directly to that
point in the file without having to do a linear scan.
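For what it's worth, the shape of that idea is something like the following (a
sketch only, not code that exists anywhere; the struct names and the fixed
bucket count are arbitrary):

#include <stdint.h>
#include <string.h>

// One record per entry, built up front by parsing the central directory
typedef struct ZipEntry {
    char *name;                // entry name, e.g. "word/document.xml"
    uint64_t headerOffset;     // where the entry's local header lives in the file
    uint64_t compressedSize;
    uint64_t uncompressedSize;
    struct ZipEntry *hashNext; // next entry in the same bucket
} ZipEntry;

typedef struct ZipIndex {
    ZipEntry *buckets[256];    // small fixed-size hash table, purely for illustration
} ZipIndex;

static unsigned hashName(const char *name)
{
    unsigned h = 0;
    for (const char *c = name; *c != 0; c++)
        h = h*31 + (unsigned char)*c;
    return h % 256;
}

static ZipEntry *ZipIndexLookup(ZipIndex *index, const char *name)
{
    for (ZipEntry *e = index->buckets[hashName(name)]; e != NULL; e = e->hashNext) {
        if (strcmp(e->name,name) == 0)
            return e; // caller can now seek straight to e->headerOffset
    }
    return NULL;
}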
Writing to zip files is another inconvenience, because it's really only
possible to do so in an append-only manner. If a large image is deleted from a
document, or replaced with a modified image, we don't want to keep the old one
around; so instead we create an entirely new zip file and overwrite the old
one. As with reading, the current implementation stores all the content in
memory and then writes it out to disk in one go when you call DFStorageSave().
However, for documents containing large images this may mean unacceptable
amounts of memory usage, depending on the application/environment.
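One way around that would be to stream each entry straight into the
replacement zip file as it's written, rather than queuing everything up in
memory - roughly along these lines, using minizip's write-side API (again just
a sketch; copyEntryToNewZip and the temporary-file dance are made up for the
example):

#include <stdio.h>
#include "zip.h" // minizip's write-side API

// Stream one entry into the output zip in 64k chunks, so a large image
// never has to be held in memory in its entirety.
static int copyEntryToNewZip(zipFile out, const char *name, FILE *src)
{
    if (zipOpenNewFileInZip(out,name,NULL,NULL,0,NULL,0,NULL,
                            Z_DEFLATED,Z_DEFAULT_COMPRESSION) != ZIP_OK)
        return 0;
    char chunk[65536];
    size_t n;
    while ((n = fread(chunk,1,sizeof(chunk),src)) > 0) {
        if (zipWriteInFileInZip(out,chunk,(unsigned)n) != ZIP_OK) {
            zipCloseFileInZip(out);
            return 0;
        }
    }
    zipCloseFileInZip(out);
    return 1;
}

// Usage (error handling omitted):
//   zipFile out = zipOpen("document.docx.tmp",APPEND_STATUS_CREATE);
//   ... copyEntryToNewZip(out,...) for each entry we want to keep ...
//   zipClose(out,NULL);
//   rename("document.docx.tmp","document.docx");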
Coming back to what I said in my previous response to Jan about having multiple
zip files open at a time, it’s not done at the moment within the context of a
single conversion (but can happen with multiple threads if there are several
conversions going on at the same time). However, if we were to adopt the above
approach to limit memory usage of the zip-based DFStorage objects, and we were
converting, say, directly from one zip-based file format to another (think OOXML
to ODF), this would require the ability to have multiple zip files open at the
same time, and in the same thread.
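In other words, a conversion would end up looking roughly like this (same
DFStorage calls as in the example above; how the output package actually gets
created on disk is glossed over here):

DFError *error = NULL;
DFStorage *in = DFStorageOpenZip("input.docx",&error);  // OOXML package
DFStorage *out = DFStorageOpenZip("output.odt",&error); // ODF package
if ((in != NULL) && (out != NULL)) {
    // Both zip-backed DFStorage objects are live in the same thread for
    // the whole conversion: read parts from 'in', transform them, and
    // write the results into 'out'.
    // DFStorageRead(in,…) / DFStorageWrite(out,…) as in the earlier example
    DFStorageSave(out);
}
if (in != NULL)
    DFStorageRelease(in);
if (out != NULL)
    DFStorageRelease(out);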
On the question of providing our own versions of the APIs for external
libraries, I guess I can sort of see some benefit there, in the sense that we
can, in theory at least, swap out a different implementation. But this is only
possible if the other implementation works the same semantically, with only a
different syntax. For example, the use of wrapper functions in and of itself
does not really help much IMHO, as the way in which a given piece of
functionality is exposed may differ between libraries. I'm not opposed to
wrapper functions as in Jan's branch as such, but it's just some food for
thought.
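To illustrate what I mean about semantics versus syntax: you can give two
libraries the same wrapper signature, but the behaviour underneath can still
differ (a contrived sketch; ourZipHasEntry is made up, and minizip/libzip are
just examples):

#ifdef USE_MINIZIP
#include "unzip.h"   // minizip
#else
#include <zip.h>     // libzip
#endif

// Same wrapper signature over two real libraries, different behaviour beneath:
// minizip's unzLocateFile() seeks through the archive entry by entry, while
// libzip's zip_name_locate() works from the central directory it already read
// when the archive was opened. The wrapper hides the syntax, not the semantics.
int ourZipHasEntry(void *handle, const char *name)
{
#ifdef USE_MINIZIP
    return (unzLocateFile((unzFile)handle,name,1) == UNZ_OK);
#else
    return (zip_name_locate((zip_t *)handle,name,0) >= 0);
#endif
}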
—
Dr Peter M. Kelly
[email protected]
PGP key: http://www.kellypmk.net/pgp-key <http://www.kellypmk.net/pgp-key>
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)