I realise that I haven’t done a very good job of documenting the code in 
Corinthia, as you’ve probably noticed :) I’ve been meaning to get around to 
this for a while now.

There’s an “abstract class” (or, more accurately, an interface) called DFStorage 
which abstracts over different ways of storing “files” (byte arrays/byte 
streams). The concrete implementations are 1) memory, 2) a directory in the 
filesystem, and 3) a zip file.

The idea with the zip implementation of DFStorage is that you create such an 
object, and when you read from or write to it, it works directly from the zip 
file. For example:

DFError *error = NULL;
DFStorage *st = DFStorageOpenZip("filename.docx",&error);
if (st != NULL) {
    DFBuffer *foo = DFStorageRead(st,…);
    DFStorageWrite(st,…);
    // etc.
    DFStorageSave(st);
    DFBufferRelease(foo);  // release the buffer returned by DFStorageRead
    DFStorageRelease(st);  // release the storage object itself
}

This way, filter code doesn’t need to care how the data is actually stored.

The current implementation is a very simplistic one: it simply reads the whole 
zip file into memory. This is largely due to a limitation of the minizip API, 
which enforces sequential access to the entries in a file. It would be 
conceivable to have the zip DFStorage implementation first read a directory 
listing, and then, for each file that’s requested, do a linear scan through all 
the entries until it finds the requested one and read just that. This would be 
an O(n) operation per lookup, but would be unlikely to be a major problem, 
since most zip packages we’re dealing with have only a fairly small number of 
entries.
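To make that concrete, here’s a rough sketch (a hypothetical helper, not code 
from Corinthia) of reading a single entry on demand via minizip’s unzip API; 
unzLocateFile() performs exactly the kind of linear scan described above:

#include <stdlib.h>
#include "unzip.h" // minizip

// Hypothetical helper: read one named entry on demand instead of loading the
// whole archive up front. Returns malloc'd data, or NULL on failure.
static void *readZipEntry(const char *zipPath, const char *entryName, size_t *outLen)
{
    unzFile uf = unzOpen(zipPath);
    if (uf == NULL)
        return NULL;
    void *data = NULL;
    if (unzLocateFile(uf,entryName,1) == UNZ_OK) { // O(n) scan over the entries
        unz_file_info info;
        if ((unzGetCurrentFileInfo(uf,&info,NULL,0,NULL,0,NULL,0) == UNZ_OK) &&
            (unzOpenCurrentFile(uf) == UNZ_OK)) {
            data = malloc(info.uncompressed_size);
            if ((data != NULL) &&
                (unzReadCurrentFile(uf,data,(unsigned)info.uncompressed_size)
                 != (int)info.uncompressed_size)) {
                free(data); // short or failed read
                data = NULL;
            }
            if (data != NULL)
                *outLen = info.uncompressed_size;
            unzCloseCurrentFile(uf);
        }
    }
    unzClose(uf);
    return data;
}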

Minizip does not provide any way to cache the location of a particular entry 
within the zip file, even though this information is obtainable in principle 
(just not through minizip’s API). If I were writing a zip implementation from 
scratch (and maybe this is something we could consider), I would have it read a 
list of all entries and remember their locations in a hash table, so that when 
a particular named entry is requested, we can go directly to that point in the 
file without having to do a linear scan.
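A minimal sketch of that idea, with entirely made-up names and types (no 
resizing or cleanup shown; the offsets would come from a single pass over the 
zip’s central directory at open time):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical name -> offset map; none of this is Corinthia or minizip code.
typedef struct {
    char *name;      // entry name, e.g. "word/document.xml"
    uint64_t offset; // byte offset of the entry's local header in the zip file
} ZipEntryLoc;

typedef struct {
    ZipEntryLoc *slots; // open-addressed table; capacity is a power of two
    size_t capacity;    // and larger than the number of entries
} ZipEntryMap;

static size_t hashName(const char *s)
{
    size_t h = 5381;
    for (; *s != '\0'; s++)
        h = h*33 + (unsigned char)*s;
    return h;
}

// Called once per entry while scanning the central directory at open time
static void ZipEntryMapInsert(ZipEntryMap *map, const char *name, uint64_t offset)
{
    size_t i = hashName(name) & (map->capacity - 1);
    while (map->slots[i].name != NULL) // linear probing
        i = (i + 1) & (map->capacity - 1);
    map->slots[i].name = strdup(name);
    map->slots[i].offset = offset;
}

// Called when a named entry is requested: jump straight to its offset,
// instead of scanning through all the preceding entries.
static int ZipEntryMapLookup(const ZipEntryMap *map, const char *name, uint64_t *offset)
{
    size_t i = hashName(name) & (map->capacity - 1);
    while (map->slots[i].name != NULL) {
        if (!strcmp(map->slots[i].name,name)) {
            *offset = map->slots[i].offset;
            return 1;
        }
        i = (i + 1) & (map->capacity - 1);
    }
    return 0; // no such entry in the archive
}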

Writing to zip files is another inconvenient thing, because it’s really only 
possible to do so in an append-only manner. If a large image is deleted from a 
document, or replaced with a modified image, we don’t want to keep the old one 
around; so instead we create an entirely new zip file and overwrite the old 
one. As with reading, the current implementation keeps all the content in 
memory, then writes it out to disk in one go when you call DFStorageSave(). 
However, for documents containing large images this may mean unacceptable 
amounts of memory usage, depending on the application/environment.
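For reference, the “write everything out in one go” step looks roughly like 
this against minizip’s zip.h API (a standalone hypothetical sketch, not the 
actual DFStorageSave() code):

#include <string.h>
#include "zip.h"  // minizip's write-side API
#include "zlib.h" // Z_DEFLATED, Z_DEFAULT_COMPRESSION

// Hypothetical sketch: write every entry to a fresh archive in one pass.
// Since minizip can only append, a save produces a brand new zip file,
// which then replaces the old one on disk.
static int writeWholeZip(const char *path, const char **names,
                         const void **bodies, const size_t *lengths, size_t count)
{
    zipFile zf = zipOpen(path,APPEND_STATUS_CREATE);
    if (zf == NULL)
        return 0;
    int ok = 1;
    for (size_t i = 0; ok && (i < count); i++) {
        zip_fileinfo info;
        memset(&info,0,sizeof(info));
        ok = (zipOpenNewFileInZip(zf,names[i],&info,NULL,0,NULL,0,NULL,
                                  Z_DEFLATED,Z_DEFAULT_COMPRESSION) == ZIP_OK)
          && (zipWriteInFileInZip(zf,bodies[i],(unsigned)lengths[i]) == ZIP_OK)
          && (zipCloseFileInZip(zf) == ZIP_OK);
    }
    return (zipClose(zf,NULL) == ZIP_OK) && ok;
}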

Coming back to what I said in my previous response to Jan about having multiple 
zip files open at a time: it’s not done at the moment within the context of a 
single conversion (but can happen across multiple threads if there are several 
conversions going on at the same time). However, if we were to adopt the above 
approach to limit the memory usage of zip-based DFStorage objects, and we were 
converting, say, directly from one zip-based file format to another (think 
OOXML to ODF), this would require the ability to have multiple zip files open 
at the same time, in the same thread.
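In code terms, that scenario would look something like the following (purely 
illustrative; I’m assuming a suitable factory call exists for creating the 
output package, and using DFStorageOpenZip as a stand-in for it):

DFError *error = NULL;
DFStorage *src = DFStorageOpenZip("input.docx",&error); // source package
DFStorage *dst = DFStorageOpenZip("output.odt",&error); // destination package (assumed factory call)
if ((src != NULL) && (dst != NULL)) {
    // read parts from src, convert, write the results to dst ...
    DFStorageSave(dst);
}
if (src != NULL)
    DFStorageRelease(src);
if (dst != NULL)
    DFStorageRelease(dst);

Both storage objects stay open in the same thread for the duration of the 
conversion.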

On the question of providing our own versions of the APIs for external 
libraries, I guess I can sort of see some benefit there now, in the sense that 
we can, in theory at least, swap in a different implementation. But this is 
only possible if the other implementation works the same way semantically, with 
only a different syntax. For example, the use of wrapper functions in and of 
itself does not really help much IMHO, as the way in which a given piece of 
functionality is exposed may differ between libraries. I’m not opposed to 
wrapper functions as in Jan’s branch as such; this is just some food for 
thought.

—
Dr Peter M. Kelly
[email protected]

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
