background
----------
As recent threads on the list have discussed, we need a binary Word
exporter. The current plan is to implement this functionality in the
existing wvware framework (cvs module wv).
However, there are several steps from here to a functioning exporter.
Conveniently, many are easily modularized and appropriate for work
in parallel. So this is a multi-part, multi-hacker POW.
The following steps won't get us all the way to Word export, but
they're a start on assembling the various pieces needed for a
functioning exporter.
scope
-----
Part 1a.
Abstract wv's current OLE stream reads
This requires no knowledge of the Word format or, for the most part,
wv functionality. We just want to improve wv's existing file support
functions to make them a little more versatile.
wv currently uses a set of functions (in wv/support.c) along the
lines of:
U16 read_16ubit(FILE*);
U32 read_32ubit(FILE*);
U16 dread_16ubit(FILE*, U8**);
U32 dread_32ubit(FILE*, U8**);
U8 dgetc(FILE*, U8**);
The normal getc(FILE*) function is also used throughout the code.
We should modify the above functions, and all of the existing
wv code that reads from OLE streams, to make use of a wvStream*
in place of a FILE*. wvStream will initially be a typedef to FILE
(ie 'typedef FILE wvStream').
Don't forget the wvOLE* functions, too, which is where the
abstraction begins. You get FILE* from the old OLE code, and you'll
cast them to wvStream* here.
Also, all getc's will be replaced with a new support function.
"U8 read_8ubit(wvStream*);" seems like a logical choice. While
we're at it, I'd like dgetc renamed to dread_8ubit, too.
This abstraction lets us replace the OLE back-end transparently
to the rest of the code, which is the next step.
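To make part 1a concrete, here is a rough sketch of what the reworked support functions could look like. Treat it as an illustration, not the real wv/support.c: the U8/U16/U32 typedefs are stand-ins for wv's own, the little-endian byte order is stated as an assumption, and the behaviour of the dread_* variants (read from the stream if there is one, otherwise from the memory pointer, which is advanced) is my assumption about what dgetc does today.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for wv's own integer typedefs (assumption). */
typedef uint8_t  U8;
typedef uint16_t U16;
typedef uint32_t U32;

/* Part 1a: wvStream starts life as a plain alias for FILE. */
typedef FILE wvStream;

/* Replaces the bare getc() calls scattered through the code. */
U8 read_8ubit(wvStream *in)
{
    assert(in != NULL);
    return (U8)getc(in);
}

/* Assumes little-endian on-disk order: low byte first. */
U16 read_16ubit(wvStream *in)
{
    U16 lo = read_8ubit(in);
    U16 hi = read_8ubit(in);
    return (U16)(lo | (hi << 8));
}

U32 read_32ubit(wvStream *in)
{
    U32 lo = read_16ubit(in);
    U32 hi = read_16ubit(in);
    return lo | (hi << 16);
}

/* dgetc renamed. Assumed semantics: use the stream when one is
 * given, otherwise consume a byte from *list and advance it. */
U8 dread_8ubit(wvStream *in, U8 **list)
{
    U8 ret;
    if (in != NULL)
        return read_8ubit(in);
    ret = **list;
    (*list)++;
    return ret;
}

U16 dread_16ubit(wvStream *in, U8 **list)
{
    U16 lo = dread_8ubit(in, list);
    U16 hi = dread_8ubit(in, list);
    return (U16)(lo | (hi << 8));
}
```

Because every caller only ever sees wvStream*, swapping the typedef for a different stream type later should only touch these few functions.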
----
Part 1b.
Add support for libole2 (read and write)
libole2 is an OLE2 library for reading and writing to OLE2 structured
storage objects (ie. Office files). It can be found in gnumeric
currently. They may have split it out into a standalone library, but
if not, just extract the code from gnumeric's CVS repository.
It should go into the wv tree, under wv/ole2 perhaps. It might make
use of glib, so we'd either want to grab that, too, or remove the
glib dependencies.
Then, write d/read_Xubit functions for using libole2 (instead of
the current fread-based implementations).
Then, write write_Xubit functions for using libole2.
Again, these functions will go in wv/support.c
We'll want to #ifdef this new code and the old read
functions with a sensible #define.
Perhaps, something like:
#define OLE2MODULE LIBOLE2
#define OLE2MODULE OLEDECOD // the current OLE2 code
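One possible way to wire that up, as a sketch only: give the two names numeric values so the preprocessor can compare them, default to the current code, and pick the implementation with #if. The numeric values and the default are my assumptions. The libole2 branch is left as a stub on purpose, since its stream type and read calls shouldn't be guessed at here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Numeric tags so that #if can compare them (values are arbitrary). */
#define LIBOLE2  1
#define OLEDECOD 2

/* Default to the current OLE2 code unless the build says otherwise,
 * e.g. by passing -DOLE2MODULE=LIBOLE2 to the compiler. */
#ifndef OLE2MODULE
#define OLE2MODULE OLEDECOD
#endif

typedef uint8_t U8;

#if OLE2MODULE == LIBOLE2

/* The libole2-backed wvStream and read_8ubit would go here. */
#error "libole2 back-end not written yet"

#elif OLE2MODULE == OLEDECOD

/* Current behaviour: wvStream is a FILE and reads go through getc. */
typedef FILE wvStream;

U8 read_8ubit(wvStream *in)
{
    assert(in != NULL);
    return (U8)getc(in);
}

#else
#error "OLE2MODULE must be LIBOLE2 or OLEDECOD"
#endif
```

The same test can guard which source files get compiled, so only one back-end is ever linked in.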
Then, reimplement the functions in wv/laolareplace.c
for use with libole2. Specifcally, we care about:
wvOLEDecode, wvOLEFree, and wvFindEntry
We'll probably want to put these in their own file.
Perhaps wv/libole2.c?
wvOpenPreOLE (in wv/wvparse.c) only needs to cast the
input FILE* to wvStream* (which should have been accomplished
in part 1a). It just fakes the OLE streams for early Word
formats.
wvFindEntry will probably need to be changed to return OLE2
entries in a non-OLE2 library dependent way. It will only be
relevant for allowing the converter to extract arbitrary embedded
OLE2 objects (such as an Excel graph), which is definitely not an
immediately pressing need. When someone gets to this function,
we'll discuss it further.
The compilation of either:
1) wv/laolareplace.c and wv/oledecod/*
2) wv/libole2.c (or some similarly named file) and wv/ole2/*
should be conditional on the same #ifdef/#define stuff mentioned
above. Simply, link in the right set of code for the specific
OLE2 implementation.
At this point, the infrastructure is in place for writing to
streams in an OLE2 structured storage object. This is the bottom
level of structure in a Word file, and now we need code to write
pieces of the Word data into these streams in the right format.
-----
Part 2.
Write wvPut* functions.
While browsing through the wv source, you've probably noticed
many things like wvGetFIB, wvGetBTE, etc. These functions (and
possibly some associated helper functions) read from the OLE
stream and store the information in memory (with various
structs, arrays, and lists).
Now, we need to go the other way. In the appropriate file,
create the wvPut* function (and associated helper functions,
if necessary) to write back to an OLE stream from the passed
struct.
wvPutFIB would probably be a good place to start (wv/fib.c).
It's a straightforward record, and the implementation should
be fairly obvious based upon the wvGetFIB function.
The function definition should probably be something like:
U16 wvPutFIB(FIB*, wvStream*);
We'll assume that the stream is in the right place to write
(ie. the caller seeked the stream already), and errors should
be returned via the U16.
This implementation will, of course, be making use of the
write_Xubit functions created in part 1b. If part 1b has
not been done yet, people could start writing these wvPut*
functions while just pretending that the write_Xubit
functions actually existed.
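For anyone who wants to start before part 1b lands, here is a sketch of the general shape of a wvPut* function. Everything in it is made up for illustration: DemoRec is a hypothetical three-field record (not the real FIB), the write_Xubit functions are FILE-based placeholders with a (stream, value) argument order that I'm assuming, and "0 means success" is just one possible convention for the U16 return value.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint8_t  U8;
typedef uint16_t U16;
typedef uint32_t U32;
typedef FILE wvStream;          /* as in part 1a */

/* Placeholder writers; return 0 on success, 1 on failure. */
U16 write_8ubit(wvStream *out, U8 v)
{
    assert(out != NULL);
    return (putc(v, out) == EOF) ? 1 : 0;
}

/* Low byte first, mirroring an assumed little-endian reader. */
U16 write_16ubit(wvStream *out, U16 v)
{
    U16 err = write_8ubit(out, (U8)(v & 0xFF));
    if (!err)
        err = write_8ubit(out, (U8)(v >> 8));
    return err;
}

U16 write_32ubit(wvStream *out, U32 v)
{
    U16 err = write_16ubit(out, (U16)(v & 0xFFFF));
    if (!err)
        err = write_16ubit(out, (U16)(v >> 16));
    return err;
}

/* Hypothetical record, standing in for a real Word struct. */
typedef struct {
    U16 ident;
    U16 version;
    U32 offset;
} DemoRec;

/* Writes the fields in the same order a matching wvGet* function
 * would read them. The caller has already seeked the stream. */
U16 wvPutDemoRec(DemoRec *rec, wvStream *out)
{
    U16 err = write_16ubit(out, rec->ident);
    if (!err)
        err = write_16ubit(out, rec->version);
    if (!err)
        err = write_32ubit(out, rec->offset);
    return err;
}
```

A real wvPut* function would simply walk the struct field by field in the order its wvGet* counterpart reads them, stopping at the first write error.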
Now, there are lots of these, so many people could work
on this part, each taking a few types to implement.
To figure out exactly how you're supposed to write the data
out, you'll use a combination of techniques. First, consult
the Word file format documentation.
http://busboy.sped.ukans.edu/~justin/word/
Second, look over the corresponding wvGet function. If the
documentation and implementation differ, follow the
implementation.
By the time we're done with these functions, we are capable
of writing a complete Word document. HOWEVER, we don't have
any of the logic to populate all of these Word structs or
sequence all of the wvPut* calls so that the data is in the
appropriate order and location.
If people are interested in working on this further, I'll
write up POWs for the "logic" steps, too.
hints
-----
Include debugging trace messages liberally.
wvTrace(("status messages, helpful in tracking things down"));
wvWarning("something is strange and might indicate a problem");
wvError(("critical problem, something's definitely wrong"));
These work just like Abi's UT_DEBUGMSG(("%s", "version"));
except for wvWarning, which only has one set of parentheses.
wvTrace messages only show up in DEBUG builds. The others always show up.
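If the double parentheses look odd, this sketch shows the usual trick behind them. The DEMO_TRACE and DEMO_ERROR macros and demo_open are hypothetical and are not wv's actual definitions: the inner parentheses turn the whole printf-style argument list into a single macro argument, which the macro then pastes after a function name.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical macros, only illustrating the double-paren idiom. */
#ifdef DEBUG
#define DEMO_TRACE(args) (printf args)   /* expands to printf(...) */
#else
#define DEMO_TRACE(args) ((void)0)       /* compiled out entirely  */
#endif
#define DEMO_ERROR(args) (printf args)   /* always printed         */

/* Returns 0 on success, 1 if the name is empty. */
int demo_open(const char *name)
{
    assert(name != NULL);
    DEMO_TRACE(("opening %s\n", name));
    if (name[0] == '\0') {
        DEMO_ERROR(("empty file name\n"));
        return 1;
    }
    return 0;
}
```

Built without -DDEBUG, the DEMO_TRACE call vanishes and only the error message can ever be printed.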
Also, the Word 97 format can be found at:
http://busboy.sped.ukans.edu/~justin/word/
(scary, isn't it?)
extra credit
------------
If you've gotten this far, and the Word format hasn't driven you
insane, here are some ideas for a next step. Just as a warning,
these will require a better understanding of the Word format.
a) Write a CHP/PAP/SEP compressor to generate CHPX/PAPX/SEPX
SPRMs based on the property's base style.
b) Encode these SPRMs into the appropriate storage structure
for the associated exception run (such as FKP BTEs)
c) Write an escher wrapping function to encode bitmaps for
storage in the data stream
d) If you want more ideas or more explanation,
email me ([EMAIL PROTECTED])
----
PS: For more background on the whole POW / ZAP / SHAZAM concept, see
the following introduction:
http://www.abisource.com/mailinglists/abiword-dev/99/September/0097.html
Justin Bradford
[EMAIL PROTECTED]