POW -- Beginning the Binary Word Exporter

Justin Bradford Wed, 15 Mar 2000 17:28:27 -0600 (CST)

background
----------
As recent threads on the list have discussed, we need a binary Word 
exporter. The current plan is to implement this functionality in the 
existing wvware framework (cvs module wv).

However, there are several steps from here to a functioning exporter.
Conveniently, many are easily modularized and appropriate for work
in parallel. So this is a multi-part, multi-hacker POW.

The following steps won't get us all the way to Word export, but they 
are a start on assembling the various pieces needed for a functioning 
exporter.


scope
-----
Part 1a.
Abstract wv's current OLE stream reads

This requires no knowledge of the Word format or, for the most part,
wv functionality. We just want to improve wv's existing file support
functions to make them a little more versatile.

wv currently uses a set of functions (in wv/support.c) along the
lines of:
U16 read_16ubit(FILE*);
U32 read_32ubit(FILE*);
U16 dread_16ubit(FILE*, U8**);
U32 dread_32ubit(FILE*, U8**);
U8 dgetc(FILE*, U8**);

The normal getc(FILE*) function is also used throughout the code.

We should modify the above functions, and all of the existing
wv code (only that which is reading from OLE streams, of course),
to make use of a wvStream* in place of a FILE*. wvStream 
will initially be a typedef to FILE (i.e. 'typedef FILE wvStream').

Don't forget the wvOLE* functions, too, which is where the 
abstraction begins. You get FILE* from the old OLE code, and you'll
cast them to wvStream* here.

Also, all getc's will be replaced with a new support function.
"U8 read_8ubit(wvStream*);" seems like a logical choice. While
we're at it, I'd like to rename dgetc to dread_8ubit, too.

This abstraction lets us replace the OLE back-end transparently
to the rest of the code, which is the next step.
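
As a rough illustration of part 1a, here is a minimal sketch of what the
wvStream-based readers might look like. The U8/U16/U32 typedefs below are
stand-ins (wv has its own), and the byte order is assumed to be
little-endian; check both against the existing code in wv/support.c
before relying on this.

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in typedefs for this sketch; wv defines its own U8/U16/U32. */
typedef uint8_t  U8;
typedef uint16_t U16;
typedef uint32_t U32;

/* Part 1a: wvStream starts out as a plain alias for FILE. */
typedef FILE wvStream;

/* Replacement for bare getc() calls on OLE streams.
 * Note: EOF is not reported here; it is truncated to 0xFF. */
U8 read_8ubit(wvStream *in)
{
    return (U8)getc(in);
}

/* Reads 16 bits, least significant byte first (assumed byte order). */
U16 read_16ubit(wvStream *in)
{
    U16 lo = read_8ubit(in);
    U16 hi = read_8ubit(in);
    return (U16)(lo | (hi << 8));
}

/* Reads 32 bits as two 16-bit halves, low half first. */
U32 read_32ubit(wvStream *in)
{
    U32 lo = read_16ubit(in);
    U32 hi = read_16ubit(in);
    return lo | (hi << 16);
}
```

Because callers only ever see wvStream*, swapping the typedef for a
different back-end later should not require touching them again.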

----
Part 1b.
Add support for libole2 (read and write)

libole2 is an OLE2 library for reading and writing to OLE2 structured
storage objects (i.e. Office files). It can be found in gnumeric 
currently. They may have split it out into a standalone library, but 
if not, just extract the code from gnumeric's CVS repository.

It should go into the wv tree, under wv/ole2 perhaps. It might make
use of glib, so we'd either want to grab that, too, or remove the
glib dependencies.

Then, write d/read_Xubit functions for using libole2 (instead of
the current fread-based implementations).

Then, write write_Xubit functions for using libole2.

Again, these functions will go in wv/support.c
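
To make the shape of the write_Xubit functions concrete, here is a small
sketch that uses plain stdio as a stand-in for libole2 (the real versions
would call into libole2 instead of putc). The argument order, the int
return value (bytes written), and the little-endian byte order are all
assumptions made for this example, not something already settled.

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in typedefs for this sketch; wv defines its own U8/U16/U32. */
typedef uint8_t  U8;
typedef uint16_t U16;
typedef uint32_t U32;
typedef FILE wvStream;   /* as in part 1a */

/* Writes one byte; returns the number of bytes written (0 or 1). */
int write_8ubit(wvStream *out, U8 v)
{
    return putc(v, out) == EOF ? 0 : 1;
}

/* Writes 16 bits, least significant byte first. */
int write_16ubit(wvStream *out, U16 v)
{
    int n = write_8ubit(out, (U8)(v & 0xFF));
    n += write_8ubit(out, (U8)(v >> 8));
    return n;
}

/* Writes 32 bits as two 16-bit halves, low half first. */
int write_32ubit(wvStream *out, U32 v)
{
    int n = write_16ubit(out, (U16)(v & 0xFFFF));
    n += write_16ubit(out, (U16)(v >> 16));
    return n;
}
```

The two byte writes are kept in separate statements on purpose, so the
output order does not depend on how the compiler evaluates an expression.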

We'll want to #ifdef this new code and the old read 
functions with a sensible #define. 
Perhaps something like one of:
#define OLE2MODULE LIBOLE2
#define OLE2MODULE OLEDECOD // the current OLE2 code
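
One way the selection could work in practice is sketched below. For the
preprocessor to compare them, LIBOLE2 and OLEDECOD need numeric values;
those values, the default, and the wv_ole_backend() helper are all
hypothetical and only here to show the mechanism.

```c
/* Give each back-end a number so #if can compare them. */
#define LIBOLE2  1
#define OLEDECOD 2

/* Default to the current code unless the build says otherwise,
 * e.g. by passing -DOLE2MODULE=LIBOLE2 to the compiler. */
#ifndef OLE2MODULE
#define OLE2MODULE OLEDECOD
#endif

#if OLE2MODULE == LIBOLE2
/* libole2-based d/read_Xubit and write_Xubit would go here. */
const char *wv_ole_backend(void) { return "libole2"; }
#elif OLE2MODULE == OLEDECOD
/* the existing fread-based functions would stay here. */
const char *wv_ole_backend(void) { return "oledecod"; }
#else
#error "OLE2MODULE must be LIBOLE2 or OLEDECOD"
#endif
```

The #error branch catches a mistyped value at compile time rather than
silently building with neither implementation.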

Then, reimplement the functions in wv/laolareplace.c
for use with libole2. Specifically, we care about:

wvOLEDecode, wvOLEFree, and wvFindEntry

We'll probably want to put these in their own file.
Perhaps wv/libole2.c?

wvOpenPreOLE (in wv/wvparse.c) only needs to cast the
input FILE* to wvStream* (which should have been accomplished
in part 1a). It just fakes the OLE streams for early Word
formats.

wvFindEntry will probably need to be changed to return OLE2 
entries in a non-OLE2 library dependent way. It will only be
relevant for allowing the converter to extract arbitrary embedded 
OLE2 objects (such as an Excel graph), which is definitely not an 
immediately pressing need. When someone gets to this function, 
we'll discuss it further.

The compilation of either:

1) wv/laolareplace.c and wv/oledecod/*
2) wv/libole2.c (or some similarly named file) and wv/ole2/* 

should be conditional on the same #ifdef/#define stuff mentioned
above. Simply, link in the right set of code for the specific
OLE2 implementation.

At this point, the infrastructure is in place for writing to
streams in an OLE2 structured storage object. This is the bottom
level of structure in a Word file, and now we need code to write 
pieces of the Word data into these streams in the right format.

-----
Part 2.
Write wvPut* functions.

While browsing through the wv source, you've probably noticed
many things like wvGetFIB, wvGetBTE, etc. These functions (and
possibly some associated helper functions) read from the OLE 
stream and store the information in memory (with various 
structs, arrays, and lists).

Now, we need to go the other way. In the appropriate file,
create the wvPut* function (and associated helper functions,
if necessary) to write back to an OLE stream from the passed
struct.

wvPutFIB would probably be a good place to start (wv/fib.c).
It's a straightforward record, and the implementation should 
be fairly obvious based upon the wvGetFIB function.

The function definition should probably be something like:
U16 wvPutFIB(FIB*, wvStream*);

We'll assume that the stream is in the right place to write
(i.e. the caller has already seeked the stream), and errors should 
be returned via the U16.
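
To show the Get/Put symmetry without reproducing the real FIB, here is a
sketch built around a made-up three-field record. DEMO, wvGetDemo and
wvPutDemo are hypothetical names, the stdio-based helpers stand in for
the part 1b functions, and "0 means success" is an assumed convention
for the U16 return value.

```c
#include <stdint.h>
#include <stdio.h>

typedef uint8_t  U8;
typedef uint16_t U16;
typedef uint32_t U32;
typedef FILE wvStream;   /* as in part 1a */

/* Hypothetical record for illustration; NOT the real FIB layout. */
typedef struct {
    U16 ident;
    U16 version;
    U32 text_offset;
} DEMO;

/* Stand-in writers: each returns 0 on success, 1 on failure. */
static int write_8ubit(wvStream *out, U8 v)
{
    return putc(v, out) == EOF ? 1 : 0;
}

static int write_16ubit(wvStream *out, U16 v)
{
    int err = write_8ubit(out, (U8)(v & 0xFF));
    err |= write_8ubit(out, (U8)(v >> 8));
    return err;
}

static int write_32ubit(wvStream *out, U32 v)
{
    int err = write_16ubit(out, (U16)(v & 0xFFFF));
    err |= write_16ubit(out, (U16)(v >> 16));
    return err;
}

/* Stand-in readers, little-endian like the writers above. */
static U16 read_16ubit(wvStream *in)
{
    U16 lo = (U8)getc(in);
    U16 hi = (U8)getc(in);
    return (U16)(lo | (hi << 8));
}

static U32 read_32ubit(wvStream *in)
{
    U32 lo = read_16ubit(in);
    U32 hi = read_16ubit(in);
    return lo | (hi << 16);
}

/* The reader: fields are consumed in on-disk order. */
void wvGetDemo(DEMO *rec, wvStream *in)
{
    rec->ident       = read_16ubit(in);
    rec->version     = read_16ubit(in);
    rec->text_offset = read_32ubit(in);
}

/* The writer mirrors the reader field for field, same order and
 * widths. Returns 0 on success, nonzero if any write failed. */
U16 wvPutDemo(DEMO *rec, wvStream *out)
{
    int err = 0;
    err |= write_16ubit(out, rec->ident);
    err |= write_16ubit(out, rec->version);
    err |= write_32ubit(out, rec->text_offset);
    return (U16)(err ? 1 : 0);
}
```

Writing a struct with the Put function, rewinding, and reading it back
with the matching Get function is a cheap way to check that the two
agree on field order and width.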

This implementation will, of course, be making use of the
write_Xubit functions created in part 1b. If part 1b has
not been done yet, people could start writing these wvPut*
functions while just pretending that the write_Xubit 
functions actually existed.

Now, there are lots of these, so many people could work
on this part, each taking a few types to implement.

To figure out exactly how you're supposed to write the data
out, you'll use a combination of techniques. First, consult 
the Word file format documentation.

http://busboy.sped.ukans.edu/~justin/word/ 

Second, look over the corresponding wvGet function. If the
documentation and implementation differ, follow the 
implementation.

By the time we're done with these functions, we will be capable
of writing a complete Word document. HOWEVER, we won't have 
any of the logic to populate all of these Word structs or 
sequence all of the wvPut* calls so that the data is in the 
appropriate order and location.

If people are interested in working on this further, I'll
write up POWs for the "logic" steps, too.


hints
-----
Include debugging trace messages liberally.

wvTrace(("status messages, helpful in tracking things down"));
wvWarning("something is strange and might indicate a problem");
wvError(("critical problem, something's definitely wrong"));

These work just like Abi's UT_DEBUGMSG(("%s", "version"));
except for wvWarning, which only has one set of parentheses.

wvTrace messages only show up in DEBUG builds. The others always show up.
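
If the double parentheses look odd, the following sketch shows the
general idiom behind them. This is not wv's actual definition: demo_log,
demo_last, DEMO_TRACE and DEMO_ERROR are hypothetical names used only to
show how a one-argument macro can forward a printf-style argument list.

```c
#include <stdarg.h>
#include <stdio.h>

/* Holds the most recent message so it can be inspected afterwards. */
char demo_last[256];

/* Hypothetical printf-style sink: formats, remembers, and prints. */
void demo_log(const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    vsnprintf(demo_last, sizeof demo_last, fmt, ap);
    va_end(ap);
    fputs(demo_last, stderr);
}

/* The macro takes ONE argument: the whole parenthesised list.
 * DEMO_ERROR(("x=%d", 1)) expands to: demo_log ("x=%d", 1) */
#define DEMO_ERROR(args) demo_log args

/* The trace variant compiles away entirely outside DEBUG builds. */
#ifdef DEBUG
#define DEMO_TRACE(args) demo_log args
#else
#define DEMO_TRACE(args) ((void)0)
#endif
```

The extra parentheses let an older compiler without variadic macros pass
any number of arguments through a macro that can be switched off.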

Also, the Word 97 format can be found at:
http://busboy.sped.ukans.edu/~justin/word/ 
(scary, isn't it?)


extra credit
------------
If you've gotten this far, and the Word format hasn't driven you
insane, here are some ideas for a next step. Just as a warning,
these will require a better understanding of the Word format.

a) Write a CHP/PAP/SEP compressor to generate CHPX/PAPX/SEPX
SPRMs based on the property's base style. 

b) Encode these SPRMs into the appropriate storage structure
for the associated exception run (such as FKP BTEs)

c) Write an escher wrapping function to encode bitmaps for
storage in the data stream

d) If you want more ideas or more explanation, 
email me ([EMAIL PROTECTED])

----

PS:  For more background on the whole POW / ZAP / SHAZAM concept, see
the following introduction:
 
http://www.abisource.com/mailinglists/abiword-dev/99/September/0097.html


Justin Bradford
[EMAIL PROTECTED]