Re: Parsing MS Word documents

David A. Desrosiers Tue, 04 Feb 2003 08:21:38 -0800

> The limitation with this is that the user has to plan for this and set up
> the scripts for each pdb that may have a Word doc.


        Not at all. You can write a single script that takes the path to a
Word doc as an argument and converts it.
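
        As a rough sketch of such a wrapper (in Python here, though Perl would
do just as well), something like the following should be close. It assumes
wv's wvHtml tool is installed and on the PATH, and that its usual
"wvHtml --targetdir=DIR input.doc output.html" invocation applies. The script
name doc2html.py and the function names are made up for illustration.

```python
import os
import subprocess
import sys


def build_command(doc_path, out_dir):
    """Return the wvHtml argument list for converting doc_path into out_dir."""
    base = os.path.splitext(os.path.basename(doc_path))[0]
    # Assumption: wvHtml writes the named output file inside --targetdir.
    return ["wvHtml", "--targetdir=" + out_dir, doc_path, base + ".html"]


def main(argv):
    """Convert the single Word document named on the command line."""
    if len(argv) != 2:
        print("usage: doc2html.py FILE.doc", file=sys.stderr)
        return 2
    doc_path = argv[1]
    # Put the HTML next to the source document.
    out_dir = os.path.dirname(os.path.abspath(doc_path))
    # Hand back wvHtml's own exit status to the caller.
    return subprocess.call(build_command(doc_path, out_dir))
```

To use it from the command line, call sys.exit(main(sys.argv)) at the bottom
of the file; the resulting HTML can then be fed to whichever Plucker parser
you already use.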

> Also, last I knew, the Perl scripts didn't work under Windows, where Word
> docs are most likely to be.

        Perl runs under Windows, so I don't see why this wouldn't work.
Also, under Windows you don't need to use wv anymore: you can use the stock
OLE Perl modules (Win32::OLE) and get the actual data out of the Word
document directly, instead of having to convert it and manage the converted
portions. This is how Excel spreadsheets are converted to Plucker (and other
formats) on Windows.

> If scripts need to be made/altered anyway, why not include it in the
> python parser?

        Not everyone uses the Python distiller code.

> Of course, I'd like to see more document types supported.  I use plucker
> more as a document reader than a content grabber.  A lot of periodic
> corporate data is distributed as Word documents or PDFs though, so
> supporting these formats in the parser (even as a wrapper to another tool
> like wv) could accomplish both.

        PDF and Word documents are easy to convert. PostScript and other
esoteric formats aren't so easy and require a bit more "massaging", but
hopefully, as more and more people begin using OpenOffice.org and its default
compressed XML format, conversion should become much easier to manage.
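
        To illustrate why the compressed XML format is easier to deal with: an
OpenOffice.org text document is a zip archive whose body lives in a member
named content.xml, so the text can be pulled out with stock tools. The sketch
below is Python and only an approximation. It matches paragraph and heading
elements by local name rather than hard-coding a namespace URI, and the
function names are invented for this example.

```python
import zipfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_paragraphs(path_or_file):
    """Return the text of each paragraph or heading found in content.xml."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    paragraphs = []
    for element in root.iter():
        # Assumption: body text sits in elements whose local names are
        # 'p' (paragraph) or 'h' (heading), whatever the namespace URI.
        if local_name(element.tag) in ("p", "h"):
            paragraphs.append("".join(element.itertext()))
    return paragraphs
```

Calling extract_paragraphs("report.sxw") would give a list of strings, one per
paragraph. Formatting, tables and images are ignored here, and paragraphs
nested inside frames would need extra handling.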

        Anyone know offhand why AbiWord still uses PalmDOC as a SaveAs
format, and not Plucker? You lose so much formatting going from a rich text
editor like AbiWord to PalmDOC instead of Plucker. Besides, Plucker is OSS
and PalmDOC generally is not, though the format is known (and stagnating).


d.


_______________________________________________
plucker-dev mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-dev
