I think Dan is talking about documents from older versions of Word, which predate the XML format, if I'm not mistaken. There are Word-to-HTML converters for Linux, but the OO approach sounds like the best choice, especially with the Python scripting capability. One thing to note is that OO 2 still doesn't handle tables perfectly: if you have text that flows down a column and over onto another page, you can run into problems. It has sort of been fixed, but it still crashes quite a bit for me.

Joseph

Tim Churches wrote:
Daniel L. Johnson wrote:

Dear All,

Anybody here know of a tool to convert Microsoft Word files to XML or
HTML?  We have a huge archive of Word files...


What sort of XML? MS-Word can save its documents as XML, but the DTD used
is proprietary.

As Ignacio said, MS-Word can save as HTML, but the resulting HTML files
are full of proprietary Microsoft extensions to HTML. MS-Word 2002 and
later offer a choice to save as "filtered HTML", which is a bit cleaner
but still horrible.

The best way to convert MS-Word files to an open standards-based XML
format is to use a beta version of the forthcoming OpenOffice 2.0 - see
http://www.openoffice.org/  The beta versions work fine, and will save
to the OASIS OpenDocument XML standards (see
http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office ).
Actually, I think OpenOffice 1.1.4 also allows you to save to
OpenDocument format, but the OpenOffice 2.0 beta will do a better job of
importing complex MS-Word documents (especially if they have nested
tables).

It should be easy to write a macro to automate the conversion, or you
can drive OpenOffice from a Python script via PyUNO if you are keen.
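
For the PyUNO route, here is a minimal sketch rather than a tested recipe. It assumes OpenOffice is already running and listening on localhost port 2002, that the script runs under a Python 3 interpreter that has the uno module available (typically the one bundled with the office suite), and that the export filter name "writer8" selects OpenDocument Text. The names target_path and convert_all, and the directory names in the usage example, are made up for illustration.

```python
from pathlib import Path

# Assumption: the office suite was started separately with a listening socket,
# for example (older releases use single dashes, newer ones double dashes):
#   soffice -headless "-accept=socket,host=localhost,port=2002;urp;"
UNO_URL = "uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext"
ODT_FILTER = "writer8"  # assumed export filter name for OpenDocument Text


def target_path(src, out_dir):
    """Return the .odt path in out_dir that corresponds to the source file."""
    return Path(out_dir) / (Path(src).stem + ".odt")


def convert_all(src_dir, out_dir):
    """Convert every *.doc file in src_dir to OpenDocument Text in out_dir."""
    # Imported here so the module can be loaded without PyUNO installed.
    import uno
    from com.sun.star.beans import PropertyValue

    def prop(name, value):
        # Build a UNO PropertyValue struct field by field.
        p = PropertyValue()
        p.Name = name
        p.Value = value
        return p

    # Connect to the running office process and get its Desktop service.
    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    ctx = resolver.resolve(UNO_URL)
    desktop = ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob("*.doc")):
        dst = target_path(src, out_dir)
        # Load the Word file without showing a window.
        doc = desktop.loadComponentFromURL(
            uno.systemPathToFileUrl(str(src.resolve())), "_blank", 0,
            (prop("Hidden", True),))
        try:
            # Write a copy in OpenDocument format; the original is untouched.
            doc.storeToURL(uno.systemPathToFileUrl(str(dst.resolve())),
                           (prop("FilterName", ODT_FILTER),))
        finally:
            doc.close(True)


if __name__ == "__main__":
    # Hypothetical directory names; adjust to your archive.
    convert_all("word_archive", "odt_output")
```

Since OpenOffice can crash on some documents, for a huge archive it is probably worth wrapping the per-file step in error handling and logging the files that fail, so the run can be resumed.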

Tim C

