Daniel L. Johnson wrote:
> Dear All,
> 
> Anybody here know of a tool to convert MicroSoft Word files to XML or
> HTML?  We have a huge archive of Word files...

What sort of XML? MS Word 2003 can save its documents as XML, but the
schema used is proprietary.

As Ignacio said, MS Word can save as HTML, but the resulting HTML files
are full of proprietary Microsoft extensions to HTML. MS Word 2002 and
later offer a choice to save as "filtered HTML", which is a bit cleaner
but still horrible.

The best way to convert MS Word files to an open, standards-based XML
format is to use a beta version of the forthcoming OpenOffice 2.0 (see
http://www.openoffice.org/ ). The beta versions work fine, and will save
to the OASIS OpenDocument XML standard (see
http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office ).
Actually, I think OpenOffice 1.1.4 also allows you to save to
OpenDocument format, but the OpenOffice 2.0 beta will do a better job of
importing complex MS Word documents (especially if they have nested
tables).

It should be easy to write a macro to automate the conversion, or you
can drive OpenOffice from a Python script via PyUNO if you are keen.
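
For the PyUNO route, here is a rough sketch of what such a script might
look like. Treat it as a starting point under some assumptions, not as
something tested against the 2.0 beta: it assumes OpenOffice is already
running and listening on localhost port 2002 (started with something
like soffice "-accept=socket,host=localhost,port=2002;urp;"), that the
script runs under the Python that ships with OpenOffice so that "uno"
can be imported, and that the OpenDocument text export filter is named
"writer8". The function names odt_path_for and convert_to_odt are made
up for this example.

```python
import os
import sys


def odt_path_for(doc_path):
    """Return the output path: same location and base name, .odt extension."""
    base, _ext = os.path.splitext(doc_path)
    return base + ".odt"


def convert_to_odt(doc_path, host="localhost", port=2002):
    """Load a Word file into a running OpenOffice and store it as OpenDocument.

    Assumes an office instance is already listening on host:port.
    """
    # Imported here so the path helper above works without OpenOffice installed.
    import uno
    from com.sun.star.beans import PropertyValue

    def prop(name, value):
        # UNO calls take their options as a sequence of PropertyValue structs.
        p = PropertyValue()
        p.Name = name
        p.Value = value
        return p

    # Connect to the running office and get hold of its desktop service.
    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    ctx = resolver.resolve(
        "uno:socket,host=%s,port=%d;urp;StarOffice.ComponentContext"
        % (host, port))
    desktop = ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)

    in_url = uno.systemPathToFileUrl(os.path.abspath(doc_path))
    out_path = odt_path_for(doc_path)
    out_url = uno.systemPathToFileUrl(os.path.abspath(out_path))

    # Load without showing a window, then export a copy through the filter.
    doc = desktop.loadComponentFromURL(
        in_url, "_blank", 0, (prop("Hidden", True),))
    try:
        # "writer8" is assumed to be the OpenDocument text filter name.
        doc.storeToURL(out_url, (prop("FilterName", "writer8"),))
    finally:
        doc.close(True)
    return out_path


if __name__ == "__main__":
    # Usage: python convert.py file1.doc file2.doc ...
    for path in sys.argv[1:]:
        print(convert_to_odt(path))
```

For a huge archive you would probably want to walk the directory tree
and log any files that fail to load rather than stopping at the first
error, but the load-then-storeToURL pattern stays the same.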

Tim C
