Daniel L. Johnson wrote:
> Dear All,
>
> Anybody here know of a tool to convert MicroSoft Word files to XML or
> HTML? We have a huge archive of Word files...
What sort of XML? MS Word 2003 can save its documents as XML, but the schema used is proprietary. As Ignacio said, MS Word can save as HTML, but the resulting HTML files are full of proprietary Microsoft extensions to HTML. MS Word 2002 and later offer a choice to save as "filtered HTML", which is a bit cleaner, but still horrible.

The best way to convert MS Word files to an open, standards-based XML format is to use a beta version of the forthcoming OpenOffice 2.0 (see http://www.openoffice.org/ ). The beta versions work fine, and will save to the OASIS OpenDocument XML standard (see http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office ). Actually, I think OpenOffice 1.1.4 also allows you to save to OpenDocument format, but the OpenOffice 2.0 beta will do a better job of importing complex MS Word documents (especially if they have nested tables).

It should be easy to write a macro to automate the conversion, or you can drive OpenOffice from a Python script via PyUNO if you are keen.

Tim C
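For the PyUNO route, here is a minimal sketch of what such a script might look like. It is not tested against the 2.0 beta. It assumes a Python 3 interpreter that can import the uno module shipped with the office suite, and an office instance already started and listening on a socket (something like soffice -accept="socket,host=localhost,port=2002;urp;", where the port number is my arbitrary choice). The function names odt_target, file_url and convert_all are made up for illustration, and "writer8" is the export filter name I believe is used for OpenDocument text.

```python
from pathlib import Path


def odt_target(doc_path, out_dir):
    """Return the .odt path in out_dir that corresponds to a Word file."""
    return Path(out_dir) / (Path(doc_path).stem + ".odt")


def file_url(path):
    """UNO addresses documents by file:// URL rather than by plain path."""
    return Path(path).resolve().as_uri()


def convert_all(src_dir, out_dir, host="localhost", port=2002):
    """Convert every *.doc in src_dir to OpenDocument text in out_dir."""
    # Imported here so the helpers above can be used without an office install.
    import uno
    from com.sun.star.beans import PropertyValue

    def prop(name, value):
        p = PropertyValue()
        p.Name = name
        p.Value = value
        return p

    # Connect to the already-running office instance (assumed host and port).
    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    ctx = resolver.resolve(
        "uno:socket,host=%s,port=%d;urp;StarOffice.ComponentContext"
        % (host, port))
    desktop = ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for doc in sorted(Path(src_dir).glob("*.doc")):
        # Load the Word file without showing a window.
        document = desktop.loadComponentFromURL(
            file_url(doc), "_blank", 0, (prop("Hidden", True),))
        try:
            # Write a copy in OpenDocument format; the original is untouched.
            document.storeToURL(
                file_url(odt_target(doc, out_dir)),
                (prop("FilterName", "writer8"),))
        finally:
            document.close(True)
```

You would call it as convert_all("/path/to/word/archive", "/path/to/output"), with both directories being placeholders for your own. For a huge archive you would probably also want to catch and log failures per file so that one bad document does not stop the whole run.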
