Re: M$oft Word to XML or HTML conversion
On Wed, 2005-03-30 at 00:32, Dean Hopstein wrote:
> I've been using antiword on Linux for some time with great success, if you only need to carry basic formatting structures over into HTML.

Yes. I suspect that when converting a large corpus of data like that, one would prefer to convert the hidden fields in the Word files into metadata along the lines of the Dublin Core, or something more specialised and particular to the establishment, if the target is XML. I may be wrong, but I can imagine someone saying "give me all the documents from 1999 written by Dr Smith" (of course, whether Dr Smith's letters are actually identified by any computable field in a Word document is another matter).

Handling correspondence, and moving correspondence from one place to another after the patient, is a topic worth considering, and one for which a FLOSS approach may be viable.
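To make the metadata idea a little more concrete, here is a minimal Python sketch that writes a Dublin Core style record and then answers the "1999, Dr Smith" kind of query. It is only an illustration: the `props` dictionaries and their keys (`author`, `created`, `title`) are hypothetical stand-ins for whatever hidden fields can really be pulled out of the Word files, and the wrapping `metadata` element is an arbitrary choice. Only the Dublin Core namespace URI and the `creator`, `date` and `title` element names are standard.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # standard Dublin Core element set
ET.register_namespace("dc", DC_NS)


def dublin_core_record(props):
    """Build a small Dublin Core XML record from extracted document properties.

    `props` is a hypothetical dict (keys: author, created, title); how it is
    filled from a Word file's hidden fields is deliberately left open.
    """
    root = ET.Element("metadata")
    for dc_name, key in (("creator", "author"), ("date", "created"), ("title", "title")):
        value = props.get(key)
        if value:  # skip fields the Word file did not carry
            ET.SubElement(root, f"{{{DC_NS}}}{dc_name}").text = value
    return root


def matches(record, creator, year):
    """True if the record names `creator` and its dc:date starts with `year`."""
    who = record.findtext(f"{{{DC_NS}}}creator", default="")
    when = record.findtext(f"{{{DC_NS}}}date", default="")
    return who == creator and when.startswith(str(year))


if __name__ == "__main__":
    docs = [
        {"author": "Dr Smith", "created": "1999-04-12", "title": "Referral letter"},
        {"author": "Dr Jones", "created": "1999-06-01", "title": "Discharge summary"},
    ]
    records = [dublin_core_record(d) for d in docs]
    hits = [r for r in records if matches(r, "Dr Smith", 1999)]
    for hit in hits:
        print(ET.tostring(hit, encoding="unicode"))
```

Whether this works in practice depends entirely on the caveat above: if the author field of the Word files just holds the typist's login or a template name, the query will find nothing useful.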
Re: M$oft Word to XML or HTML conversion
I've been using antiword on Linux for some time with great success, if you only need to carry basic formatting structures over into HTML.

On Thu, 2005-03-17 at 13:13 +, Adrian Midgley wrote:
> On Thu, 2005-03-17 at 02:12, Daniel L. Johnson wrote:
> > Anybody here know of a tool to convert MicroSoft Word files to XML or HTML? We have a huge archive of Word files...
>
> A firm called Graphnet have a set of tools for converting Word to XML. They did the Oswestry and Winchester hospital projects in the UK. 20 000 sounds to me as though sub-contracting may be in the frame...

--
Dean Hopstein
Senior Interface Analyst, Intranet Coordinator/Developer
Hattiesburg Clinic
RE: M$oft Word to XML or HTML conversion
Hi,

Tim has a point with OpenOffice 2, but be aware that the beta version is buggy (I got tired of it bombing out on me and removed it until a more stable version is available). In particular, I found it nearly impossible to open large files (I have lots of Excel pivot table files in the 50-300MB range and some large Word files with embedded data). Complex Word files (graphics, tables, etc.) would often come out funny.

So if you use that kind of tool in batch, I would make sure I twin every XML version with the original Word file, so that users can easily go back to the original if they find the converted version messed up. With thousands of files converted in batch mode, assume that some of them won't be looked at by a sober human for maybe 10 or 15 years.

Best regards
Calle

-----Original Message-----
From: Tim Churches [mailto:[EMAIL PROTECTED]]
Sent: 16 March 2005 06:49 PM
To: openhealth-list@minoru-development.com
Subject: Re: M$oft Word to XML or HTML conversion

Daniel L. Johnson wrote:
> Dear All, Anybody here know of a tool to convert MicroSoft Word files to XML or HTML? We have a huge archive of Word files...

What sort of XML? MS Word can save its documents as XML - but the DTD used is proprietary. As Ignacio said, MS Word can save as HTML, but the resulting HTML files are full of proprietary Microsoft extensions to HTML. MS Word 2002 and later offer a choice to save as "filtered HTML", which is a bit cleaner, but still horrible.

The best way to convert MS Word files to an open, standards-based XML format is to use a beta version of the forthcoming OpenOffice 2.0 - see http://www.openoffice.org/ The beta versions work fine, and will save to the OASIS OpenDocument XML standards (see http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office ). Actually, I think OpenOffice 1.1.4 also allows you to save to OpenDocument format, but the OpenOffice 2.0 beta will do a better job at importing complex MS Word documents (especially if they have nested tables).

It should be easy to write a macro to automate the conversion, or you can drive OpenOffice from a Python script via PyUNO if you are keen.

Tim C
Re: M$oft Word to XML or HTML conversion
Calle Hedberg wrote:
> Hi, Tim has a point with OpenOffice 2, but be aware that the beta version is buggy (I got tired of it bombing out on me and removed it until a more stable version is available). In particular, I found it nearly impossible to open large files (I have lots of Excel pivot table files in the 50-300MB range and some large Word files with embedded data). Complex Word files (graphics, tables, etc.) would often come out funny.

A 300MB spreadsheet... shudder! I must admit that I haven't used the OpenOffice 2 beta very much, which is perhaps why I haven't encountered a crash, and any Word files I convert tend to be fairly simple.

> So if you use that kind of tool in batch, I would make sure I twin every XML version with the original Word file, so that users can easily go back to the original if they find the converted version messed up. With thousands of files converted in batch mode, assume that some of them won't be looked at by a sober human for maybe 10 or 15 years.

Perhaps twin the XML with a PDF of the original Word file, since you don't want those sober humans in 10 or 15 years' time to have to mortgage their house to buy an annual license for Microsoft Office Longhorn XXXP 2020, which they then have to install their computer onto (by 2020, computer hardware is very cheap, but proprietary software is very expensive - due to its tiny market share - so you install special-purpose hardware onto the software in order to run it, not vice versa as we do now...).

Tim C
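The twinning advice in this exchange (keep every converted file paired with its original, and possibly a PDF snapshot) could be recorded in a simple manifest. The Python sketch below is one possible shape for that, not something anyone in the thread described: the JSON layout, the function names and the use of SHA-256 checksums are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 digest of a file, read in chunks to cope with large files."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def twin_entry(original, converted, snapshot=None):
    """Describe one original/converted pair, plus an optional PDF snapshot."""
    entry = {
        "original": {"path": str(original), "sha256": sha256_of(original)},
        "converted": {"path": str(converted), "sha256": sha256_of(converted)},
    }
    if snapshot is not None:
        entry["snapshot"] = {"path": str(snapshot), "sha256": sha256_of(snapshot)}
    return entry


def write_manifest(entries, manifest_path):
    """Write all twin entries to one JSON file kept alongside the archive."""
    Path(manifest_path).write_text(json.dumps(entries, indent=2), encoding="utf-8")
```

The checksums are there so that someone opening a converted file years later can confirm that the original it points to is still the same file that was converted.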
Re: M$oft Word to XML or HTML conversion
Daniel L. Johnson wrote:
> Dear All, Anybody here know of a tool to convert MicroSoft Word files to XML or HTML? We have a huge archive of Word files... Thanks, Dan Johnson md

antiword does a reasonable job in batch mode (available via Cygwin and other sources). Then there are a couple of txt2html variants; the ones I've tried do a reasonable job.
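A batch run along these lines might look like the Python sketch below. It assumes `antiword` is installed and on the PATH and relies only on its basic behaviour of writing the extracted text to standard output. The `text_to_html` function is a deliberately crude stand-in for a real txt2html tool, and all function names here are made up for the example.

```python
import html
import subprocess
from pathlib import Path


def run_antiword(doc_path):
    """Return the plain text antiword writes to stdout for one .doc file."""
    result = subprocess.run(
        ["antiword", str(doc_path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def text_to_html(text, title):
    """Very plain stand-in for a txt2html step: escape and wrap in <pre>."""
    return (
        "<html><head><title>{0}</title></head>\n"
        "<body><pre>{1}</pre></body></html>\n"
    ).format(html.escape(title), html.escape(text))


def convert_tree(src_dir, out_dir, extract=run_antiword):
    """Convert every .doc under src_dir; return (converted, failed) path lists."""
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    converted, failed = [], []
    for doc in sorted(src_dir.rglob("*.doc")):
        target = (out_dir / doc.relative_to(src_dir)).with_suffix(".html")
        try:
            text = extract(doc)
        except (OSError, subprocess.CalledProcessError):
            failed.append(doc)  # keep going; review the failures afterwards
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text_to_html(text, doc.stem), encoding="utf-8")
        converted.append(target)
    return converted, failed
```

Collecting failures instead of stopping matters for an archive of this size, since a handful of unreadable files should not abort a run over tens of thousands. Note that `rglob("*.doc")` is case-sensitive on most Linux filesystems, so files named `.DOC` would need a second pattern.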
Re: M$oft Word to XML or HTML conversion
On Thu, 2005-03-17 at 02:12, Daniel L. Johnson wrote:
> Anybody here know of a tool to convert MicroSoft Word files to XML or HTML? We have a huge archive of Word files...

A firm called Graphnet have a set of tools for converting Word to XML. They did the Oswestry and Winchester hospital projects in the UK. 20 000 sounds to me as though sub-contracting may be in the frame...

--
Adrian Midgley
FLOSS regularly
M$oft Word to XML or HTML conversion
Dear All, Anybody here know of a tool to convert MicroSoft Word files to XML or HTML? We have a huge archive of Word files... Thanks, Dan Johnson md
Re: M$oft Word to XML or HTML conversion
On 16 Mar 2005 20:12:49 -0600 Daniel L. Johnson [EMAIL PROTECTED] wrote:
> Dear All, Anybody here know of a tool to convert MicroSoft Word files to XML or HTML? We have a huge archive of Word files... Thanks, Dan Johnson md

Can't you Save As in Word to HTML format?

-- IV
Re: M$oft Word to XML or HTML conversion
On Wed, 2005-03-16 at 20:23, Ignacio Valdes wrote:
> On 16 Mar 2005 20:12:49 -0600 Daniel L. Johnson [EMAIL PROTECTED] wrote:
> > Dear All, Anybody here know of a tool to convert MicroSoft Word files to XML or HTML? We have a huge archive of Word files... Thanks, Dan Johnson md
>
> Can't you Save As in Word to HTML format?

That's not the question. We have tens of thousands of .doc files that we need to convert in batch or on the fly to XML or HTML.
Re: M$oft Word to XML or HTML conversion
Daniel L. Johnson wrote:
> Dear All, Anybody here know of a tool to convert MicroSoft Word files to XML or HTML? We have a huge archive of Word files...

What sort of XML? MS Word can save its documents as XML - but the DTD used is proprietary. As Ignacio said, MS Word can save as HTML, but the resulting HTML files are full of proprietary Microsoft extensions to HTML. MS Word 2002 and later offer a choice to save as "filtered HTML", which is a bit cleaner, but still horrible.

The best way to convert MS Word files to an open, standards-based XML format is to use a beta version of the forthcoming OpenOffice 2.0 - see http://www.openoffice.org/ The beta versions work fine, and will save to the OASIS OpenDocument XML standards (see http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office ). Actually, I think OpenOffice 1.1.4 also allows you to save to OpenDocument format, but the OpenOffice 2.0 beta will do a better job at importing complex MS Word documents (especially if they have nested tables).

It should be easy to write a macro to automate the conversion, or you can drive OpenOffice from a Python script via PyUNO if you are keen.

Tim C
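For the PyUNO route, a rough sketch of the usual pattern follows. It is untested against the 2.0 beta and makes several assumptions: that OpenOffice is already running and listening on a socket (port 2002 here is an arbitrary choice), that the script runs under a Python that can import `uno`, and that `writer8` is the export filter name for OpenDocument Text in the installed version. The helper names are invented for the example.

```python
from pathlib import Path

# Assumed connection string: OpenOffice must already be running and listening,
# e.g. started with  -accept="socket,host=localhost,port=2002;urp;"
UNO_URL = "uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext"
ODT_FILTER = "writer8"  # assumed export filter name for OpenDocument Text


def output_path(doc_path, out_dir, suffix=".odt"):
    """Map an input .doc path to its converted name inside out_dir."""
    return Path(out_dir) / (Path(doc_path).stem + suffix)


def connect(uno_url=UNO_URL):
    """Return the Desktop service of the running OpenOffice instance."""
    import uno  # ships with OpenOffice; imported lazily on purpose
    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    ctx = resolver.resolve(uno_url)
    return ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)


def _prop(name, value):
    """Build one com.sun.star.beans.PropertyValue."""
    import uno  # noqa: F401  (needed so the com.sun.star import hook exists)
    from com.sun.star.beans import PropertyValue
    prop = PropertyValue()
    prop.Name, prop.Value = name, value
    return prop


def convert(desktop, doc_path, out_dir, filter_name=ODT_FILTER):
    """Load one Word file hidden, store it via the given filter, close it."""
    src = Path(doc_path).resolve()
    dst = output_path(src, Path(out_dir).resolve())
    document = desktop.loadComponentFromURL(
        src.as_uri(), "_blank", 0, (_prop("Hidden", True),))
    try:
        document.storeToURL(dst.as_uri(), (_prop("FilterName", filter_name),))
    finally:
        document.close(True)
    return dst
```

A batch run would call `connect()` once and then `convert()` per file, ideally wrapping each call in its own error handling, given the crash reports elsewhere in this thread. This sketch writes all output into one flat directory, so files with the same name in different source folders would overwrite each other.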
Re: M$oft Word to XML or HTML conversion
I think Dan is talking about documents from older versions of Word, which are pre-XML if I'm not mistaken. There are Word to HTML converters for Linux, but the OO approach sounds like the best choice, especially with the Python script capability.

One thing to note is that OO 2 still doesn't do tables perfectly: if you have text that flows down a column and over to another page, you can run into problems. It has sort of been fixed, but it crashes quite a bit for me.

Joseph
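Before choosing a converter, it may help to check what the archive actually contains, since a `.doc` extension does not guarantee the older binary format. The Python sketch below is a simple triage helper, offered as an illustration only. It looks at the first bytes of each file: the OLE2 compound-file signature indicates the binary container used by older Word versions (and also by other Office files), `<?xml` indicates some XML document (not necessarily Word's), and `{\rtf` indicates RTF. The function names and category labels are made up for the example.

```python
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # OLE2 compound file signature


def sniff_word_format(head):
    """Classify a file from its first bytes: 'ole2', 'xml', 'rtf' or 'unknown'."""
    if head.startswith(OLE2_MAGIC):
        return "ole2"  # binary container used by pre-XML Word documents
    stripped = head.lstrip(b"\xef\xbb\xbf \t\r\n")  # skip UTF-8 BOM, whitespace
    if stripped.startswith(b"<?xml"):
        return "xml"
    if stripped.startswith(b"{\\rtf"):
        return "rtf"
    return "unknown"


def sniff_file(path):
    """Read the first bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff_word_format(handle.read(64))
```

Counting the categories across the archive gives a quick idea of how much of it really is old binary Word, and the "unknown" pile is worth a manual look before any batch run.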