Re: M$oft Word to XML or HTML conversion

2005-03-30 Thread Adrian Midgley
On Wed, 2005-03-30 at 00:32, Dean Hopstein wrote:
 I've been using antiword on linux for some time with great success if
 you're only needing to maintain basic format structures into HTML.

Yes.

I suspect that when converting a large corpus of data like that, one
would prefer to convert the hidden fields in the Word files into
metadata along the lines of the Dublin Core or something more
specialised and particular to the establishment, if it is in XML.

I may be wrong, but I can imagine someone saying "give me all the
documents from 1999 written by Dr Smith" (of course, whether Dr Smith's
letters are actually identified by any computable field in a Word
document is another matter).
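
As a rough illustration of the metadata idea, here is a minimal Python
sketch. It assumes the Word properties have already been pulled out into a
plain dict by some other tool; the WORD_TO_DC mapping, the <metadata>
wrapper element and both function names are hypothetical, and only the
Dublin Core namespace URI is standard.

```python
import xml.etree.ElementTree as ET

# Standard namespace of the Dublin Core Metadata Element Set 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical mapping from Word property names to Dublin Core elements.
WORD_TO_DC = {
    "title": "title",
    "author": "creator",
    "created": "date",
    "subject": "subject",
}


def dublin_core_record(word_props):
    """Build a <metadata> element with one dc:* child per known property."""
    root = ET.Element("metadata")
    for word_name, dc_name in WORD_TO_DC.items():
        value = word_props.get(word_name)
        if value:  # skip properties that are absent or empty
            child = ET.SubElement(root, f"{{{DC_NS}}}{dc_name}")
            child.text = str(value)
    return root


def written_by_in_year(records, creator, year):
    """Return records whose dc:creator matches and whose dc:date starts with year."""
    hits = []
    for rec in records:
        who = rec.findtext(f"{{{DC_NS}}}creator")
        when = rec.findtext(f"{{{DC_NS}}}date") or ""
        if who == creator and when.startswith(str(year)):
            hits.append(rec)
    return hits
```

A query such as written_by_in_year(records, "Dr Smith", 1999) only works
if the author field in the source documents was filled in reliably, which
is exactly the caveat above.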

Handling correspondence, and moving correspondence from one place to
another after the patient, is a topic worth considering, and one for
which a FLOSS approach may be viable.




Re: M$oft Word to XML or HTML conversion

2005-03-29 Thread Dean Hopstein
I've been using antiword on linux for some time with great success if
you're only needing to maintain basic format structures into HTML.

On Thu, 2005-03-17 at 13:13 +, Adrian Midgley wrote:
 On Thu, 2005-03-17 at 02:12, Daniel L. Johnson wrote:
 
  Anybody here know of a tool to convert MicroSoft Word files to XML or
  HTML?  We have a huge archive of Word files...
 
 A firm called Graphnet have a set of tools for converting Word to XML.
 
 They did the Oswestry and Winchester hospital projects in the UK.
 
  20 000 sounds to me as though sub-contracting may be in the frame...
-- 
Dean Hopstein
Senior Interface Analyst
Intranet Coordinator/Developer
Hattiesburg Clinic
415 South 28th Avenue
Hattiesburg, MS 39401



RE: M$oft Word to XML or HTML conversion

2005-03-19 Thread Calle Hedberg
Hi,

Tim has a point with OpenOffice 2, but be aware that the beta version is
buggy (I got tired of it bombing out on me and removed it until a more
stable version is available). In particular, I found it nearly impossible to
open large files (I have lots of Excel pivot table files in the 50-300MB
range and some large Word files with embedded data). Complex Word files
(graphics/tables/etc) would often come out funny.

So if you use that kind of tool in batch, I would make sure I twin every
XML version with the original Word file so that users easily can go back to
the original if they find the converted version messed up. With thousands of
files converted in batch mode, assume that some of them won't be looked at
by a sober human for maybe 10 or 15 years.
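
The twinning check can be sketched in a few lines of Python. This is only
an illustration: it assumes each converted file sits next to its original
with the same base name, and the function names and the ".xml" default
suffix are hypothetical.

```python
from pathlib import Path


def pair_conversions(archive_dir, converted_suffix=".xml"):
    """Map each .doc file to its converted twin, or None if the twin is missing."""
    pairs = {}
    for doc in sorted(Path(archive_dir).rglob("*.doc")):
        twin = doc.with_suffix(converted_suffix)
        pairs[doc] = twin if twin.exists() else None
    return pairs


def missing_twins(archive_dir, converted_suffix=".xml"):
    """List the originals that have no converted file beside them."""
    pairs = pair_conversions(archive_dir, converted_suffix)
    return [doc for doc, twin in pairs.items() if twin is None]
```

Running missing_twins() after a batch run gives a list of originals that
still need attention; passing converted_suffix=".pdf" would check a
different kind of twin in the same way.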

Best regards
Calle


 -Original Message-
 From: Tim Churches [mailto:[EMAIL PROTECTED] 
 Sent: 16 March 2005 06:49 PM
 To: openhealth-list@minoru-development.com
 Subject: Re: M$oft Word to XML or HTML conversion
 
 Daniel L. Johnson wrote:
  Dear All,
  
  Anybody here know of a tool to convert MicroSoft Word files 
 to XML or 
  HTML?  We have a huge archive of Word files...
 
 What sort of XML? MS-Word saves its documents as XML - but
 the DTD used is proprietary.
 
 As Ignacio said, MS Word can save as HTML, but the resulting
 HTML files are full of proprietary Microsoft extensions to
 HTML. MS-Word 2002 and later offer a choice to save as
 filtered HTML which is a bit cleaner, but still horrible.
 
 The best way to convert MS-Word files to an open 
 standards-based XML format is to use a beta version of the 
 forthcoming OpenOffice 2.0 - see http://www.openoffice.org/  
 The beta versions work fine, and will save to the OASIS 
 OpenDocument XML standards (see 
 http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office ).
 Actually, I think OpenOffice 1.1.4 also allows you to save to
 OpenDocument format, but the OpenOffice 2.0 beta will do a 
 better job at importing complex MS-Word documents (especially 
 if they have nested tables).
 
 It should be easy to write a macro to automate the 
 conversion, or you can drive OpenOffice from a Python script 
 via PyUNO if you are keen.
 
 Tim C
 
 
 




Re: M$oft Word to XML or HTML conversion

2005-03-19 Thread Tim Churches
Calle Hedberg wrote:
 Hi,
 
 Tim has a point with OpenOffice 2, but be aware that the beta version is
 buggy (I got tired of it bombing out on me and removed it until a more
 stable version is available). In particular, I found it nearly impossible to
 open large files (I have lots of Excel pivot table files in the 50-300MB
 range and some large Word files with embedded data). Complex Word files
 (graphics/tables/etc) would often come out funny.

A 300MB spreadsheet...shudder! I must admit that I haven't used
OpenOffice 2 beta very much, which is perhaps why I haven't encountered a
crash, and any Word files I convert tend to be fairly simple.

 So if you use that kind of tool in batch, I would make sure I twin every
 XML version with the original Word file so that users easily can go back to
 the original if they find the converted version messed up. With thousands of
 files converted in batch mode, assume that some of them won't be looked at
 by a sober human for maybe 10 or 15 years.

Perhaps twin the XML with a PDF of the original Word file, since you
don't want those sober humans in 10 or 15 years' time to have to mortgage
their house to buy an annual license for Microsoft Office Longhorn XXXP
2020, onto which they then have to install their computer (by 2020,
computer hardware is very cheap, but proprietary software is very
expensive - due to its tiny market share - so you install special
purpose hardware onto the software in order to run it, not vice versa as
we do now...).

Tim C




Re: M$oft Word to XML or HTML conversion

2005-03-17 Thread Heitzso
Daniel L. Johnson wrote:
Dear All,
Anybody here know of a tool to convert MicroSoft Word files to XML or
HTML?  We have a huge archive of Word files...
Thanks,
Dan Johnson md

antiword does a reasonable job in batch mode (cygwin, other sources).
Then there are a couple of txt2html variants; the ones I've tried do
a reasonable job.
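
A batch run along these lines might look like the Python sketch below. It
assumes antiword is installed and on the PATH, and it uses antiword's
"-x db" option (DocBook XML on standard output) instead of the plain text
plus txt2html route; the function names and the choice to write each
result next to its original are hypothetical.

```python
import subprocess
from pathlib import Path


def antiword_command(doc_path, docbook=True):
    """Build the antiword command line for one document."""
    cmd = ["antiword"]
    if docbook:
        cmd += ["-x", "db"]  # DocBook XML instead of plain text
    cmd.append(str(doc_path))
    return cmd


def convert_archive(archive_dir):
    """Convert every .doc under archive_dir; return the files that failed."""
    failures = []
    for doc in sorted(Path(archive_dir).rglob("*.doc")):
        # antiword writes the converted document to standard output.
        result = subprocess.run(antiword_command(doc), capture_output=True)
        if result.returncode != 0:
            failures.append(doc)
            continue
        doc.with_suffix(".xml").write_bytes(result.stdout)
    return failures
```

The list returned by convert_archive() is worth keeping, since with tens
of thousands of files some are bound to fail or convert badly.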


Re: M$oft Word to XML or HTML conversion

2005-03-17 Thread Adrian Midgley
On Thu, 2005-03-17 at 02:12, Daniel L. Johnson wrote:

 Anybody here know of a tool to convert MicroSoft Word files to XML or
 HTML?  We have a huge archive of Word files...

A firm called Graphnet have a set of tools for converting Word to XML.

They did the Oswestry and Winchester hospital projects in the UK.

 20 000 sounds to me as though sub-contracting may be in the frame...
-- 
Adrian Midgley    FLOSS regularly



M$oft Word to XML or HTML conversion

2005-03-16 Thread Daniel L. Johnson
Dear All,

Anybody here know of a tool to convert MicroSoft Word files to XML or
HTML?  We have a huge archive of Word files...

Thanks,

Dan Johnson md




Re: M$oft Word to XML or HTML conversion

2005-03-16 Thread Ignacio Valdes
On 16 Mar 2005 20:12:49 -0600
 Daniel L. Johnson [EMAIL PROTECTED] wrote:
Dear All,
Anybody here know of a tool to convert MicroSoft Word files to XML 
or
HTML?  We have a huge archive of Word files...

Thanks,
Dan Johnson md
Can't you Save As in Word to html format?
-- IV


Re: M$oft Word to XML or HTML conversion

2005-03-16 Thread Daniel L. Johnson
On Wed, 2005-03-16 at 20:23, Ignacio Valdes wrote:
 On 16 Mar 2005 20:12:49 -0600
   Daniel L. Johnson [EMAIL PROTECTED] wrote:
  Dear All,
  
  Anybody here know of a tool to convert MicroSoft Word files to XML 
 or
  HTML?  We have a huge archive of Word files...
  
  Thanks,
  
  Dan Johnson md
  
 Can't you Save As in Word to html format?

That's not the question.  We have tens of thousands of .doc files that
we need to convert in batch or on the fly to XML or HTML.



Re: M$oft Word to XML or HTML conversion

2005-03-16 Thread Tim Churches
Daniel L. Johnson wrote:
 Dear All,
 
 Anybody here know of a tool to convert MicroSoft Word files to XML or
 HTML?  We have a huge archive of Word files...

What sort of XML? MS-Word saves its documents as XML - but the DTD used
is proprietary.

As Ignacio said, MS Word can save as HTML, but the resulting HTML files
are full of proprietary Microsoft extensions to HTML. MS-Word 2002 and
later offer a choice to save as filtered HTML which is a bit cleaner,
but still horrible.

The best way to convert MS-Word files to an open standards-based XML
format is to use a beta version of the forthcoming OpenOffice 2.0 - see
http://www.openoffice.org/  The beta versions work fine, and will save
to the OASIS OpenDocument XML standards (see
http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office ).
Actually, I think OpenOffice 1.1.4 also allows you to save to
OpenDocument format, but the OpenOffice 2.0 beta will do a better job at
importing complex MS-Word documents (especially if they have nested
tables).

It should be easy to write a macro to automate the conversion, or you
can drive OpenOffice from a Python script via PyUNO if you are keen.
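
For the PyUNO route, a minimal sketch might look like the following. It
rests on several assumptions: an office instance is already running and
listening on a socket (started with an accept string such as
"socket,host=localhost,port=2002;urp;"), the script runs under the Python
that ships with the office suite so that the uno module is importable,
and the "writer8" filter name (OpenDocument Text) is available in the
installed version. The function names, host and port are illustrative.

```python
from pathlib import Path


def odt_target(doc_path):
    """Return the .odt path that sits next to the given Word file."""
    return Path(doc_path).with_suffix(".odt")


def convert_with_openoffice(doc_paths, host="localhost", port=2002):
    """Convert Word files to OpenDocument via a running office instance."""
    # Imported here because uno only exists in the office suite's Python.
    import uno
    from com.sun.star.beans import PropertyValue

    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    ctx = resolver.resolve(
        f"uno:socket,host={host},port={port};urp;StarOffice.ComponentContext")
    desktop = ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)

    hidden = PropertyValue()
    hidden.Name = "Hidden"  # load without opening a window
    hidden.Value = True
    odt_filter = PropertyValue()
    odt_filter.Name = "FilterName"
    odt_filter.Value = "writer8"  # assumed OpenDocument Text export filter

    failures = []
    for doc_path in doc_paths:
        src = uno.systemPathToFileUrl(str(Path(doc_path).resolve()))
        dst = uno.systemPathToFileUrl(str(odt_target(doc_path).resolve()))
        document = desktop.loadComponentFromURL(src, "_blank", 0, (hidden,))
        if document is None:  # the import filter could not open the file
            failures.append(doc_path)
            continue
        try:
            document.storeToURL(dst, (odt_filter,))
        finally:
            document.close(True)
    return failures
```

Keeping the failures list makes it possible to rerun or hand-check only
the documents the import filter could not open.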

Tim C



Re: M$oft Word to XML or HTML conversion

2005-03-16 Thread Joseph Dal Molin
I think Dan is talking about documents from older versions of Word, which
are pre-XML if I'm not mistaken. There are Word to HTML converters
for Linux, but the OO approach sounds like the best
choice, especially with the Python script capability. One thing to note
is that OO 2 still doesn't do tables perfectly: if you have text that
flows down a column and over to another page you can run into
problems. It has sort of been fixed but crashes quite a bit for me.

Joseph