Hello Avi,
That's a good point. The more consistent and well-structured the
document, the easier it is to extract meaningful information from it.
Of course, if styles are used everywhere, that is a big plus, but we all
know that authors don't always use styles.
Even when styles are missing, you can still define rules based on
formatting, content, examples, etc. For example, even though a style may
not have been used for captions, you may find that in this document most
captions are bold, size 9, except for a few that occur just after an image
and are actually size 10, and that all of the captions contain one of the
following terms: Figure X, Listing X, or Table X. For most documents you
can define rules that pick up 80%-90% of the formatting conventions.
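To make the caption example concrete, here is a minimal sketch of such a
rule in Python. It is only an illustration, not how our product expresses
rules: the Paragraph class, its field names, and is_caption are all
hypothetical, and I am assuming the size-10 exceptions are not required to
be bold.

```python
import re
from dataclasses import dataclass


@dataclass
class Paragraph:
    # Hypothetical, simplified view of one paragraph and its formatting.
    text: str
    bold: bool
    font_size: float
    follows_image: bool = False


# "Figure X", "Listing X" or "Table X", where X is a number.
CAPTION_TERM = re.compile(r"\b(Figure|Listing|Table)\s+\d+")


def is_caption(p: Paragraph) -> bool:
    """Content test plus formatting test, with one known exception."""
    has_term = CAPTION_TERM.search(p.text) is not None
    usual_format = p.bold and p.font_size == 9
    # Assumption: the few size-10 captions are recognized by position only.
    exception_format = p.follows_image and p.font_size == 10
    return has_term and (usual_format or exception_format)
```

A bold, size-9 paragraph reading "Figure 2: System overview" matches, while
body text that merely mentions "Table 4" in size 11 does not.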
We provide a visual interface for defining rules of arbitrary complexity
that can be used to identify formatting and then extract content. Because
we are starting sometimes from PDF files, sometimes from HTML files, and
sometimes from Word files, we aren't entirely dependent on styles, which may
or may not be used in a particular document.
Our process consists of:
-Normalization - we first produce a basic XML document using very simple
XML that can be used to reproduce the original document as well as for
parsing. It is much easier to parse this XML document than it is to parse a
typical HTML file or Word file.
-Interpretation - this paragraph is not just a paragraph - it is the
"TITLE" paragraph, or the "SUBTITLE" paragraph, or a section header, or a
caption, etc. This is where the rules related to formatting and placement
on the page come in.
-Extraction - pulling out information, such as product names, dates, phone
numbers, etc., from the middle of paragraphs.
-Arrangement - finally, arranging the meaningful elements into the hierarchy
required by the target XML DTD or schema. Some schemas are very
hierarchical while others are relatively flat.
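The four steps above can be sketched as a small Python pipeline. This is a
toy under stated assumptions, not our implementation: the function names,
the (text, bold, size) input tuples, the size threshold for titles, the
phone-number pattern, and the target element names are all made up for
illustration.

```python
import re
import xml.etree.ElementTree as ET

# Assumed phone format for the extraction step, e.g. "(617) 555-0100".
PHONE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")


def normalize(raw_paragraphs):
    # Step 1: reduce each source paragraph to text plus simple formatting.
    return [{"text": text.strip(), "bold": bold, "size": size}
            for text, bold, size in raw_paragraphs]


def interpret(paras):
    # Step 2: assign a role to each paragraph from its formatting.
    for p in paras:
        if p["bold"] and p["size"] >= 16:
            p["role"] = "title"
        elif p["bold"]:
            p["role"] = "heading"
        else:
            p["role"] = "para"
    return paras


def extract(paras):
    # Step 3: pull structured values out of the middle of the text.
    for p in paras:
        p["phones"] = PHONE.findall(p["text"])
    return paras


def arrange(paras):
    # Step 4: nest the labeled elements into the target hierarchy.
    doc = ET.Element("document")
    section = None
    for p in paras:
        if p["role"] == "title":
            ET.SubElement(doc, "title").text = p["text"]
        elif p["role"] == "heading":
            section = ET.SubElement(doc, "section")
            ET.SubElement(section, "heading").text = p["text"]
        else:
            parent = section if section is not None else doc
            para = ET.SubElement(parent, "para")
            para.text = p["text"]
            for phone in p["phones"]:
                ET.SubElement(para, "phone").text = phone
    return doc


def convert(raw_paragraphs):
    paras = extract(interpret(normalize(raw_paragraphs)))
    return ET.tostring(arrange(paras), encoding="unicode")
```

Calling convert() with a large bold paragraph, a smaller bold paragraph, and
a plain paragraph yields a document with a title and one section containing
a heading and a para. A flatter target schema would only need a different
arrange() step; the first three steps stay the same.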
The current release uses "precise" rules. An upcoming release will let you
define "imprecise", or "fuzzy", rules, which allow you to account for a
certain amount of vagueness in the definition of your rules to match the
vagueness that your authors may have had when they created the document.
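One common way to think about a fuzzy rule is as a weighted score with a
threshold rather than a strict yes/no test. The sketch below is my own
illustration of that general idea in Python, not a description of how the
upcoming release works; the weights, the one-point size tolerance, and the
function names are all assumptions.

```python
import re

CAPTION_TERM = re.compile(r"\b(Figure|Listing|Table)\s+\d+")


def caption_score(text, bold, font_size):
    """Add up partial evidence instead of requiring every condition."""
    score = 0.0
    if CAPTION_TERM.search(text):
        score += 0.5   # assumed weight: content evidence counts most
    if bold:
        score += 0.25  # assumed weight for the usual bold formatting
    if abs(font_size - 9) <= 1:
        score += 0.25  # assumed tolerance: within one point of size 9
    return score


def is_probable_caption(text, bold, font_size, threshold=0.75):
    # The threshold controls how much vagueness the rule tolerates.
    return caption_score(text, bold, font_size) >= threshold
```

With these weights, a non-bold, size-10 "Figure 3: Overview" still passes,
because two of the three signals are present.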
For more info, check out our white paper on converting unstructured content
into meaningful XML, at: http://www.cambridgedocs.com .
Hope this helps,
Thanks,
Riz
------------------------------
Riz Virk, (617) 905-3518
[EMAIL PROTECTED], [EMAIL PROTECTED]
http://www.cambridgedocs.com
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of [EMAIL PROTECTED]
Sent: Monday, November 18, 2002 1:06 AM
To: 'CMS List'
Subject: RE: [cms-list] MSWORD to XML (FrameMaker - .doc > .pdf >
.xml...?)
At 10:19 AM -0500 11/17/02, Rizwan Virk wrote:
>Hi Mike,
>
>Our company has a product that is currently in beta for converting files
>into meaningful (i.e. useful) XML. The product, the xDoc Converter, can
>convert Word and HTML files into various XML formats. We will add PDF to
>XML conversion capabilities very soon. For more information, check us out
>at http://www.cambridgedocs.com.
What do you do about vaguely structured text which doesn't use named
styles? There are so very many of them in Word... They do have a
title, they do have headings, but they're deeply inconsistent. I
know because I reformat them before I make any changes.
Avi
PS Converting to FrameMaker will not add named styles or structure to
an unorganized document. It's a great tech writing program but
cannot perform miracles.
--
Complete Guide to Search Engines for Web Sites and Intranets
<http://www.searchtools.com>
--
http://cms-list.org/
trim your replies for good karma.