Hello Hari,

Thanks for the question. The product has not been officially released
yet; the first release will be in December. We have a number of
organizations currently beta testing the software against real content
(we're always looking for more, if you have a real-life content scenario
to test against).

Thanks!
Riz



------------------------------
Riz Virk, (617) 905-3518
[EMAIL PROTECTED], [EMAIL PROTECTED]
http://www.cambridgedocs.com


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Hari M
Sent: Monday, November 18, 2002 11:11 AM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: [cms-list] MSWORD to XML (FrameMaker - .doc > .pdf >
.xml...?)



Rizwan,

Is the product ready or still under development?

Thanks,

Hari

>From: "Rizwan Virk"
>To: , "'CMS List'"
>Subject: RE: [cms-list] MSWORD to XML (FrameMaker - .doc > .pdf > .xml...?)
>Date: Mon, 18 Nov 2002 09:46:24 -0500

>Hello Avi,
>
>That's a good point. The more "consistent" and "well-structured" the
>document, the easier it is to extract meaningful information from it.
>Of course, if "styles" are used everywhere, this is a big plus, but we
>all know that authors don't always use styles.
>
>If they are not used, however, you can still define rules based on
>formatting, content, examples, etc. For example, even though a style may
>not have been used for captions, you can determine that in this document
>most captions are bold, size 9, except for a few that occur just after an
>image and are actually size 10, and that all of the captions contain one
>of the following terms: Figure X, Listing X, or Table X. For most
>documents you can define rules that pick up 80%-90% of the formatting
>conventions.
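
To make the caption example concrete, here is a minimal Python sketch of such a rule. It is not how the product expresses rules (the thread describes a visual interface); the `Paragraph` class, its fields, and the `is_caption` function are hypothetical names invented for illustration. It also assumes that "X" is a number and that the size-10 captions after an image are bold as well, which the message does not state.

```python
import re
from dataclasses import dataclass


@dataclass
class Paragraph:
    """Hypothetical view of one paragraph plus its formatting."""
    text: str
    bold: bool = False
    size: float = 12.0
    follows_image: bool = False


# Content rule: the text mentions "Figure N", "Listing N", or "Table N".
CAPTION_TERM = re.compile(r"\b(Figure|Listing|Table)\s+\d+")


def is_caption(p: Paragraph) -> bool:
    """Combine a content rule with two formatting rules."""
    has_term = CAPTION_TERM.search(p.text) is not None
    usual_format = p.bold and p.size == 9
    # Assumed exception: bold, size 10, directly after an image.
    after_image_format = p.bold and p.size == 10 and p.follows_image
    return has_term and (usual_format or after_image_format)


print(is_caption(Paragraph("Figure 3: Architecture", bold=True, size=9)))
```

A body sentence such as "See Figure 3 for details" in regular 12-point text fails the formatting test, so the content match alone is not enough to label it a caption.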

>We provide a visual interface for defining rules of arbitrary complexity
>that can be used to identify formatting and then extract content. Because
>we sometimes start from PDF files, sometimes from HTML files, and
>sometimes from Word files, we aren't entirely dependent on styles, which
>may or may not be used in a particular document.

>Our process consists of:
> -Normalization - we first produce a basic XML document, using very simple
>XML, that can be used both to reproduce the original document and for
>parsing. It is much easier to parse this XML document than it is to parse
>a typical HTML file or Word file.
>
> -Interpretation - this paragraph is not just a paragraph - it is the
>"TITLE" paragraph, or the "SUBTITLE" paragraph, or a section header, or a
>caption, etc. This is where the rules related to formatting and placement
>on the page come in.
>
> -Extraction - pulling out information, such as product names, dates, and
>phone numbers, from the middle of paragraphs.
>
> -Arrangement - finally, arranging the meaningful elements into the
>hierarchy required by the target XML DTD or schema. Some schemas are very
>hierarchical while others are relatively flat.
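
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the xDoc Converter's implementation: the normalized XML format (`<doc>`/`<p>` with `bold` and `size` attributes), the target elements (`article`, `section`, `heading`, `para`, `phone`), the size thresholds, and the sample text are all invented here. The normalization step itself is not shown; the sketch starts from an already normalized document.

```python
import re
import xml.etree.ElementTree as ET

# Assumed output of normalization: flat paragraphs with formatting attributes.
NORMALIZED = """
<doc>
  <p bold="true" size="18">Widget Pro User Guide</p>
  <p bold="true" size="14">Getting Started</p>
  <p size="11">For support, call (617) 555-0100.</p>
  <p bold="true" size="14">Installation</p>
  <p size="11">Run the installer and restart.</p>
</doc>
"""

PHONE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")


def interpret(p):
    """Interpretation: assign a role from formatting (illustrative thresholds)."""
    bold = p.get("bold") == "true"
    size = float(p.get("size", "11"))
    if bold and size >= 18:
        return "title"
    if bold and size >= 14:
        return "heading"
    return "para"


def extract(text):
    """Extraction: pull phone numbers out of the middle of a paragraph."""
    return PHONE.findall(text)


def arrange(normalized_xml):
    """Arrangement: nest each paragraph under the heading that precedes it."""
    source = ET.fromstring(normalized_xml)
    article = ET.Element("article")
    current = article  # paragraphs before any heading attach to the root
    for p in source.iter("p"):
        role = interpret(p)
        text = p.text or ""
        if role == "title":
            ET.SubElement(article, "title").text = text
        elif role == "heading":
            current = ET.SubElement(article, "section")
            ET.SubElement(current, "heading").text = text
        else:
            para = ET.SubElement(current, "para")
            para.text = text
            for number in extract(text):
                ET.SubElement(para, "phone").text = number
    return article


article = arrange(NORMALIZED)
print(ET.tostring(article, encoding="unicode"))
```

A flat target schema would skip the nesting in `arrange` and emit the labeled elements in document order instead.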

>The current release uses "precise" rules. An upcoming release will let you
>define "imprecise", or "fuzzy", rules, which allow for a certain amount of
>vagueness in the definition of your rules, to match the vagueness that
>your authors may have had when they created the document.
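
One common way to express a "fuzzy" rule is to score several pieces of evidence and accept a match above a threshold, rather than requiring every condition to hold. The message does not say how the product's fuzzy rules work, so the Python sketch below is only an assumption about what such a rule could look like; the `Paragraph` class, the weights, the one-point size tolerance, and the threshold are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Paragraph:
    """Hypothetical view of one paragraph plus its formatting."""
    text: str
    bold: bool = False
    size: float = 12.0


CAPTION_TERM = re.compile(r"\b(Figure|Listing|Table)\s+\d+")


def caption_score(p: Paragraph) -> float:
    """Add up weighted evidence that a paragraph is a caption."""
    score = 0.0
    if CAPTION_TERM.search(p.text):
        score += 0.5   # caption-like wording is the strongest signal
    if p.bold:
        score += 0.25
    if abs(p.size - 9) <= 1:
        score += 0.25  # tolerate sizes within one point of 9
    return score


def is_probably_caption(p: Paragraph, threshold: float = 0.75) -> bool:
    """Accept a paragraph when enough evidence adds up, even if some is missing."""
    return caption_score(p) >= threshold


print(is_probably_caption(Paragraph("Figure 1: Overview", size=10)))
```

With these weights, an author who forgot to bold a caption or used the wrong size is still caught, while a bold 9-point paragraph without a caption term is not.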

>For more info, check out our white paper on converting unstructured
>content into meaningful XML, at: http://www.cambridgedocs.com .
>
>Hope this helps,
>Thanks,
>Riz

>------------------------------
>Riz Virk, (617) 905-3518
>[EMAIL PROTECTED], [EMAIL PROTECTED]
>http://www.cambridgedocs.com
>
>-----Original Message-----
>From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
>Behalf Of [EMAIL PROTECTED]
>Sent: Monday, November 18, 2002 1:06 AM
>To: 'CMS List'
>Subject: RE: [cms-list] MSWORD to XML (FrameMaker - .doc > .pdf >
>.xml...?)

>At 10:19 AM -0500 11/17/02, Rizwan Virk wrote:
> >Hi Mike,
> >
> >Our company has a product, currently in beta, for converting files
> >into meaningful (i.e., useful) XML. The product, the xDoc Converter,
> >can convert Word and HTML files into various XML formats. We will add
> >PDF to XML conversion capabilities very soon. For more information,
> >check us out at http://www.cambridgedocs.com.
>
>What do you do about vaguely structured documents that don't use named
>styles? There are so very many of them in Word... They do have a
>title, they do have headings, but they're deeply inconsistent. I
>know because I reformat them before I make any changes.
>
>Avi
>
>PS Converting to FrameMaker will not add named styles or structure to
>an unorganized document. It's a great tech writing program but
>cannot perform miracles.
>
>--
>Complete Guide to Search Engines for Web Sites and Intranets

>--
>http://cms-list.org/
>trim your replies for good karma.




