Rizwan,

Is the product ready or still under development?

Thanks,

Hari

>From: "Rizwan Virk" 


>To: 
, "'CMS List'" 


>Subject: RE: [cms-list] MSWORD to XML (FrameMaker - .doc > .pdf > .xml...?) 

>Date: Mon, 18 Nov 2002 09:46:24 -0500 

> 

>Hello Avi, 

> 

>That's a good point. The more "consistent" and "well-structured" the 

>document - the easier it is to extract meaningful information out of the 

>document. Of course, if "styles" are used everywhere - then this is a big 

>plus - but we all know that authors don't always use styles. 

> 

>If not, however, you can still define rules based on formatting, based on 

>content, based on examples, etc. - for example even though a style may not 

>have been used for caption - you can find out that in this document most 

>captions are bold, size 9, except for a few, which occur just after an image 

>but are actually size 10 - and all of the captions have one of the following 

>terms in them: Figure X, Listing X, or Table X. For most 

>documents you can define rules that pick up 80%-90% of the formatting 

>conventions. 
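
A minimal sketch of the caption rule described above, in Python. The Paragraph fields and the is_caption function are hypothetical names for illustration, not the xDoc Converter's actual rule format. It also assumes that the size-10 captions following an image need not be bold, which the message does not state.

```python
import re
from dataclasses import dataclass


@dataclass
class Paragraph:
    # Hypothetical paragraph record; a real converter would fill these
    # fields from the Word, HTML, or PDF source.
    text: str
    bold: bool
    size: float
    follows_image: bool = False


# Content rule: the caption mentions "Figure N", "Listing N", or "Table N".
CAPTION_TERM = re.compile(r"\b(Figure|Listing|Table)\s+\d+")


def is_caption(p: Paragraph) -> bool:
    """Combine a content rule with two formatting rules."""
    has_term = CAPTION_TERM.search(p.text) is not None
    usual_format = p.bold and p.size == 9                    # most captions
    after_image_variant = p.follows_image and p.size == 10   # the exceptions
    return has_term and (usual_format or after_image_variant)
```

Calling is_caption(Paragraph("Figure 3: Overview", bold=True, size=9)) returns True, while an 11-point body paragraph that merely mentions a figure is rejected.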

> 

>We provide a visual interface for defining rules of arbitrary complexity 

>that can be used to identify formatting and then extract content. Because 

>we are starting sometimes from PDF files, sometimes from HTML files, and 

>sometimes from Word files, we aren't entirely dependent on styles, which may 

>or may not be used in a particular document. 

> 

>Our process consists of: 

> -Normalization - we first produce a basic XML document using very simple 

>XML that can be used to reproduce the original document as well as used for 

>parsing. It is much easier to parse this XML document than it is to parse a 

>typical HTML file or Word file. 

> 

> -Interpretation - this paragraph is not just a paragraph - it is the 

>"TITLE" paragraph, or the "SUBTITLE" paragraph, or a section header, or a 

>caption, etc. This is where the rules related to formatting and placement 

>on the page come in. 

> 

> -Extraction - pulling out information - such as product names, dates, phone 

>numbers, etc. out of the middle of paragraphs. 

> 

> -Arrangement - Finally arranging the meaningful elements into the hierarchy 

>required by the target XML DTD or schema. Some schemas are very 

>hierarchical while others are relatively flat. 
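
The four stages above can be sketched roughly as follows, in Python. Everything here is an assumption made for illustration: the function names, the font-size thresholds, the phone-number pattern, and the target element names (article, section, para, contacts) are invented and do not describe the actual product or any real DTD.

```python
import re
import xml.etree.ElementTree as ET


def normalize(raw_paragraphs):
    """Normalization: wrap each (text, font size) pair in very simple XML."""
    doc = ET.Element("doc")
    for text, size in raw_paragraphs:
        p = ET.SubElement(doc, "p", size=str(size))
        p.text = text
    return doc


def interpret(doc):
    """Interpretation: label each paragraph's role from its formatting."""
    for p in doc.iter("p"):
        size = int(p.get("size"))
        # Hypothetical thresholds; real rules would be defined per document.
        p.set("role", "title" if size >= 18 else "header" if size >= 14 else "body")
    return doc


PHONE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")


def extract(doc):
    """Extraction: pull phone numbers out of the middle of paragraphs."""
    return [m for p in doc.iter("p") for m in PHONE.findall(p.text or "")]


def arrange(doc, phones):
    """Arrangement: build the hierarchy of a made-up target schema."""
    article = ET.Element("article")
    section = None
    for p in doc.iter("p"):
        role = p.get("role")
        if role == "title":
            ET.SubElement(article, "title").text = p.text
        elif role == "header":
            section = ET.SubElement(article, "section", heading=p.text)
        else:
            parent = section if section is not None else article
            ET.SubElement(parent, "para").text = p.text
    contacts = ET.SubElement(article, "contacts")
    for phone in phones:
        ET.SubElement(contacts, "phone").text = phone
    return article
```

A typical run chains the stages: doc = interpret(normalize(raw)), then arrange(doc, extract(doc)), and ET.tostring on the result serializes the target XML.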

> 

>The current release uses "precise" rules. An upcoming release lets you 

>define "imprecise", or "fuzzy" rules, which allow you to account for a 

>certain amount of vagueness in the definition of your rules to match the 

>vagueness that your authors may have had when they created the document. 
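
One possible reading of a "fuzzy" rule, sketched in Python: instead of a hard pass/fail on each formatting cue, every cue contributes to a score that is compared with a threshold. The weights, the threshold, and the function names are all assumptions for illustration; the message does not say how the upcoming release implements imprecise rules.

```python
def fuzzy_caption_score(text, bold, size):
    """Return a score between 0 and 1; each cue adds a hypothetical weight."""
    score = 0.0
    if bold:
        score += 0.3
    # A size near 9 earns partial credit rather than failing outright.
    score += 0.3 * max(0.0, 1.0 - abs(size - 9) / 2.0)
    if text.startswith(("Figure", "Listing", "Table")):
        score += 0.4
    return score


def is_caption_fuzzy(text, bold, size, threshold=0.7):
    """Accept the paragraph as a caption when enough cues agree."""
    return fuzzy_caption_score(text, bold, size) >= threshold
```

With these weights a bold size-10 paragraph starting with "Figure" still passes, which a precise "bold, size 9" rule would reject.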

> 

>For more info, check out our white paper on converting unstructured content 

>into meaningful XML, at: http://www.cambridgedocs.com . 

> 

>Hope this helps, 

>Thanks, 

>Riz 

> 

> 

> 

>------------------------------ 

>Riz Virk, (617) 905-3518 

>[EMAIL PROTECTED], [EMAIL PROTECTED] 

>http://www.cambridgedocs.com 

> 

> 

>-----Original Message----- 

>From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On 

>Behalf Of [EMAIL PROTECTED] 

>Sent: Monday, November 18, 2002 1:06 AM 

>To: 'CMS List' 

>Subject: RE: [cms-list] MSWORD to XML (FrameMaker - .doc > .pdf > 

>.xml...?) 

> 

> 

>At 10:19 AM -0500 11/17/02, Rizwan Virk wrote: 

> >Hi Mike, 

> > 

> >Our company has a product that is currently in beta for converting files 

> >into meaningful (i.e. useful) XML. The product, the xDoc Converter, can 

> >convert Word and HTML files into various XML formats. We will add PDF to 

> >XML conversion capabilities very soon. For more information, check us out 

> >at http://www.cambridgedocs.com. 

> 

>What do you do about vaguely structured text which doesn't use named 

>styles? There are so very many of them in Word... They do have a 

>title, they do have headings, but they're deeply inconsistent. I 

>know because I reformat them before I make any changes. 

> 

>Avi 

> 

>PS Converting to FrameMaker will not add named styles or structure to 

>an unorganized document. It's a great tech writing program but 

>cannot perform miracles. 

> 

>-- 

>Complete Guide to Search Engines for Web Sites and Intranets 

> 


>-- 

>http://cms-list.org/ 

>trim your replies for good karma. 




