Rizwan,

Is the product ready or still under development?

Thanks,
Hari

>From: "Rizwan Virk"
>To: , "'CMS List'"
>Subject: RE: [cms-list] MSWORD to XML (FrameMaker - .doc > .pdf > .xml...?)
>Date: Mon, 18 Nov 2002 09:46:24 -0500
>
>Hello Avi,
>
>That's a good point. The more "consistent" and "well-structured" the
>document, the easier it is to extract meaningful information out of it.
>Of course, if "styles" are used everywhere, then this is a big plus, but
>we all know that authors don't always use styles.
>
>If not, however, you can still define rules based on formatting, based on
>content, based on examples, etc. For example, even though a style may not
>have been used for captions, you can find out that in this document most
>captions are bold, size 9, except for a few, which occur just after an
>image but are actually size 10, and that all of the captions contain one
>of the following terms: Figure X or Listing X or Table X. For most
>documents you can define rules that pick up 80%-90% of the formatting
>conventions.
>
>We provide a visual interface for defining rules of arbitrary complexity
>that can be used to identify formatting and then extract content. Because
>we are starting sometimes from PDF files, sometimes from HTML files, and
>sometimes from Word files, we aren't entirely dependent on styles, which
>may or may not be used in a particular document.
>
>Our process consists of:
> -Normalization - we first produce a basic XML document using very simple
>XML that can be used to reproduce the original document as well as for
>parsing. It is much easier to parse this XML document than it is to parse
>a typical HTML file or Word file.
>
> -Interpretation - this paragraph is not just a paragraph - it is the
>"TITLE" paragraph, or the "SUBTITLE" paragraph, or a section header, or a
>caption, etc. This is where the rules related to formatting and placement
>on the page come in.
>
> -Extraction - pulling information such as product names, dates, phone
>numbers, etc. out of the middle of paragraphs.
>
> -Arrangement - finally arranging the meaningful elements into the
>hierarchy required by the target XML DTD or schema. Some schemas are very
>hierarchical while others are relatively flat.
>
>The current release uses "precise" rules. An upcoming release lets you
>define "imprecise", or "fuzzy" rules, which allow you to account for a
>certain amount of vagueness in the definition of your rules to match the
>vagueness that your authors may have had when they created the document.
>
>For more info, check out our white paper on converting unstructured
>content into meaningful XML, at: http://www.cambridgedocs.com.
>
>Hope this helps,
>Thanks,
>Riz
>
>------------------------------
>Riz Virk, (617) 905-3518
>[EMAIL PROTECTED], [EMAIL PROTECTED]
>http://www.cambridgedocs.com
>
>
>-----Original Message-----
>From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On
>Behalf Of [EMAIL PROTECTED]
>Sent: Monday, November 18, 2002 1:06 AM
>To: 'CMS List'
>Subject: RE: [cms-list] MSWORD to XML (FrameMaker - .doc > .pdf >
>.xml...?)
>
>
>At 10:19 AM -0500 11/17/02, Rizwan Virk wrote:
> >Hi Mike,
> >
> >Our company has a product that is currently in beta for converting files
> >into meaningful (i.e. useful) XML. The product, the xDoc Converter, can
> >convert Word and HTML files into various XML formats. We will add PDF to
> >XML conversion capabilities very soon. For more information, check us out
> >at http://www.cambridgedocs.com.
>
>What do you do about vaguely structured text which doesn't use named
>styles? There are so very many of them in Word... They do have a
>title, they do have headings, but they're deeply inconsistent. I
>know because I reformat them before I make any changes.
>
>Avi
>
>PS Converting to FrameMaker will not add named styles or structure to
>an unorganized document. It's a great tech writing program but
>cannot perform miracles.
>
>--
>Complete Guide to Search Engines for Web Sites and Intranets
>
>--
>http://cms-list.org/
>trim your replies for good karma.
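
For anyone following the thread, the caption rule Riz describes (formatting plus content, no named styles) could be sketched roughly as below. This is only an illustration, not how the xDoc Converter works: the Paragraph record, its fields, and the function names are all hypothetical, and it assumes the size-10 exceptions are still bold, which the message does not say explicitly.

```python
import re
from dataclasses import dataclass


@dataclass
class Paragraph:
    """Hypothetical normalized paragraph: text plus basic formatting facts."""
    text: str
    bold: bool
    size: float
    follows_image: bool = False


# Content rule from the message: captions mention "Figure X", "Listing X" or "Table X".
CAPTION_TERM = re.compile(r"\b(Figure|Listing|Table)\s+\d+")


def is_caption(p: Paragraph) -> bool:
    """Combine the content rule with the two formatting rules."""
    if not CAPTION_TERM.search(p.text):
        return False
    if not p.bold:  # assumption: every caption is bold, including the exceptions
        return False
    if p.size == 9:  # the common case: bold, size 9
        return True
    # the exception: size 10, but only directly after an image
    return p.size == 10 and p.follows_image


def label_paragraphs(paragraphs):
    """Interpretation step: tag each paragraph as 'caption' or plain 'paragraph'."""
    return [("caption" if is_caption(p) else "paragraph", p.text) for p in paragraphs]


sample = [
    Paragraph("Figure 3: Conversion pipeline", bold=True, size=9),
    Paragraph("Table 2: Supported formats", bold=True, size=10, follows_image=True),
    Paragraph("See Figure 3 for details.", bold=False, size=11),
    Paragraph("Listing 1: Sample rule", bold=True, size=10),
]
labels = label_paragraphs(sample)
```

With the sample above, the first two paragraphs are tagged as captions and the last two are not: one only mentions a figure in running text, and the other is size 10 without a preceding image.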
