There is an important distinction to be made between what the raw format converters provide and what the post-processing clean-up does. The format converters target XHTML, and the DocBook is converted from that.
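To make that distinction concrete, here is a minimal sketch in Python of the kind of inference a post-processing step might perform on raw converter output. It is not the code that ships in the conversion application. The sample markup, the `promote_headings` function, and the 14pt threshold are all hypothetical, and the sample omits the XHTML namespace to keep the example short. It assumes the raw output is a flat run of p elements carrying inline styles, and promotes a paragraph to a heading only when it is both bold and large.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical raw converter output: flat <p> elements with inline styles.
# (Real XHTML would carry a namespace; it is left out here for brevity.)
RAW = """<div>
<p style="font-size:18pt;font-weight:bold">Introduction</p>
<p style="font-size:10pt">Body text of the first section.</p>
<p style="font-size:10pt;font-weight:bold">Not a heading, just emphasis.</p>
</div>"""


def parse_style(style):
    """Turn 'a:b;c:d' into {'a': 'b', 'c': 'd'}."""
    pairs = (item.split(":", 1) for item in style.split(";") if ":" in item)
    return {key.strip().lower(): value.strip().lower() for key, value in pairs}


def promote_headings(root, min_pt=14.0):
    """Rename <p> to <h2> when it is bold and at least min_pt points in size."""
    # Materialize the list first so that renaming tags cannot disturb iteration.
    for p in list(root.iter("p")):
        style = parse_style(p.get("style", ""))
        match = re.match(r"([\d.]+)pt$", style.get("font-size", ""))
        size = float(match.group(1)) if match else 0.0
        if style.get("font-weight") == "bold" and size >= min_pt:
            p.tag = "h2"
            del p.attrib["style"]  # presentation is now implied by the tag
    return root


root = promote_headings(ET.fromstring(RAW))
print(ET.tostring(root, encoding="unicode"))
```

A rule this simple is exactly the sort of inference that can go wrong on arbitrary documents: a large bold pull quote would be promoted too, and a heading set in small caps would be missed.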
There is a trade-off between good structure and fidelity of rendering in the converted result. Rendering fidelity is also limited by the capabilities of HTML/CSS, by variation in screen resolution, and by the availability of fonts with slightly different metrics. If you care about exact layout rendering, you lose some things with respect to structure (e.g. paragraphs that break across page boundaries).

Word and PDF are very different situations with respect to the raw format converters. PDF (in general) has no concept of even characters, never mind words and paragraphs, and certainly not lists and tables. Everything is a difficult act of inference. The converter we license has a lot of options and bells and whistles, and supports a notion of adding annotations (either to the PDF or to the config file driving the conversion, in Converters/cvtpdf) to help identify tables. But at the end of the day, if you want the tabular structure marked up as such in your output, it is very, very difficult, and the results vary from not bad to awful. Anyway, what you get out is almost entirely p elements.

Post-processing (which in the conversion application in CPF is handled by several steps in the pipeline handling XHTML) does a lot (a lot!) of cleanup and inferencing to identify headers, lists, and blocks of sections.

There are two basic kinds of PDF->XHTML conversion we provide. One is geared towards getting good paragraph/section structure, which makes the rendering suffer, because paragraphs reflow and you lose multi-column positioning and exact positioning of images. The other is geared towards getting good fidelity of rendering, which makes structure suffer, because each separately positioned piece of text has to become its own paragraph.

In general, the PDF conversion can do a very good job of getting the presentation features right. When it fails, it is usually either a font-substitution issue or incomplete font mappings in the PDF.
The former you solve by installing the fonts; the latter you solve by adding appropriate mappings to the PDF conversion config file.

When it comes to misidentification of lists or headers, this is because the post-processing code made a bad inference. Certainly we strive to make this code better, and welcome bug reports, but there are limits to what is possible against an unknown set of arbitrary documents. If you have documents that follow specific conventions, you can often do better with custom code.

Word is a somewhat different story, because Word actually knows something about paragraphs and lists and tables. The format converter uses the styles to make the identification. The problem is that a lot of people don't apply styles consistently to their Word documents: e.g. to make a header they apply a paragraph style and then just make it big and bold, rather than applying a header style. So again, post-processing is called upon to do a lot of clean-up and inferencing to fix some of these mistakes. I have never seen a case of missing text, however, and would be interested in seeing the documents that cause the problem so we can fix it (if you are willing to share).

There are not many options available in the Word raw format converter. This is because the rendering in Word is already more like XHTML, with reflowing and so on.

So, the capsule summary is: be clear about your goals. Getting both the perfect structure you as a human infer from a document and the exact rendering is fairly close to impossible. It is possible that you can provide better heuristics and post-processing than what ships in the default conversion application, or post-processing that makes different choices about the trade-offs to better suit your goals.

//Mary

[email protected]
Principal Engineer
Mark Logic Corporation
