There is an important distinction to be made between
what the raw format converters provide and what
the post-processing clean-up does.  The format
converters target XHTML, and the DocBook is
converted from that.

There is a trade-off between good structure and
fidelity of the rendering of the converted result.
The fidelity of rendering is also limited by the
capabilities of HTML/CSS and variances in
screen resolution and availability of fonts with
slightly different metrics.  If you care about
exact layout rendering, you give up some structure
(e.g. a paragraph that breaks across a page
boundary ends up split in two).

Word and PDF are very different situations with
respect to the raw format converters:

PDF (in general) has no concept of even characters,
never mind words and paragraphs, and certainly
not lists and tables.  Everything is a difficult act
of inference.  The converter we license has a lot
of options and bells and whistles, and supports a
notion of adding annotations (either to the PDF or
to the config file driving the conversion: in
Converters/cvtpdf) to help identify tables, but at
the end of the day it is very difficult and the
results vary from not bad to awful, at least if you
want the tabular structure marked up as such in your
output.  Either way, what you get out is almost entirely p
elements.  Post-processing (which in the conversion
application in CPF is handled by several steps
in the pipeline handling XHTML) does a lot (a lot!)
of cleanup and inferencing to identify headers
and lists and blocks of sections.  For the PDF
conversion there are two basic kinds of PDF->XHTML
conversion we provide: one is geared towards getting
good paragraph/section structure (which makes
the rendering suffer, because paragraphs reflow
and you lose multi-column positioning or exact
positioning of images), and one geared towards
getting good fidelity of rendering (which makes
structure suffer, as you have to have each separately
positioned piece of text become its own paragraph).
In general, the PDF conversion can do a very good
job of getting the presentation features right.
When it fails it is usually either a font-substitution
issue or incomplete font-mappings in the PDF.
The former you solve by installing the fonts;
the latter you solve by adding appropriate mappings
to the PDF conversion config file.
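
To make the structure/fidelity trade-off concrete, here is a
minimal sketch in Python.  It is not the converter's code:
the (top, text) fragment representation, the function name,
and the 1.5 x line-height threshold are all assumptions made
up for illustration.  It shows what the structure-oriented
mode has to do in spirit: glue separately positioned pieces
of text back into paragraphs, throwing the positions away.

```python
from typing import List, Tuple

def merge_fragments(fragments: List[Tuple[float, str]],
                    line_height: float = 12.0) -> List[str]:
    """Merge positioned text fragments into paragraphs.

    Each fragment is (top, text), where top is a hypothetical
    vertical offset in points.  Fragments whose vertical gap is
    at most 1.5 line heights are treated as one paragraph.
    """
    paragraphs: List[str] = []
    prev_top = None
    for top, text in sorted(fragments, key=lambda f: f[0]):
        if prev_top is not None and top - prev_top <= 1.5 * line_height:
            # Close enough to the previous line: same paragraph.
            paragraphs[-1] += " " + text
        else:
            # Large gap (or first fragment): start a new paragraph.
            paragraphs.append(text)
        prev_top = top
    return paragraphs
```

Note that the output keeps no coordinates at all, which is
exactly why rendering suffers once paragraphs are merged.
The fidelity-oriented mode corresponds to skipping the merge
and keeping one paragraph per fragment.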

When it comes to misidentification of lists or headers,
this is because the post-processing code made a bad
inference.  Certainly we strive to make this code
better, and welcome bug reports, but there are limits
to what is possible against an unknown set of
arbitrary documents.  If you have documents that
follow specific conventions, you can often do better
with custom code.
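
As an illustration of what "custom code" for a known
convention might look like, here is a small Python sketch.
Everything in it is hypothetical: it assumes your documents
number their headings ("1. Introduction", "1.2 Scope"), and
that the converted XHTML has had its namespace stripped for
brevity.  It is not how the shipped post-processing works.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical house convention: headings look like "1. Title" or "2.3 Title".
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+\S")

def promote_headings(xhtml: str) -> str:
    """Rename <p> elements that match the numbering convention to <h1>..<h6>."""
    root = ET.fromstring(xhtml)
    for p in root.iter("p"):
        text = (p.text or "").strip()
        m = HEADING.match(text)
        # Short, numbered paragraphs are assumed to be headings.
        if m and len(text) < 80:
            depth = m.group(1).count(".") + 1
            p.tag = "h%d" % min(depth, 6)
    return ET.tostring(root, encoding="unicode")
```

A rule this blunt will misfire on arbitrary input (a short
paragraph starting "2009 was a good year" would be promoted),
which is the point: it only works because you know your own
documents.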

Word is a somewhat different story, because Word
actually knows something about paragraphs and
lists and tables.  The format converter uses the
styles to make the identification.  The problem
is that a lot of people don't apply styles consistently
to their Word documents: e.g. to make a header
they apply a paragraph style and then just make
it big and bold, rather than applying a header style.
So again, post-processing is called upon to do
a lot of clean-up and inferencing to fix some of
these mistakes.  I have never seen a case of missing
text, however, and would be interested in seeing the
documents that cause the problem so we can fix it
(if you are willing to share).  There are not many
options available in the Word raw format converter:
this is because the rendering in Word is already more
like XHTML, with reflowing and so on.
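
The "big and bold instead of a header style" case can be
sketched like this in Python.  Again this is only an
illustration under assumptions: I am supposing the formatting
shows up as an inline style attribute with font-weight and a
font-size in points, and the 11pt body size and 1.3 ratio are
invented thresholds.  The real converter output and the real
clean-up code may look quite different.

```python
import xml.etree.ElementTree as ET

def parse_style(style: str) -> dict:
    """Turn 'a: b; c: d' into {'a': 'b', 'c': 'd'} (lowercased)."""
    pairs = (item.split(":", 1) for item in style.split(";") if ":" in item)
    return {k.strip().lower(): v.strip().lower() for k, v in pairs}

def looks_like_header(p: ET.Element, body_pt: float = 11.0) -> bool:
    """Guess whether a paragraph was hand-formatted as a header."""
    style = parse_style(p.get("style", ""))
    size = style.get("font-size", "")
    if not size.endswith("pt"):
        return False
    try:
        pt = float(size[:-2])
    except ValueError:
        return False
    # Bold and noticeably larger than body text: probably a header.
    return style.get("font-weight") == "bold" and pt >= 1.3 * body_pt
```

For example, a p styled "font-weight: bold; font-size: 16pt"
would be flagged, while an 11pt paragraph or a bold paragraph
at body size would not.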

So, the capsule summary is: be clear about your goals.
Getting both the perfect structure you as a human infer
from a document and the exact rendering is fairly close
to impossible.  You may well be able to write heuristics
and post-processing that do better than what ships in the
default conversion application, or that make different
choices about the trade-offs to better suit your goals.

//Mary

[email protected]
Principal Engineer
Mark Logic Corporation
