On 04.09.2009 15:22, Dola Woolfe wrote:
(Sounds like more than the 1 hour I was allocating for it.)

PDF is not meant to be parsed for advanced text processing; it was
designed for presentation. A PDF generator can make extracting text
from the file arbitrarily hard. As an extreme (and rather theoretical)
example, a PDF could contain two text streams, "Tiset" and "hsiatx",
with embedded positioning commands, which read on the screen as
"This is a text". In any case, even putting up reasonable guards
against out-of-order text blocks will take a few days, unless you find
a ready-to-use library for this task (no, I don't have pointers).
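
To make the two-stream example concrete, here is a toy sketch in
Python. It is only an illustration under loud assumptions: the
(column, character) glyph model, the column numbers, and both function
names are made up for this mail, and a real PDF uses text matrices and
font metrics rather than integer columns.

```python
# Hypothetical model: each glyph is (column, character). The column values
# are invented so that the two streams interleave into a readable line.
stream_a = list(zip([0, 2, 6, 11, 13], "Tiset"))
stream_b = list(zip([1, 3, 5, 8, 10, 12], "hsiatx"))


def naive_extract(*streams):
    """Concatenate characters in file order, ignoring positions."""
    return "".join(ch for stream in streams for _, ch in stream)


def positional_extract(*streams):
    """Order glyphs by column; emit a space wherever a column is skipped."""
    glyphs = sorted(g for stream in streams for g in stream)
    out = []
    prev_col = None
    for col, ch in glyphs:
        if prev_col is not None and col - prev_col > 1:
            out.append(" ")
        out.append(ch)
        prev_col = col
    return "".join(out)


print(naive_extract(stream_a, stream_b))       # Tisethsiatx
print(positional_extract(stream_a, stream_b))  # This is a text
```

Even this toy needs a sort and a gap heuristic for a single line. Real
documents add multiple lines, columns, rotated text and varying glyph
widths, which is where the days go.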

If you can, try to get your source text in a more processing-friendly
format, like DocBook XML.

J.Pietschmann
