[poppler] pdf to xml update

Jauco Noordzij Sun, 13 Aug 2006 08:38:05 -0700

Hi folks,

I'm still working on the pdf-to-structured-xml outputdev and I have
published a first trial version of the patch at
http://jauco.nl/blog/?p=27 . I am wondering what you guys think of it :)


It parses the pages, aggregating text blocks in much the same way as
the current TextOutputDev. It then chunks the page into a tree of
nested 'splits', i.e. the page is split in two, then each of the two
parts is split in two, and so on. This tree is then turned into blocks
and paragraphs. The process is a bit hard to explain, but works quite
well. If anyone is really interested I suggest they search for
'recursive XY cut' in Google Scholar. The result is a tree that has the
text in reading order (even for quite complex layouts), and from that
tree the outputdev can recognise blocks of text and columns.
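
To give a rough idea of the splitting step, here is a minimal sketch of
a recursive XY cut in Python. It is not the code from the patch: the
Box type, the min_gap threshold, the "cut at the widest empty gap" rule
and the function names are all assumptions made for illustration, and
coordinates are assumed to have y growing downwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical text block: bounding box plus its text (y grows downwards)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def widest_gap(intervals, min_gap):
    """Return (width, midpoint) of the widest empty gap of at least min_gap
    between the given (lo, hi) intervals, or (0.0, None) if there is none."""
    best_width, best_pos = 0.0, None
    reach = None  # right-most coordinate covered so far
    for lo, hi in sorted(intervals):
        if reach is None:
            reach = hi
            continue
        width = lo - reach
        if width >= min_gap and width > best_width:
            best_width, best_pos = width, (lo + reach) / 2.0
        reach = max(reach, hi)
    return best_width, best_pos


def xy_cut(boxes, min_gap=5.0):
    """Split the boxes in two along the widest empty gap (horizontal or
    vertical) and recurse. Returns a tree of nested lists with Box leaves."""
    if len(boxes) <= 1:
        return list(boxes)
    y_width, y_pos = widest_gap([(b.y0, b.y1) for b in boxes], min_gap)
    x_width, x_pos = widest_gap([(b.x0, b.x1) for b in boxes], min_gap)
    if y_pos is None and x_pos is None:
        # No usable gap left: treat this group as one block.
        return sorted(boxes, key=lambda b: (b.y0, b.x0))
    if x_pos is None or (y_pos is not None and y_width >= x_width):
        # Horizontal cut: upper part first, then lower part.
        first = [b for b in boxes if b.y1 <= y_pos]
        second = [b for b in boxes if b.y0 >= y_pos]
    else:
        # Vertical cut: left part first, then right part.
        first = [b for b in boxes if b.x1 <= x_pos]
        second = [b for b in boxes if b.x0 >= x_pos]
    return [xy_cut(first, min_gap), xy_cut(second, min_gap)]


def reading_order(tree):
    """Flatten the split tree depth-first into a list of texts."""
    out = []
    for node in tree:
        if isinstance(node, Box):
            out.append(node.text)
        else:
            out.extend(reading_order(node))
    return out


# A full-width title above two columns, given in scrambled order.
page = [
    Box(110, 47, 200, 57, "R2"),
    Box(0, 30, 90, 40, "L1"),
    Box(0, 0, 200, 10, "Title"),
    Box(110, 30, 200, 40, "R1"),
    Box(0, 45, 90, 55, "L2"),
]
print(reading_order(xy_cut(page)))
```

Walking the tree depth-first yields the title, then the left column,
then the right column, which is the kind of reading order described
above. A real implementation would need a smarter rule for choosing
cuts than "widest gap wins".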

I also have a quick question: Is there a callback function for
outputdevs that gets called at the end of processing the pdf, like the
one that's called at the end of the page? That would be a nice place
to do some multi-page analysing and add a function to convert the
whole structured tree to

--
Greetings,
Jauco Noordzij
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
