Hi folks,

I'm still working on the pdf-to-structured-xml outputdev and I have published a first attempt at a patch here: http://jauco.nl/blog/?p=27
I am wondering what you guys think of it :)

It parses the pages, aggregating text blocks in much the same way as the current TextOutputDev. It then chunks the page into a tree of nested 'splits', i.e. the page is split in two, then each of the two parts is split in two, and so on. This tree is then turned into blocks and paragraphs. The process is a bit hard to explain, but it works quite well. If anyone is really interested, I suggest searching for 'recursive XY cut' on Google Scholar. The result is a tree that has the text in reading order (even for quite complex layouts), and from that tree the outputdev can recognise blocks of text and columns.

I also have a quick question: is there a callback function for outputdevs that gets called at the end of processing the PDF, like the one that's called at the end of each page? That would be a nice place to do some multi-page analysis and to add a function to convert the whole structured tree to

--
Greetings,
Jauco Noordzij
_______________________________________________
poppler mailing list
http://lists.freedesktop.org/mailman/listinfo/poppler
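
For anyone who would rather see the 'recursive XY cut' idea in code than dig through papers, here is a rough Python sketch of the general technique. It is not the code from the patch: the names (`xy_cut`, `widest_gap`, `reading_order`), the box format and the `min_gap` threshold are all made up for illustration, and it simply cuts at the widest empty gap in either direction, which is only one of several possible splitting rules.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1), with y growing downward.
Box = Tuple[float, float, float, float]


def widest_gap(boxes: List[Box], lo: int, hi: int):
    """Return (gap_width, cut_position) for the widest empty interval
    along one axis; lo/hi are the tuple indices of that axis."""
    spans = sorted((b[lo], b[hi]) for b in boxes)
    best = (0.0, None)
    reach = spans[0][1]  # furthest extent covered so far
    for start, end in spans[1:]:
        if start - reach > best[0]:
            best = (start - reach, (start + reach) / 2)
        reach = max(reach, end)
    return best


def xy_cut(boxes: List[Box], min_gap: float = 5.0):
    """Recursively split boxes in two. A leaf is a list of boxes (one
    block); an inner node is (kind, first_part, second_part)."""
    if len(boxes) < 2:
        return list(boxes)
    gap_x, cut_x = widest_gap(boxes, 0, 2)
    gap_y, cut_y = widest_gap(boxes, 1, 3)
    best_gap = max(gap_x, gap_y)
    if best_gap <= 0 or best_gap < min_gap:
        # No usable whitespace left: treat these boxes as one block.
        return sorted(boxes, key=lambda b: (b[1], b[0]))
    if gap_y >= gap_x:
        # Horizontal cut: upper part is read before the lower part.
        first = [b for b in boxes if b[3] <= cut_y]
        second = [b for b in boxes if b[1] >= cut_y]
        kind = "rows"
    else:
        # Vertical cut: left part is read before the right part.
        first = [b for b in boxes if b[2] <= cut_x]
        second = [b for b in boxes if b[0] >= cut_x]
        kind = "columns"
    return (kind, xy_cut(first, min_gap), xy_cut(second, min_gap))


def reading_order(tree) -> List[Box]:
    """Flatten the split tree depth-first into reading order."""
    if isinstance(tree, list):
        return tree
    _, first, second = tree
    return reading_order(first) + reading_order(second)
```

Given a full-width title box above two columns of text lines, the first cut separates the title from the rest, the second separates the columns, and the lines within each column stay together as one block because the gaps between them are smaller than `min_gap`.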
