As I may have mentioned, you may wish to use the complex output from pdftohtml rather than the xml output option. The complex output provides more functionality, such as creating image regions, calculating text bounding boxes and putting them into the XML, and font-aware text coalescing. It also produces valid XML (save one minor bug, which should be easy to fix).

In response to your particular question, I don't think there is any "in-memory" data-structure tracking the pages as they are created. Since you're doing meta page-level processing, maybe you want to do it as a final optional stage of processing for pdftohtml, after it loops through and creates each page. It could go back through and read in the DOMs it needs, run the calculations, and modify the XML files.

Best,
--josh

On 10/22/11 10:04 AM, "Alec Taylor" <[email protected]> wrote:

>Good morning,
>
>I'm trying to figure out how to analyse (in memory) 3 pages from the
>pdftohtml -xml book.pdf stream, (so before it is written to the
>book.xml output file).
>
>Due to the enhancement I'm implementing onto pdftohtml, my algorithm
>requires analysis of 3 pages at a time.
>
>[p1] R [p2] R [p3]
>then
>[p2] R [p3] R [p4]
>continue till no pages are left
>
>(where 'R' refers to the relation I'm running on each page trio)
>
>How do I run this relation? - Preferably using some data-structure
>(i.e. intermediary in-memory XML for analysis with libxml2 libraries)
>
>Thanks for all suggestions,
>
>Alec Taylor
>_______________________________________________
>poppler mailing list
>[email protected]
>http://lists.freedesktop.org/mailman/listinfo/poppler
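
P.S. To make the "final stage" idea a bit more concrete, here is a rough sketch of the post-processing pass in Python, using the standard library's xml.etree.ElementTree in place of libxml2. It is only an illustration under several assumptions: that each page ends up in its own XML file, that text runs are stored as <text> elements, and that the relation R can be written as a function of three page DOMs. The names page_trios, relation, postprocess, and the trio-text-counts attribute are all made up for the example and are not part of pdftohtml.

```python
import xml.etree.ElementTree as ET


def page_trios(pages):
    """Yield overlapping windows (p1, p2, p3), (p2, p3, p4), ... until pages run out."""
    for i in range(len(pages) - 2):
        yield pages[i], pages[i + 1], pages[i + 2]


def relation(prev_page, page, next_page):
    """Hypothetical stand-in for R: count the <text> elements in each page of the trio."""
    return [len(p.findall(".//text")) for p in (prev_page, page, next_page)]


def postprocess(xml_paths):
    """Read every page DOM, run R on each trio, and write the modified files back."""
    trees = [ET.parse(path) for path in xml_paths]
    roots = [tree.getroot() for tree in trees]
    for prev_page, page, next_page in page_trios(roots):
        counts = relation(prev_page, page, next_page)
        # Store the result on the middle page; the attribute name is an assumption.
        page.set("trio-text-counts", ",".join(str(c) for c in counts))
    for tree, path in zip(trees, xml_paths):
        tree.write(path, encoding="utf-8", xml_declaration=True)
```

Calling postprocess(["page1.xml", "page2.xml", "page3.xml", "page4.xml"]) would run R on (1, 2, 3) and then (2, 3, 4). For simplicity the sketch keeps every page DOM in memory at once; for a large book one could instead keep only the current three pages loaded and write each page out once it leaves the window.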
