Thanks Josh, that was my previous plan, but I thought it might be less efficient than processing as it goes.
But perhaps that's not the case...

On Mon, Oct 24, 2011 at 2:51 PM, Josh Richardson <[email protected]> wrote:
> As I may have mentioned, you may wish to use the complex output from
> pdftohtml rather than the xml output option. The complex output provides
> more functionality like creating image regions, calculating text bounding
> boxes and putting them into the xml, and font-aware text-coalescing. It
> also produces valid XML (save one minor bug, which should be easy to fix.)
>
> In response to your particular question, I don't think there is any
> "in-memory" data-structure tracking the pages as they are created. Since
> you're doing meta page-level processing, maybe you want to do it as a
> final optional stage of processing for pdftohtml, after it loops through
> and creates each page. It could go back through and read in the DOMs it
> needs, run the calculations, and modify the XML files.
>
> Best,
> --josh
>
> On 10/22/11 10:04 AM, "Alec Taylor" <[email protected]> wrote:
>
>>Good morning,
>>
>>I'm trying to figure out how to analyse (in memory) 3 pages from the
>>pdftohtml -xml book.pdf stream, (so before it is written to the
>>book.xml output file).
>>
>>Due to the enhancement I'm implementing onto pdftohtml, my algorithm
>>requires analysis of 3 pages at a time.
>>
>>[p1] R [p2] R [p3]
>>then
>>[p2] R [p3] R [p4]
>>continue till no pages are left
>>
>>(where 'R' refers to the relation I'm running on each page trio)
>>
>>How do I run this relation? - Preferably using some data-structure
>>(i.e. intermediary in-memory XML for analysis with libxml2 libraries)
>>
>>Thanks for all suggestions,
>>
>>Alec Taylor
>>_______________________________________________
>>poppler mailing list
>>[email protected]
>>http://lists.freedesktop.org/mailman/listinfo/poppler
>>
>
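
For illustration only, the "process as it goes" idea from the quoted
question (run R on [p1][p2][p3], then [p2][p3][p4], and so on) could be
sketched as a three-page sliding window. This is a minimal sketch under
stated assumptions, not pdftohtml code: Page and PageWindow are hypothetical
names, and the xml string stands in for whatever per-page structure is
actually used (for example a libxml2 document).

```cpp
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for one page held in memory. In practice this
// might wrap a libxml2 document rather than a plain string.
struct Page {
    int number;
    std::string xml;
};

// Holds at most three pages and runs a relation on every consecutive
// trio: (p1,p2,p3), then (p2,p3,p4), until no pages are left.
class PageWindow {
public:
    using Relation = std::function<void(Page&, Page&, Page&)>;

    explicit PageWindow(Relation relation) : relation_(std::move(relation)) {}

    // Feed pages one at a time, in order, as they are produced.
    void push(Page page) {
        window_.push_back(std::move(page));
        if (window_.size() > 3) {
            // The oldest page has taken part in every trio it belongs to.
            // A real implementation would write it out before dropping it.
            window_.pop_front();
        }
        if (window_.size() == 3) {
            relation_(window_[0], window_[1], window_[2]);
        }
    }

private:
    Relation relation_;
    std::deque<Page> window_;
};
```

Pushing pages 1 to 5 into such a window would call the relation three
times, on pages (1,2,3), (2,3,4) and (3,4,5). Documents with fewer than
three pages never trigger it, so that case would need separate handling.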
