It depends on exactly what you're doing and on your use case, but yes, processing as you go could be more efficient. From what I'm imagining, though, the lost efficiency may be a worthwhile tradeoff for keeping the program structure simpler.

Best, --josh

On 10/23/11 9:20 PM, "Alec Taylor" <[email protected]> wrote:

>Thanks Josh, that was my previous plan, but I thought it might be less
>efficient than processing as it goes.
>
>But perhaps that's not the case...
>
>On Mon, Oct 24, 2011 at 2:51 PM, Josh Richardson <[email protected]> wrote:
>> As I may have mentioned, you may wish to use the complex output from
>> pdftohtml rather than the xml output option. The complex output provides
>> more functionality like creating image regions, calculating text bounding
>> boxes and putting them into the xml, and font-aware text-coalescing. It
>> also produces valid XML (save one minor bug, which should be easy to fix.)
>>
>> In response to your particular question, I don't think there is any
>> "in-memory" data-structure tracking the pages as they are created. Since
>> you're doing meta page-level processing, maybe you want to do it as a
>> final optional stage of processing for pdftohtml, after it loops through
>> and creates each page. It could go back through and read in the DOMs it
>> needs, run the calculations, and modify the XML files.
>>
>> Best, --josh
>>
>> On 10/22/11 10:04 AM, "Alec Taylor" <[email protected]> wrote:
>>
>>>Good morning,
>>>
>>>I'm trying to figure out how to analyse (in memory) 3 pages from the
>>>pdftohtml -xml book.pdf stream, (so before it is written to the
>>>book.xml output file).
>>>
>>>Due to the enhancement I'm implementing onto pdftohtml, my algorithm
>>>requires analysis of 3 pages at a time.
>>>
>>>[p1] R [p2] R [p3]
>>>then
>>>[p2] R [p3] R [p4]
>>>continue till no pages are left
>>>
>>>(where 'R' refers to the relation I'm running on each page trio)
>>>
>>>How do I run this relation? - Preferably using some data-structure
>>>(i.e. intermediary in-memory XML for analysis with libxml2 libraries)
>>>
>>>Thanks for all suggestions,
>>>
>>>Alec Taylor
>>>_______________________________________________
>>>poppler mailing list
>>>[email protected]
>>>http://lists.freedesktop.org/mailman/listinfo/poppler
>>>
>>
>>
>
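
For illustration, the page-trio pattern from the original question ([p1] R [p2] R [p3], then [p2] R [p3] R [p4], and so on) might be sketched roughly as below. This is only a sketch under several assumptions: it uses Python's standard xml.etree.ElementTree rather than libxml2 or pdftohtml's own C++ code, the sample XML is made up and only loosely shaped like pdftohtml -xml output, and relation() is a hypothetical stand-in for 'R' that merely counts text elements. The same generator would work either as a final pass over the written XML or on pages handed over one at a time, since it never holds more than three pages.

```python
from collections import deque
import xml.etree.ElementTree as ET


def page_trios(pages):
    """Yield overlapping (p1, p2, p3) tuples from any iterable of pages."""
    window = deque(maxlen=3)  # the oldest page drops out automatically
    for page in pages:
        window.append(page)
        if len(window) == 3:
            yield tuple(window)


def relation(p1, p2, p3):
    """Hypothetical stand-in for 'R': count the <text> children in the trio."""
    return sum(len(p.findall("text")) for p in (p1, p2, p3))


# Made-up sample; the element and attribute names are assumptions.
SAMPLE = """<pdf2xml>
  <page number="1"><text>a</text></page>
  <page number="2"><text>b</text><text>c</text></page>
  <page number="3"><text>d</text></page>
  <page number="4"></page>
</pdf2xml>"""

root = ET.fromstring(SAMPLE)

# Record the result of R against the middle page of each trio.
results = [
    (p2.get("number"), relation(p1, p2, p3))
    for p1, p2, p3 in page_trios(root.iter("page"))
]
print(results)
```

A document with fewer than three pages yields no trios at all, so any handling of the first and last pages would need to be decided separately.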
