Hmm. MuPDF, bless their hearts, is a cool bit of tech, but MUCH less sophisticated than Poppler. Assuming I found the right project, pdfdraw is no exception -- it is a very small piece of code that doesn't do any structure analysis; it looks like it just spits out whatever blobs are natively in the PDF. If you find that I'm wrong about that, please let me know.

If you start with Poppler, and my version of pdftohtml in particular, then you at least start out with a notion of words, lines of text, and paragraphs -- though none of these is very accurate. Each of those entities is tagged with font size and style. You also get bounding boxes on all that text, as well as image objects (coalesced from multiple draw operations), which I use to find the page margins and which can be extended to find some of the other items you're interested in.

Best,
--josh

On 10/11/11 9:08 PM, "Alec Taylor" <[email protected]> wrote:

>Thanks Josh, I was actually researching quite heavily, and found
>myself on the #ghostscript channel @ freenode
>
>They pointed me to MuPDF (one of their projects), and it seems like
>the "pdfdraw" example project is something to work from, either
>directly or through parsing XML output from it.
>
>However, if this doesn't suit your needs, please tell me why, as I
>might have the same problem, and then I'll join forces! :]
>
>On Wed, Oct 12, 2011 at 3:44 AM, Josh Richardson <[email protected]> wrote:
>> Thanks for the pointer, Glad.
>>
>> FYI, I am also interested in being able to analyze document structure.
>> Our first step is to put the text back together, since in many PDFs it
>> is not logically organized in the original PDF. pdftohtml has a
>> "coalesce" function which is the starting point for us. We have made
>> some improvements on it which are not yet contributed back -- so let
>> me know if you want the source and/or if you want to join forces.
>>
>> --josh
>>
>> On 10/11/11 12:31 AM, "Glad Deschrijver" <[email protected]>
>> wrote:
>>
>>>On Tuesday 11 October 2011, Alec Taylor wrote:
>>>> Good afternoon,
>>>>
>>>> Do you have some recommendations and/or sample code for comparing
>>>> textual and geometric layout information across pages?
>>>>
>>>> Basically I'm trying to recognise patterns within documents, e.g.,
>>>> page numbers, headers and footers, titles, column information, etc.,
>>>> using the capabilities of the Poppler PDF library.
>>>
>>>Not sure that it will help you much, but you can have a look at DiffPDF
>>>which uses poppler to compare two PDF files page by page (both
>>>textually and visually):
>>>http://www.qtrac.eu/diffpdf.html
>>>
>>>Best regards,
>>>Glad
>>>
>>>--
>>> Everything that is really great and inspiring is created by
>>> the individual who can labor in freedom.
>>> -- Albert Einstein, Out of My Later Years (1950)
>>>
>>>_______________________________________________
>>>poppler mailing list
>>>[email protected]
>>>http://lists.freedesktop.org/mailman/listinfo/poppler
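
For illustration only, here is a minimal sketch of the margin idea from the reply above: given the bounding boxes of the text and image objects on a page, the margins are the gaps between the page edges and the outermost boxes. It assumes the boxes have already been extracted by some other step. The `Box` type, the `page_margins` function, and the top-left coordinate origin are all hypothetical names and conventions chosen for this example; they are not Poppler's API and not the code in the modified pdftohtml.

```python
from dataclasses import dataclass
from typing import Iterable, NamedTuple


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box in page units, origin assumed at top-left."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


class Margins(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def page_margins(boxes: Iterable[Box], page_width: float, page_height: float) -> Margins:
    """Estimate margins as the distance from each page edge to the nearest box."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("cannot estimate margins for a page with no boxes")
    # The union of all boxes is the occupied area of the page.
    left = min(b.x_min for b in boxes)
    top = min(b.y_min for b in boxes)
    right = max(b.x_max for b in boxes)
    bottom = max(b.y_max for b in boxes)
    return Margins(left, top, page_width - right, page_height - bottom)


# Example with made-up boxes on a 612 x 792 page (US Letter in points).
example = page_margins(
    [Box(72, 80, 300, 100), Box(90, 700, 540, 720)],
    page_width=612,
    page_height=792,
)
```

A single stray object (a page number, say) will pull a margin outward, so in practice one would probably compare these values across pages, or ignore boxes that fall outside the bulk of the content, before trusting them.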
