On Wednesday, 12 October 2011, Alec Taylor wrote:
> I can get bounding boxes?
/me points to the various getBBox functions in TextOutputDev.h or to the TextBox class in the Qt4

Albert

> SOLD! - I'll start using your product now :]
>
> On Wed, Oct 12, 2011 at 3:23 PM, Josh Richardson wrote:
> > Hmm. MuPDF, bless their hearts, is a cool bit of tech, but MUCH less
> > sophisticated than Poppler. If I found the right project, pdfdraw is no
> > exception -- a very small piece of code that doesn't do any structure
> > analysis; it looks like it just spits out whatever blobs are natively in
> > the PDF. If you find that I'm wrong about that, please let me know.
> >
> > If you start with Poppler, and my version of pdftohtml in particular,
> > then you at least start out with a notion of words, lines of text, and
> > paragraphs -- albeit that these things are not very accurate. Each of
> > those entities is tagged with font size and style. You also get
> > bounding boxes on all that text, as well as image objects (coalesced
> > from multiple draw operations), which I use to find the page margins,
> > but which can be extended to find some of the other items you're
> > interested in finding.
> >
> > Best, --josh
> >
> > On 10/11/11 9:08 PM, "Alec Taylor" wrote:
> >>Thanks Josh, I was actually researching quite heavily, and found
> >>myself on the #ghostscript channel @ freenode
> >>
> >>They pointed me to MuPDF (one of their projects), and it seems like
> >>the "pdfdraw" example project is something to work from, either
> >>directly or through parsing XML output from it.
> >>
> >>However, if this doesn't suit your needs, please tell me why, as I
> >>might have the same problem, and then I'll join forces! :]
> >>
> >>On Wed, Oct 12, 2011 at 3:44 AM, Josh Richardson wrote:
> >>> Thanks for the pointer, Glad.
> >>>
> >>> FYI, I am also interested in being able to analyze document
> >>> structure.
> >>> Our first step is to put the text back together, since in many PDFs
> >>> it is not logically organized in the original PDF. pdf2html has a
> >>> "coalesce" function which is the starting point for us. We have made
> >>> some improvements on it which are not yet contributed back -- so let
> >>> me know if you want the source and/or if you want to join forces.
> >>>
> >>> --josh
> >>>
> >>> On 10/11/11 12:31 AM, "Glad Deschrijver" wrote:
> >>>>On Tuesday 11 October 2011, Alec Taylor wrote:
> >>>>> Good afternoon,
> >>>>>
> >>>>> Do you have some recommendations and/or sample code for comparing
> >>>>> textual and geometric layout information across pages?
> >>>>>
> >>>>> Basically I'm trying to realise patterns within documents, e.g.,
> >>>>> page numbers, headers and footers, title, column information, etc.,
> >>>>> using the capabilities of the Poppler PDF library.
> >>>>
> >>>>Not sure that it will help you much, but you can have a look at
> >>>>DiffPDF, which uses poppler to compare two PDF files page by page
> >>>>(both textually and visually):
> >>>>http://www.qtrac.eu/diffpdf.html
> >>>>
> >>>>Best regards,
> >>>>Glad
> >>>>
> >>>>--
> >>>>
> >>>> Everything that is really great and inspiring is created by
> >>>> the individual who can labor in freedom.
> >>>> -- Albert Einstein, Out of My Later Years (1950)
> >>>>
> >>>>_______________________________________________
> >>>>poppler mailing list
> >>>>http://lists.freedesktop.org/mailman/listinfo/poppler
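
For anyone following the thread, here is a rough sketch of the kind of cross-page comparison asked about at the start: given lines of text with bounding boxes, find text that recurs at the same vertical position on most pages (candidate headers, footers, and page numbers) and take the union of the boxes as the content area for margin detection. It does not call Poppler at all. The `Line` type, the function names, and the thresholds are all made up for illustration, and the sample data is invented; in practice the boxes would come from something like the getBBox functions mentioned above.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    """One line of text plus its bounding box (hypothetical type, not a Poppler class)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def content_box(lines):
    """Union of the line boxes on one page; the gap to the page edge is the margin."""
    return (
        min(l.x0 for l in lines),
        min(l.y0 for l in lines),
        max(l.x1 for l in lines),
        max(l.y1 for l in lines),
    )


def normalise(text):
    """Collapse digit runs so that 'Page 1' and 'Page 2' compare equal."""
    return re.sub(r"\d+", "#", text.strip())


def repeated_lines(pages, min_fraction=0.8, y_step=2.0):
    """Return (text, y_bucket) keys seen on at least min_fraction of the pages.

    y_step is an assumed tolerance: lines whose top edge falls in the same
    y_step-sized bucket are treated as being at the same height.
    """
    counts = Counter()
    for lines in pages:
        # A set, so that a key counts at most once per page.
        counts.update({(normalise(l.text), round(l.y0 / y_step)) for l in lines})
    needed = min_fraction * len(pages)
    return {key for key, n in counts.items() if n >= needed}


# Invented sample data: three pages with a running title and a page number.
pages = [
    [
        Line("Annual Report", 50, 30, 200, 42),
        Line(f"Body text of page {n}, unique words {'x' * n}", 50, 100, 500, 112),
        Line(f"Page {n}", 260, 800, 300, 812),
    ]
    for n in (1, 2, 3)
]

candidates = repeated_lines(pages)
print(sorted(text for text, _ in candidates))
print(content_box(pages[0]))
```

Bucketing by rounded position is crude (two lines just either side of a bucket boundary will not match), so a real implementation would probably compare boxes with an explicit tolerance instead.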
