Poppler doesn't fully support 1.7. Perhaps 1.3 was an understatement.
I will add in the aforementioned heuristics (I don't know my accuracy yet,
but the kind of algorithms I am implementing have >98% accuracy), using
whatever assistance poppler provides and adhering to the latest standard
poppler supports.

Would appreciate any help you (or anyone else) can give for pushing what I
have separated into XML tags back into the PDF.

Regards,

Alec Taylor

On Fri, Nov 11, 2011 at 9:15 AM, Leonard Rosenthol <[email protected]> wrote:
> I am sorry to be pedantic, but this is EXTREMELY IMPORTANT…
>
> What you are doing is adding HEURISTICS into Poppler to GUESS at the logical
> structure of a PDF. You are NOT actually taking into account any REAL LIVE
> logical structure that was put there by the PDF producer.
>
> PDF 1.3 is about 15 YEARS OLD. NUMEROUS ADVANCES have been made to the
> format. PDF is currently at 1.7, as standardized by the ISO and adopted as
> national standards by almost 50 countries around the world. Version 2.0
> (ISO 32000-2) is almost complete! To work only with 1.3 is, honestly, a
> waste. You are missing HUGE PIECES of functionality found in the majority
> of real-world documents.
>
> I am sure your code is wonderful. However, given that it is based on 1.3
> and does not recognize existing PDF structure, it seems SEVERELY limited in
> real world use.
>
> Leonard
>
> From: Alec Taylor <[email protected]>
> Date: Thu, 10 Nov 2011 13:57:54 -0800
> To: Leonard Rosenthol <[email protected]>
> Cc: "[email protected]" <[email protected]>, Albert Cid <[email protected]>
> Subject: Re: [poppler] Extract title from pdf file.
>
> As was previously mentioned, I am adding the semantic and logical
> structuring into poppler core.
>
> My plan is to figure out what fits into which category by post processing
> the XML. Any suggestions on how to reverse [or post?!] engineer this XML
> back into the PDF would be appreciated.
>
> In a few days I will have a very accurate XML generated with
> <header></header>, <footer></footer> and table of contents tags.
>
> This will involve the "pushing" of the actual "printed" page numbers,
> adding a hyperlink to each ToC entry, and partitioning the page structure
> as far as the 1.3 standard allows.
>
> My code is extremely modular, neat & efficient, and included the writing of
> an OO API. So it should be easily extendable with author, title, publisher,
> year and section title extraction capabilities.

_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
