On 5/29/13 12:13 PM, "Ihar `Philips` Filipau" <[email protected]> wrote:

>There is no 100% reliable way to extract information from PDF.

This is a MUCH TOO COMMON "bubbe meise" (<http://en.wiktionary.org/wiki/bubbe-meise>) about PDF.

>PDF is a vector graphics format. There is no such thing as "word"
>there. There are only functions to paint a string of 1 or more
>characters at given page offset with given font. You get the idea.

This is simply NOT TRUE about the PDF file format. PDF supports a very rich semantic layer called "Tagged PDF" that has been part of the language for more than a decade now (since PDF 1.4).

However, it is true that many PDFs are created without this semantic richness, which makes extraction difficult. In that case, as you recommend, the original source is probably the best thing to work with, since the semantics are still present there.

Leonard
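
To make the contrast concrete, here is a minimal sketch of how Tagged PDF wraps the "paint a string" operators in marked-content sequences (`BDC` ... `EMC`) that carry a structure tag and an MCID. Everything in it is an assumption for illustration: the `CONTENT` stream is a hand-written, hypothetical fragment, `tagged_runs` is a made-up helper rather than a poppler API, and the regular expression only handles simple literal strings shown with `Tj` (no escapes, nested parentheses, `TJ` arrays, or font encodings). A real extractor would also walk the structure tree that the MCIDs point into.

```python
import re

# Hypothetical, hand-written page content stream (already decoded).
# Each BDC ... EMC pair is a marked-content sequence with a tag and an MCID.
CONTENT = b"""
/H1 <</MCID 0>> BDC
BT /F1 18 Tf 72 720 Td (Extracting) Tj ( text) Tj ET
EMC
/P <</MCID 1>> BDC
BT /F1 12 Tf 72 690 Td (Tagged PDF keeps ) Tj (structure.) Tj ET
EMC
"""

# Simplified tokenizer: start of a marked-content sequence, its end,
# or a literal string painted with the Tj operator.
TOKEN = re.compile(
    rb"/(?P<tag>\w+)\s*<<\s*/MCID\s+(?P<mcid>\d+)\s*>>\s*BDC"
    rb"|(?P<end>\bEMC\b)"
    rb"|\((?P<text>[^()\\]*)\)\s*Tj"
)


def tagged_runs(content: bytes):
    """Return a (tag, mcid, text) tuple for each marked-content sequence."""
    runs = []
    current = None  # (tag, mcid, pieces) while inside BDC ... EMC
    for m in TOKEN.finditer(content):
        if m.group("tag"):
            current = (m.group("tag").decode(), int(m.group("mcid")), [])
        elif m.group("end"):
            if current is not None:
                tag, mcid, pieces = current
                runs.append((tag, mcid, "".join(pieces)))
            current = None
        elif current is not None:
            # A painted string inside a tagged sequence: keep it with its tag.
            current[2].append(m.group("text").decode("latin-1"))
    return runs


if __name__ == "__main__":
    for tag, mcid, text in tagged_runs(CONTENT):
        print(tag, mcid, repr(text))
```

Without the `BDC`/`EMC` wrappers the same stream would only yield four unrelated string fragments, which is the situation the quoted message describes; with them, the fragments group into a heading and a paragraph.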
