Re: [poppler] Extra spaces in text when using Poppler pdftotext

Leonard Rosenthol Wed, 29 May 2013 09:34:51 -0700

On 5/29/13 12:13 PM, "Ihar `Philips` Filipau" <[email protected]> wrote:
>There is no 100% reliable way to extract information from PDF.


This is a MUCH TOO COMMON "bubbe-meise"
(<http://en.wiktionary.org/wiki/bubbe-meise>) about PDF.


>PDF is a vector graphics format. There is no such thing as "word"
>there. There are only functions to paint a string of 1 or more
>characters at given page offset with given font. You get the idea.

This is simply NOT TRUE about the PDF file format.  PDF supports a very
rich semantic layer called "Tagged PDF" that has been part of the language
for more than a decade now (since PDF 1.4).
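
As a rough illustration, a Tagged PDF declares itself in the document
catalog with a /MarkInfo dictionary whose /Marked entry is true, plus a
/StructTreeRoot entry pointing at the structure tree.  The sketch below
looks for those two markers in the raw bytes.  The function name
looks_tagged is made up for this example, and the check is only a
heuristic: it assumes the catalog is stored uncompressed with /MarkInfo
written inline, so a file that keeps the catalog in a compressed object
stream, or references /MarkInfo indirectly, will be reported as untagged
even when it is not.

```python
import re


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Heuristic check for Tagged PDF markers in raw, uncompressed PDF bytes.

    Hypothetical helper for illustration only. It does not parse the file;
    it merely searches for the two catalog entries a Tagged PDF carries.
    """
    # The catalog of a tagged document points at a structure tree.
    has_struct_tree = b"/StructTreeRoot" in pdf_bytes
    # ...and carries an inline /MarkInfo dictionary with /Marked true.
    is_marked = re.search(
        rb"/MarkInfo\s*<<[^>]*/Marked\s+true", pdf_bytes
    ) is not None
    return has_struct_tree and is_marked


tagged_catalog = (
    b"<< /Type /Catalog /MarkInfo << /Marked true >> /StructTreeRoot 5 0 R >>"
)
plain_catalog = b"<< /Type /Catalog /Pages 1 0 R >>"
```

For a file on disk one would pass in open(path, "rb").read().  A real
PDF library that resolves indirect objects and object streams is the
more reliable way to do this check.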

However, it is true that many PDFs are created without this semantic
richness, which leads to difficulties in extraction.  And in that case, as
you recommend, the original source is probably best to work with since the
semantics are still present.
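
To make the difficulty concrete: without tags, an extractor only sees
strings painted at positions, so it has to guess word boundaries from the
gaps between them.  The sketch below is NOT Poppler's actual algorithm.
It is a minimal illustration with made-up names (join_runs, gap_ratio)
and an assumed input shape of (x_start, x_end, text) per painted string
on a single baseline.  A space is emitted whenever the horizontal gap
exceeds a fraction of the font size.

```python
from typing import List, Tuple

# One painted string on a baseline: (x_start, x_end, text), in page units.
# This input shape is an assumption made for the example.
Run = Tuple[float, float, str]


def join_runs(runs: List[Run], font_size: float, gap_ratio: float = 0.25) -> str:
    """Join positioned strings, guessing spaces from horizontal gaps."""
    pieces = []
    prev_end = None
    for x_start, x_end, text in sorted(runs):
        # Insert a space when the gap to the previous run is "large enough".
        if prev_end is not None and x_start - prev_end > gap_ratio * font_size:
            pieces.append(" ")
        pieces.append(text)
        prev_end = x_end
    return "".join(pieces)


# "Hel" and "lo" are separated only by a small kerning gap of 0.5 units.
runs = [(0.0, 10.0, "Hel"), (10.5, 20.0, "lo"), (25.0, 40.0, "world")]
loose = join_runs(runs, font_size=10.0)                   # threshold 2.5
strict = join_runs(runs, font_size=10.0, gap_ratio=0.04)  # threshold 0.4
```

With the default threshold the kerning gap is ignored and the runs join
into "Hello world".  With the tighter threshold the same gap is taken for
a word break and the result is "Hel lo world", which is the kind of extra
space this thread is about.  No single threshold is right for every font
and every producer.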


Leonard


_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
