[poppler] Extra spaces in text when using Poppler pdftotext

Runar Buvik Mon, 27 May 2013 05:34:47 -0700

Hi

I am using your pdftotext program to extract text from a large number
of PDF files. Unfortunately some words get extra spaces between the
characters. For example, in one PDF file the word “Wasserberg” appears
as “W a s s e r b e r g”.
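
To make the symptom concrete, here is a rough post-processing sketch in Python that collapses such runs after extraction. It is not something Poppler provides: the function name and the three-letter threshold are illustrative assumptions, and it only hides the problem rather than fixing the extraction itself.

```python
import re

# A run of three or more single letters separated by single spaces,
# e.g. "W a s s e r b e r g". The threshold of three is an arbitrary
# assumption chosen so that pairs such as "a b" are left alone.
_SPACED_RUN = re.compile(r"(?<!\S)(?:[^\W\d_] ){2,}[^\W\d_](?!\S)")


def collapse_spaced_letters(text: str) -> str:
    """Join runs of space-separated single letters into one word."""
    return _SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)


if __name__ == "__main__":
    print(collapse_spaced_letters("the word W a s s e r b e r g appears"))
```

A heuristic like this has clear limits: it would also merge genuine single-letter tokens (for example "I a m"), and two letter-spaced words in a row would be fused into one, because the original word boundary is no longer visible in the output.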


However, if I copy and paste the same text from Acrobat Reader, it is
more correctly formatted.

The following image explains this better:
http://bbh-001.boitho.com/div/pdf_space_bug/Text.png (notice how the
yellow text has extra spaces in it in the Poppler version).

Any thoughts on how I can extract the text more correctly?



An example PDF that gets converted like this is available at
http://bbh-001.boitho.com/div/pdf_space_bug/example.pdf . It isn’t
perfect because we had to make some manual modifications to remove
confidential information, but it shows the symptoms correctly.

The text comes from a scan that was then OCRed using ABBYY. I am using
Poppler 0.22.4.


Best regards
Runar Buvik
CTO Searchdaimon As
http://www.searchdaimon.com/
