Hi, I am using your pdftotext program to extract text from a large number of PDF files. Unfortunately, some words get extra spaces between the characters. For example, in one PDF file the word “Wasserberg” appears as “W a s s e r b e r g”.
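To make the symptom concrete, here is a minimal post-processing sketch in Python that collapses such runs after extraction. It is only a workaround under stated assumptions, not a fix in Poppler itself: the function name `collapse_spaced_letters` and the threshold of four letters are illustrative choices, and it assumes the affected words come out as single letters separated by exactly one space.

```python
import re

# A run of four or more single letters, each separated by exactly one space,
# bounded by whitespace or the start/end of the text.
# The four-letter minimum is an assumed threshold to avoid touching short
# sequences of genuine one-letter words.
SPACED_RUN = re.compile(r"(?<!\S)(?:[^\W\d_] ){3,}[^\W\d_](?!\S)")


def collapse_spaced_letters(text: str) -> str:
    """Join runs like 'W a s s e r b e r g' back into 'Wasserberg'."""
    return SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)


if __name__ == "__main__":
    print(collapse_spaced_letters("the W a s s e r b e r g report"))
```

A heuristic like this cannot tell where one spaced-out word ends and the next begins, so two adjacent affected words would be merged into one. That is why a correct extraction would still be preferable.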
However, if I copy and paste the same text from Acrobat Reader, it is formatted more correctly. The following image explains this better: http://bbh-001.boitho.com/div/pdf_space_bug/Text.png (notice how the yellow text has extra spaces in it in the Poppler version). Any thoughts on how I can extract more correct text?

An example PDF that gets converted like this is available at http://bbh-001.boitho.com/div/pdf_space_bug/example.pdf . It isn’t perfect, because we had to make some manual modifications to remove confidential information, but it shows the symptoms correctly. The text comes from a scan that was then OCRed using ABBYY. I am using Poppler 0.22.4.

Best regards,
Runar Buvik
CTO, Searchdaimon AS
http://www.searchdaimon.com/
