Re: Puzzling PDF

Emile van Sebille Sun, 16 Feb 2014 08:31:23 -0800

On 2/16/2014 6:00 AM, F.R. wrote:

Hi all,


Struggling to parse bank statements unavailable in sensible
data-transfer formats, I use pdftotext, which solves part of the
problem. The other day I encountered a strange thing: one single
figure out of many was erroneously converted into letters. Adobe Reader
displays the figure 50'000 correctly, but pdftotext makes it into
"SO'OOO" (The letters "S" as in Susan and "O" as in Otto). One would
expect such a mistake from an OCR. However, the statement is not a scan,
but is made up of text. Because malfunctions like this put a damper on
the hope to ever have a reliable reader that doesn't require
time-consuming manual verification, I played around a bit and ended up
even more confused: When I lift the figure off the Adobe display (mark,
copy) and paste it into a Python IDLE window, it is again letters (ascii
83 and 79), when on the Adobe display it shows correctly as digits. How
can that be?

I've also gotten inconsistent results using various PDF-to-text converters[1], but getting an explanation for pdftotext's failings here isn't likely to happen. I'd first try Google Docs' online conversion tool to see if you get better results. If you're lucky it'll do the job and you'll have confirmation that better tools exist. Otherwise, I'd look for an alternate way of getting the bank info than working from the PDF statement. At one site I've scripted Firefox to access the bank's web-based inquiry to retrieve the new activity overnight and use that to complete a daily bank reconciliation.
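
Since copy and paste from Adobe Reader yields the same letters, the letters appear to sit in the PDF's text layer itself rather than being introduced by pdftotext. If you have to keep working from the PDF, a rough sketch of a post-processing check might look like the following. The names describe, repair_amount, LOOKALIKES and AMOUNT_SHAPE are made up for illustration, and the look-alike table and the apostrophe-grouped amount pattern are assumptions based only on the single "SO'OOO" example above, so treat it as a starting point and not as a reliable fix.

```python
import re
import unicodedata

# Hypothetical look-alike table, based only on the one observed case
# ("SO'OOO" for 50'000). Extend it only after inspecting real output.
LOOKALIKES = str.maketrans({"S": "5", "O": "0"})

# Assumed shape of an amount: one to three characters, then one or more
# groups of exactly three, separated by apostrophes. Look-alike letters
# are allowed wherever a digit would be.
AMOUNT_SHAPE = re.compile(r"^[0-9SO]{1,3}('[0-9SO]{3})+$")


def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, ord(ch), unicodedata.name(ch, "UNKNOWN")) for ch in text]


def repair_amount(token):
    """Map look-alike letters to digits, but only in amount-shaped tokens."""
    if AMOUNT_SHAPE.match(token):
        return token.translate(LOOKALIKES)
    return token  # leave ordinary words and other tokens untouched


if __name__ == "__main__":
    # Shows what was actually pasted: letters, not digits.
    for ch, code, name in describe("SO'OOO"):
        print(repr(ch), code, name)
    print(repair_amount("SO'OOO"))
```

Restricting the substitution to amount-shaped tokens keeps ordinary words made of the same letters from being rewritten, but any repaired figure still deserves a manual check against the displayed statement.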


HTH,

Emile

[1] I wrote my own once to get data out of a particularly gnarly EDI specification PDF.





--
https://mail.python.org/mailman/listinfo/python-list
