Re: Puzzling PDF

F.R. Sun, 16 Feb 2014 14:42:05 -0800

On 02/16/2014 05:29 PM, Emile van Sebille wrote:

On 2/16/2014 6:00 AM, F.R. wrote:
Hi all,
Struggling to parse bank statements unavailable in sensible
data-transfer formats, I use pdftotext, which solves part of the
problem. The other day I encountered a strange thing: one single
figure out of many was erroneously converted into letters. Adobe Reader
displays the figure 50'000 correctly, but pdftotext makes it into
"SO'OOO" (The letters "S" as in Susan and "O" as in Otto). One would
expect such a mistake from an OCR. However, the statement is not a scan,
but is made up of text. Because malfunctions like this put a damper on
the hope to ever have a reliable reader that doesn't require
time-consuming manual verification, I played around a bit and ended up
even more confused: When I lift the figure off the Adobe display (mark,
copy) and paste it into a Python IDLE window, it is again letters (ascii
83 and 79), when on the Adobe display it shows correctly as digits. How
can that be?
I've also gotten inconsistent results using various PDF-to-text converters[1], but getting an explanation for pdftotext's failings here isn't likely to happen. I'd first try Google Docs' online conversion tool to see if you get better results. If you're lucky it'll do the job and you'll have confirmation that better tools exist. Otherwise, I'd look for an alternate way of getting the bank info than working from the PDF statement. At one site I've scripted Firefox to access the bank's web-based inquiry to retrieve the new activity overnight and use that to complete a daily bank reconciliation.
HTH,

Emile
[1] I wrote my own once to get data out of a particularly gnarly EDI specification PDF.


Emile, thanks for your response. Thanks to Roy Smith and Alister, too.

pdftotext has been working just fine. So much so that this freak incident is all the more puzzling. It smacks of an OCR error, but where does OCR come in, I wonder. I certainly suspected that the font I was looking at had fives and zeroes identical to esses and ohs, respectively, but the suspicion didn't hold up to scrutiny. I attach a little screen shot: At the top, the way it looks on the statement. Next, two words marked with the mouse. (One single marking, doesn't color the space.) Ctrl-C puts both words on the clipboard. Ctrl-V drops them into the Python IDLE window between the quotation marks. Lo and behold: they're clearly different! A little bit of code around them displays the ASCII codes. Isn't that interesting?
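
The check described above can be reproduced with a sketch along these lines. It is not the code from the screen shot; the two strings are typed in by hand from the description in this thread, and `describe` is just an illustrative helper name.

```python
def describe(text):
    """Return (character, code point) pairs for every character in text."""
    return [(ch, ord(ch)) for ch in text]

# Assumed inputs, copied from the thread rather than from the actual PDF.
pasted = "SO'OOO"      # what copy-and-paste (and pdftotext) delivered
expected = "50'000"    # what Adobe Reader shows on screen

for label, text in (("pasted", pasted), ("expected", expected)):
    print(label, describe(text))

# True if the two strings differ anywhere, i.e. the glyphs on screen
# and the characters stored behind them are not the same thing.
print(pasted != expected)
```

Comparing code points rather than eyeballing the glyphs takes the font out of the picture: whatever the characters look like, the numbers show what is actually stored.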


No matter. You're both right. There are alternatives. The best would be to get the data in a CSV format. Alas, I am so lightweight a client that banks don't even bother to find out what I am talking about.

I know how to access web pages programmatically, but haven't gotten around to dealing with password-protected log-ins and to sending such data as one writes into templates interactively.
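
For reference, a minimal standard-library sketch of that kind of log-in might look like the following. Everything specific in it is an assumption: the URL is made up, the form field names `username` and `password` are guesses, and a real bank site will likely need more (hidden form fields, tokens, JavaScript). The request is only built here, never sent.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical value: a real bank's log-in URL will differ.
LOGIN_URL = "https://bank.example/login"

def build_login_request(username, password, url=LOGIN_URL):
    """Build (but do not send) a POST request carrying the log-in form fields."""
    form = {"username": username, "password": password}  # assumed field names
    data = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(url, data=data, method="POST")

def make_session():
    """Return an opener that keeps cookies, so a log-in carries over to later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

request = build_login_request("frederic", "secret")
opener, jar = make_session()
# opener.open(request) would submit the form; it is left out because the URL is made up.
```

The cookie jar is the part that stands in for the interactive session: once the log-in response sets a session cookie, later requests through the same opener send it back automatically.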


Frederic

<<attachment: pdf-weirdness.gif>>

-- 
https://mail.python.org/mailman/listinfo/python-list
