In article <mailman.7056.1392559276.18130.python-l...@python.org>, "F.R." <anthra.nor...@bluewin.ch> wrote:
> Hi all,
>
> Struggling to parse bank statements unavailable in sensible
> data-transfer formats, I use pdftotext, which solves part of the
> problem. The other day I encountered a strange thing, when one single
> figure out of many erroneously converted into letters. Adobe Reader
> displays the figure 50'000 correctly, but pdftotext makes it into
> "SO'OOO" (The letters "S" as in Susan and "O" as in Otto). One would
> expect such a mistake from an OCR. However, the statement is not a scan,
> but is made up of text. Because malfunctions like this put a damper on
> the hope to ever have a reliable reader that doesn't require
> time-consuming manual verification, I played around a bit and ended up
> even more confused: When I lift the figure off the Adobe display (mark,
> copy) and paste it into a Python IDLE window, it is again letters (ascii
> 83 and 79), when on the Adobe display it shows correctly as digits. How
> can that be?
>
> Frederic

Maybe it's an intentional effort to keep people from screen-scraping
data out of the PDFs (or perhaps to trace when they do). Is it possible
the document includes a font where those codepoints are drawn exactly
the same as the digits they resemble?

Keep in mind that PDF is not a data transmission format, it's a
document format. When you try to scrape data out of a PDF, you've made
a pact with the devil.

Unclear what any of this has to do with Python. Maybe the tie-in is
that in the old Snake video game, the snake was drawn as Soooooo?

Anyway, it's S as in Sierra, and O as in Oscar.
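
If the cause really is a font that draws "S" and "O" like "5" and "0",
one possible workaround is to post-process the pdftotext output. Below
is a minimal Python sketch, not a definitive fix. It assumes that only
those two letters are affected and that amounts use apostrophe-grouped
thousands as in the 50'000 example; repair_amounts is a hypothetical
helper name, not part of pdftotext or any library.

```python
import re

# Assumption: only "S" and "O" stand in for "5" and "0", as in the report.
LOOKALIKES = str.maketrans({"S": "5", "O": "0"})

# Assumption: amounts use apostrophe-grouped thousands, e.g. 50'000.
# Requiring at least one '-group keeps ordinary words like "SOS" untouched.
GROUPED = re.compile(r"\b[0-9SO]{1,3}(?:'[0-9SO]{3})+\b")


def repair_amounts(text):
    """Map look-alike letters to digits, but only inside grouped amounts."""
    return GROUPED.sub(lambda m: m.group(0).translate(LOOKALIKES), text)


print(repair_amounts("Balance SO'OOO CHF, SOS fund 1'2O5"))
# prints: Balance 50'000 CHF, SOS fund 1'205
```

Amounts below 1'000 have no separator and are deliberately left alone
here, so this reduces manual verification rather than replacing it.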