re:text extraction

BassyBassyGoodBoy Thu, 28 Jul 2011 22:36:16 -0700

Why not just build a PDF interpreter? This is meant as a real question, not a criticism.

The thing with PDFs is, so much high quality information is locked up in them and we have no reliable way to extract it.

The reason seems to be that until the program (PDF interpreter) runs, no one who is not the author can be sure what it will "say".

Sure, not all PDF documents contain subtle code, but the sad fact is, the output could be the result of an arbitrary processing chain. It seems like this ongoing, somewhat tortured problem could be finished once and for all by building an interpreter. Accumulate the output of "show" and all its variations. That's your text, no? Then keep state on the x y positioning information for all fragments.
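To make the idea concrete, here is a rough sketch of what I mean, not a real interpreter. In PDF content streams the "show" variations are operators such as Tj and TJ, and positioning comes from operators such as Td and Tm. The function name extract_fragments and the token pattern are made up for illustration. The sketch assumes an already decompressed content stream, literal strings without nested parentheses, and no scaling or rotation in Tm. It ignores font encodings and glyph widths entirely, so each fragment is recorded at the start of its line rather than at its true position.

```python
import re

# Toy tokenizer for a simplified content stream (an assumption, not the full PDF syntax).
TOKEN = re.compile(r"""
    \((?:[^()\\]|\\.)*\)          # literal string, no nested parentheses
  | /[^\s/\[\]()<>]+              # name such as /F1
  | [\[\]]                        # array delimiters
  | [-+]?(?:\d+\.?\d*|\.\d+)      # number
  | [A-Za-z'"*]+                  # operator
""", re.VERBOSE)


def extract_fragments(stream):
    """Return a list of (x, y, text) for each Tj/TJ in a simplified stream."""
    fragments = []
    x = y = 0.0          # translation part of the text line matrix
    stack = []           # operands waiting for the next operator
    array = None         # collects operands while inside [ ... ]
    for match in TOKEN.finditer(stream):
        tok = match.group(0)
        if tok == "[":
            array = []
            continue
        if tok == "]":
            stack.append(array if array is not None else [])
            array = None
            continue
        if tok[0] == "(":
            # Only handles escaped parentheses and backslashes.
            value = re.sub(r"\\(.)", r"\1", tok[1:-1])
        elif tok[0] == "/":
            value = tok
        elif tok[0] in "+-.0123456789":
            value = float(tok)
        else:
            value = None  # anything else is treated as an operator
        if value is not None:
            (stack if array is None else array).append(value)
            continue
        if tok == "BT":
            x = y = 0.0
        elif tok == "Td" and len(stack) >= 2:
            x += stack[-2]
            y += stack[-1]
        elif tok == "Tm" and len(stack) >= 6:
            x, y = stack[-2], stack[-1]   # scale and rotation are ignored
        elif tok == "Tj" and stack and isinstance(stack[-1], str):
            fragments.append((x, y, stack[-1]))
        elif tok == "TJ" and stack and isinstance(stack[-1], list):
            text = "".join(item for item in stack[-1] if isinstance(item, str))
            fragments.append((x, y, text))
        stack.clear()    # every operator consumes its operands
    return fragments


if __name__ == "__main__":
    demo = "BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td [(Wor) -20 (ld)] TJ ET"
    for fragment in extract_fragments(demo):
        print(fragment)
```

On the demo stream this yields two fragments, "Hello" at (72, 720) and "World" at (72, 706). The strings come back as raw character codes, so whether they are readable text depends on the font's encoding, which this sketch does not look at.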

I am sure this won't work or someone would have done it. But why won't it work?
