Why not just build a PDF interpreter? This is meant as a real question,
not a criticism.
The thing with PDFs is that so much high-quality information is locked up
in them and we have no reliable way to extract it.
The reason seems to be that until the program runs (in a PDF interpreter),
no one but the author can be sure what it will "say".
Sure, not all PDF documents contain subtle code, but the sad fact is,
the output could be the result of an arbitrary processing chain. It
seems like this ongoing, somewhat tortured problem could be settled once
and for all by building an interpreter. Accumulate the output of
"show" and all its variations. That's your text, no? Then keep state on
the x/y positioning information for all fragments.
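
A minimal sketch of that idea in Python, under heavy assumptions. In PDF content streams the "show" family is spelled Tj, TJ, ' and ", and positioning comes from Td, TD, Tm, T* and TL. The names TOKEN, unescape and extract_fragments are made up for illustration. The sketch assumes an already-decompressed content stream, unscaled and unrotated text matrices, and strings that are plain text; it ignores fonts and glyph widths, so x does not advance after a string is shown.

```python
import re

# Rough tokenizer for a decompressed content stream (illustrative only).
TOKEN = re.compile(r"""
    \((?:\\.|[^\\()])*\)       # literal string, no nested parentheses
  | [\[\]]                     # array delimiters
  | /[^\s\[\]()/]*             # name operand such as /F1
  | [-+]?(?:\d+\.?\d*|\.\d+)   # number
  | [A-Za-z'"*]+               # operator such as Tj, TJ, Td or T*
""", re.VERBOSE)


def unescape(raw):
    """Drop the outer parentheses; resolve only escaped parens and backslashes."""
    return re.sub(r"\\([()\\])", r"\1", raw[1:-1])


def extract_fragments(stream):
    """Return (x, y, text) for every text-showing operator in the stream."""
    fragments = []
    operands = []
    array = None               # collects operands while inside [ ... ]
    x = y = 0.0                # current text position
    line_x = line_y = 0.0      # start of the current line
    leading = 0.0              # line spacing used by T*, ' and "

    for match in TOKEN.finditer(stream):
        tok = match.group()
        target = operands if array is None else array
        if tok == "[":
            array = []
        elif tok == "]":
            operands.append(array)
            array = None
        elif tok[0] == "(":
            target.append(unescape(tok))
        elif tok[0] == "/":
            target.append(tok)
        elif tok[0] in "+-.0123456789":
            target.append(float(tok))
        else:
            # Anything else is treated as an operator acting on the operands.
            if tok == "BT":
                x = y = line_x = line_y = 0.0
            elif tok in ("Td", "TD"):
                tx, ty = operands[-2:]
                if tok == "TD":
                    leading = -ty
                line_x, line_y = line_x + tx, line_y + ty
                x, y = line_x, line_y
            elif tok == "Tm":
                # Assumption: only the translation part (e, f) matters.
                line_x, line_y = operands[-2:]
                x, y = line_x, line_y
            elif tok == "TL":
                leading = operands[-1]
            elif tok in ("T*", "'", '"'):
                line_y -= leading
                x, y = line_x, line_y

            if tok in ("Tj", "'", '"'):
                fragments.append((x, y, operands[-1]))
            elif tok == "TJ":
                # Numbers inside a TJ array are kerning adjustments; skipped here.
                text = "".join(s for s in operands[-1] if isinstance(s, str))
                fragments.append((x, y, text))
            operands = []
    return fragments


if __name__ == "__main__":
    demo = "BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td [(Wor) -20 (ld)] TJ ET"
    for fragment in extract_fragments(demo):
        print(fragment)
```

On the demo stream this yields two fragments, "Hello" at (72, 720) and "World" at (72, 706). Everything the sketch assumes away (font encodings, glyph widths, scaled or rotated matrices, nested content) would have to be handled before the output could be trusted on real documents.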
I am sure this won't work or someone would have done it. But why won't
it work?