On 9/15/11 5:43 PM, "Dave" <[email protected]> wrote:

> In the file Gfx I read the commands and I have access to the string of
> character directly from those commands, the text is a parameter, of TJ or
> Tj,

That's a recipe for FAILURE! Most PDF documents in the real world do NOT
put readable text there. The values in the TJ/Tj are CIDs into the font!
You MUST use the font & encoding information to get the correct values.

> since all the pieces of text from the same paragraph are always between BT
> (begin text) and ET (end text) I can correctly extract the whole
> paragraph, so i dont need to made any guess or more complex process.

Again, that's FAR FROM reality in the majority of PDFs. I've seen numerous
examples where EACH WORD (or even each letter!) is in its own BT/ET block.

> The problem with this way is, sometimes instead of letters, I got some
> weird stuffs (it prints like a 2x2 table with numbers),

See the reason above. It's also a reason you need to get yourself a copy of
ISO 32000-1:2008.

Leonard
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
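
To illustrate the point about TJ/Tj operands, here is a minimal sketch in Python (not poppler code). It assumes a hypothetical subsetted font whose codes are fixed-width 2-byte values, and a hand-written table `TO_UNICODE` standing in for what would really be parsed from the font's /ToUnicode CMap or derived from its encoding. The names `TO_UNICODE` and `decode_show_text` are made up for this example, and real CMaps can define variable-length codes, which this sketch does not handle.

```python
# Hypothetical mapping for one embedded, subsetted font:
# 2-byte character code -> Unicode string. In a real PDF this would come
# from the font's /ToUnicode CMap stream or its encoding, not be hard-coded.
TO_UNICODE = {
    0x0001: "H",
    0x0002: "e",
    0x0003: "l",
    0x0004: "o",
}


def decode_show_text(operand, to_unicode, code_len=2):
    """Map the raw bytes of a Tj operand to text, one fixed-width code at a time."""
    chars = []
    for i in range(0, len(operand), code_len):
        code = int.from_bytes(operand[i:i + code_len], "big")
        # Codes with no mapping become U+FFFD rather than being guessed at.
        chars.append(to_unicode.get(code, "\ufffd"))
    return "".join(chars)


# Raw operand bytes of:  <00010002000300030004> Tj
raw = bytes.fromhex("00010002000300030004")

# Reading the string "directly" just yields control characters.
naive = raw.decode("latin-1")

# Going through the font's mapping recovers the text.
decoded = decode_show_text(raw, TO_UNICODE)
```

Under these assumptions `decoded` is the readable string while `naive` is a run of unprintable bytes, which is the kind of output that typically shows up as boxes with numbers in them.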
