Hi, could you let me know where I could download a sample?
Ihar `Philips` Filipau wrote:
> P.S. Okular (KDE 4.7.4) also showed that the embedded subset fonts are
> "Type 1C" and have the funny names:
> IKFZYK+MSTT31c39b00
> ILOQIT+MSTT31c38e00
> MBQOWW+MSTT31c38100

Please do not call them "funny" :-). Such names are quite common in
PDFs in which TrueType fonts were converted to PostScript Type 1 fonts
when embedded. I'm afraid the PDFs were generated without any
consideration for text extraction and, if my guess is right, even
Adobe products cannot extract the text.

Regards,
mpsuzuki

Ihar `Philips` Filipau wrote:
> Hi All!
>
> I have encountered another strange PDF document. When viewing it in
> graphical viewers like Okular/Evince/Reader/FoxIt - it looks totally
> fine.
>
> But when I extract the content using pdftotext or pdftohtml, the
> text is garbled.
>
> A little tinkering with the output showed that the ASCII characters
> appear to have been shifted by 29. E.g. '5' (0x35) became 0x18. I have
> applied a simple script to add 29 to the characters and can now read
> most of the text (except for the German umlauts; also some strange
> characters appear at the beginning of some lines).
>
> I gather my question would be: what should I fix in pdftohtml to make
> it print text properly?
>
>
> P.S. Okular (KDE 4.7.4) also showed that the embedded subset fonts are
> "Type 1C" and have the funny names:
> IKFZYK+MSTT31c39b00
> ILOQIT+MSTT31c38e00
> MBQOWW+MSTT31c38100

_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
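For readers who want to try the workaround from the original message
(adding 29 to each extracted character), here is a minimal Python
sketch. The script used in the thread was not posted, so the function
name `unshift` and the choice to leave line breaks untouched are
assumptions made for illustration. It is not a fix for pdftohtml and,
as the original message notes for the author's own script, a plain
shift like this does not recover the German umlauts.

```python
SHIFT = 29  # offset reported in the thread: '5' (0x35) was extracted as 0x18


def unshift(garbled: bytes, shift: int = SHIFT) -> str:
    """Add `shift` to every byte of the extracted text.

    Hypothetical helper. Line breaks (LF, CR) are passed through
    unchanged on the assumption that the extraction tool emitted them
    itself. This is ambiguous: a shifted apostrophe (0x27 - 29 = 0x0A)
    would be mistaken for a line feed.
    """
    out = []
    for b in garbled:
        if b in (0x0A, 0x0D):
            out.append(chr(b))  # keep line structure as extracted
        else:
            out.append(chr((b + shift) % 256))  # undo the observed offset
    return "".join(out)


if __name__ == "__main__":
    # The single data point given in the thread: 0x18 should read as '5'.
    print(unshift(bytes([0x18])))
```

To apply it to a whole file, the extracted output could be read in
binary mode and passed to `unshift`; the file name would depend on how
pdftotext was invoked.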
