Dear Rob,

Poppler extracts text from a PDF by walking the series of glyphs. For scripts whose Unicode encoding follows the visible order of the characters, this first step of text extraction therefore works.
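As a rough illustration (a Python sketch, not Poppler's actual code; the function name and the hand-written glyph lists are assumptions made for this example), naive extraction simply concatenates characters in the order their glyphs are painted. That matches the encoded text for a Thai syllable, but not for a Khmer syllable whose vowel sign is drawn to the left of its consonant:

```python
import unicodedata


def text_from_glyphs(glyphs_in_paint_order):
    """Naive extraction: join characters in the order their glyphs are painted."""
    return "".join(glyphs_in_paint_order)


def names(text):
    """Return the Unicode character names, for easier inspection."""
    return [unicodedata.name(ch) for ch in text]


# Thai: SARA E is drawn left of KO KAI, and Unicode also encodes it first.
thai_painted = ["\u0e40", "\u0e01"]   # assumed left-to-right paint order
thai_encoded = "\u0e40\u0e01"         # visible order == encoded order

# Khmer: VOWEL SIGN E is drawn left of KA, but Unicode encodes KA first.
khmer_painted = ["\u17c1", "\u1780"]  # assumed left-to-right paint order
khmer_encoded = "\u1780\u17c1"        # consonant-then-vowel (phonetic) order

thai_matches = text_from_glyphs(thai_painted) == thai_encoded
khmer_matches = text_from_glyphs(khmer_painted) == khmer_encoded

if __name__ == "__main__":
    print("Thai extracted :", names(text_from_glyphs(thai_painted)))
    print("Thai encoded   :", names(thai_encoded))
    print("Khmer extracted:", names(text_from_glyphs(khmer_painted)))
    print("Khmer encoded  :", names(khmer_encoded))
```

Here thai_matches is true and khmer_matches is false: the Khmer glyphs come out vowel first, while the coded text expects the consonant first.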
However, some Asian scripts, especially the Brahmic-derived ones, have very complicated layout rules, so the encoding order in Unicode text is phonetic and differs from the visible order (e.g. the coded characters are in consonant-then-vowel order, but the displayed characters are in vowel-then-consonant order). In such cases, the character series recovered from the glyph series is not correctly coded text.

I'm not sure which script you assume for Indonesian (Latin? Javanese? Balinese?), but among the Thai, Burmese and Khmer scripts, only Thai is coded in visible order. The other scripts have the vowel-then-consonant ordering issue, so it is not easy for Poppler to extract the text as correct "Unicode" text. Therefore, the result you have (Thai is OK, the others are not) sounds reasonable.

I'm unfamiliar with the bleeding-edge technology in the latest PDF for dealing with such complex scripts (I guess PDF developers are willing to support them), but PDFs made by old PDF production software may have similar problems. I wish some Adobe expert would comment on the situation for complex scripts in the latest PDF :-)

Regards,
mpsuzuki

Rob Hawkins wrote:
> Greetings all,
>
> Can pdftohtml produce output for Burmese, Khmer, Indonesian, Thai and
> Vietnamese? I didn't see a language pack for any except Thai, and that one
> doesn't produce properly formatted characters for my source files. They're
> missing the vowel marks. The other languages fail completely on my setup.
> I've tried on OS X and Ubuntu 12.
>
> My source files are here:
> https://github.com/robhawkins/drive-taiwan/tree/master/input/pdf
>
> Chinese seems to work fine.
>
> I found out that PDF.js will produce good output, though I already have
> code based on pdftohtml output and would rather not switch if not
> necessary. I wonder if there is something wrong with my setup.
>
> Thanks for any help even if it's just a "nope, that's not possible" kind of
> reply =)
>
> Rob
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> poppler mailing list
> [email protected]
> http://lists.freedesktop.org/mailman/listinfo/poppler
