Thank you all for these great replies. I find the explanation of Unicode encoding order really interesting. And I too wish we could find more information about the as-yet unmapped Asian scripts.
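
To make the encoding-order point concrete for anyone reading along, here is a minimal Python sketch. It is my own illustration, not anything taken from Poppler. It prints the stored order of one Thai and one Khmer syllable, and the helper visual_to_logical is a hypothetical toy that only handles a single left-side Khmer vowel sign.

```python
import unicodedata

# Stored (logical) order of one syllable per script.
THAI_KE = "\u0e40\u0e01"   # SARA E, then KO KAI: vowel first, matching the page
KHMER_KE = "\u1780\u17c1"  # KA, then VOWEL SIGN E: consonant first, vowel drawn on its left

# Khmer vowel signs drawn entirely to the left of their consonant (E, AE, AI).
KHMER_PREBASE_VOWELS = {"\u17c1", "\u17c2", "\u17c3"}


def code_point_names(text):
    """Return the Unicode character name of each code point, in stored order."""
    return [unicodedata.name(ch) for ch in text]


def visual_to_logical(glyph_order):
    """Toy fix-up: swap each left-side vowel sign with the character after it.

    Hypothetical helper for illustration only. It ignores consonant clusters,
    two-part vowels, and everything else a real reordering pass would need.
    """
    chars = list(glyph_order)
    out = []
    i = 0
    while i < len(chars):
        if chars[i] in KHMER_PREBASE_VOWELS and i + 1 < len(chars):
            out.extend([chars[i + 1], chars[i]])
            i += 2
        else:
            out.append(chars[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(code_point_names(THAI_KE))
    print(code_point_names(KHMER_KE))
    # Roughly what a glyph-by-glyph extractor would see for the Khmer syllable:
    seen_on_page = "\u17c1\u1780"
    print(visual_to_logical(seen_on_page) == KHMER_KE)
```

The Thai syllable is already stored vowel-first, so reading glyphs left to right gives valid text. The Khmer one is stored consonant-first, so the same left-to-right reading yields the two code points in the wrong order unless something reorders them.
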
I was mistaken about the output of PDF.js. I thought I had viewed the HTML source and seen good data, how exciting! Yet now that I double-check, I see it is just the viewer that is correct, and the source text is garbled just like the output of pdftotext etc. I'm bummed there is no magic solution here as I thought I had found, but glad to see people are still interested in this. If I find out how to implement these languages, I will try.

Alternatively, can we band together to destroy PDFs everywhere? If we work in concert it may be possible. =)

Thanks again,
Rob

On Mon, Sep 14, 2015 at 9:22 PM, suzuki toshiya <[email protected]> wrote:

> Dear Rob,
>
> Poppler extracts the text from PDF via the series of glyphs.
> Therefore, for the scripts whose characters Unicode encodes
> in visible order, the first step of the text extraction is
> possible.
>
> However, some Asian scripts, especially Brahmic-based scripts,
> have very complicated layout rules, so the encoding order
> in Unicode text is phonetic and different from the visible
> order (e.g. coded characters are in consonant-then-vowel order,
> but the displayed characters are in vowel-then-consonant order).
>
> In such a case, the character series extracted via the glyph
> series is not good coded text.
>
> I'm not sure which script you assume for Indonesian (Latin?
> Javanese? Balinese?), but among Thai, Burmese, and Khmer scripts,
> only Thai script is coded in visible order. The other scripts
> have the vowel-then-consonant encoding issue, so it is not easy
> for Poppler to extract the text as correct "Unicode" text.
> Therefore, the result you have (Thai is OK, others are not)
> sounds reasonable.
>
> I'm unfamiliar with the bleeding-edge technology in the latest
> PDF about how to deal with such complex scripts (I guess PDF
> developers are willing to support them), but the PDFs made
> by old PDF production software may have a similar problem.
>
> I wish some Adobe experts would mention the situation in the
> latest PDF for complex scripts :-)
>
> Regards,
> mpsuzuki
>
> Rob Hawkins wrote:
> > Greetings all,
> >
> > Can pdftohtml produce output for Burmese, Khmer, Indonesian, Thai and
> > Vietnamese? I didn't see a language pack for any except Thai, and that one
> > doesn't produce properly formatted characters for my source files. They're
> > missing the vowel marks. The other languages fail completely on my setup.
> > I've tried on OS X and Ubuntu 12.
> >
> > My source files are here:
> > https://github.com/robhawkins/drive-taiwan/tree/master/input/pdf
> >
> > Chinese seems to work fine.
> >
> > I found out that PDF.js will produce good output, though I already have
> > code based on pdftohtml output and would rather not switch if not
> > necessary. I wonder if there is something wrong with my setup.
> >
> > Thanks for any help even if it's just a "nope, that's not possible" kind of
> > reply =)
> >
> > Rob
> >
> > _______________________________________________
> > poppler mailing list
> > [email protected]
> > http://lists.freedesktop.org/mailman/listinfo/poppler
