On 14/9/15 16:40, Rob Hawkins wrote:
Thank you all for these great replies.  I find the stuff about the
unicode encoding order really interesting.  And I too wish we could find
more information about the as-yet unmapped Asian scripts.

I was mistaken about the output of PDF.js.  I thought I had viewed the
HTML source and seen good data, how exciting!  Yet now that I double
check, I see it is just the viewer that is correct, and the source text
is garbled just like pdftotext etc.

I'm bummed there is no magic solution here as I thought I had found, but
glad to see people are still interested in this.  If I find out how to
implement these languages, I will try.

I think what you're looking for is the ActualText feature in PDF. If this is present, a viewer or text-extraction tool can use it to provide the correct text, instead of trying to reconstruct the text from the stream of glyphs in the PDF -- which, while it often works OK for European languages and similar "simple" writing systems, is pretty much doomed to failure for complex South/Southeast Asian scripts, etc.

But this is dependent on the PDF-generating tool or workflow including the correct ActualText attributes in the first place. In my (very limited) experience, this is pretty rare.
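
For illustration, here is a minimal Python sketch of what a generating tool would have to emit: a marked-content sequence whose property list carries ActualText as a UTF-16BE hex string with a byte order mark. The function name and the `glyph_ops` argument are hypothetical; `glyph_ops` stands in for whatever text-showing operators the tool already writes, and a real generator also has to handle fonts, escaping and structure tagging, which this ignores.

```python
def actual_text_span(text: str, glyph_ops: str) -> str:
    """Wrap glyph-drawing operators in a /Span that carries ActualText.

    `glyph_ops` is a hypothetical placeholder for the operators that
    actually paint the glyphs (for example "(...) Tj").
    """
    # PDF text strings may be UTF-16BE, prefixed with the BOM FE FF.
    hex_text = (b"\xfe\xff" + text.encode("utf-16-be")).hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC\n{glyph_ops}\nEMC"


# Khmer KA (U+1780) followed by VOWEL SIGN E (U+17C1), in Unicode order.
fragment = actual_text_span("\u1780\u17c1", "(xx) Tj")
print(fragment)
```

An extraction tool that honours ActualText can then return the string in the property list rather than guessing from the glyphs between BDC and EMC.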

JK

> Alternatively, can we band
> together to destroy PDFs everywhere?  If we work in concert it may be
> possible. =)

Thanks again,

Rob

On Mon, Sep 14, 2015 at 9:22 PM, suzuki toshiya
<[email protected] <mailto:[email protected]>> wrote:

    Dear Rob,

    Poppler extracts the text from a PDF via the series of glyphs.
    Therefore, for scripts whose characters Unicode encodes in
    visible order, the first step of the text extraction is
    possible.

    However, some Asian scripts, especially Brahmic-based scripts,
    have very complicated layout rules, so, the encoding order
    in Unicode text is phonetic and different from the visible
    order (e.g. coded characters are in consonant-then-vowel order,
    but the displayed characters are in vowel-then-consonant order).

    In such a case, the character series extracted via the glyph series
    is not good coded text.
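
The ordering problem described above can be sketched in Python. This is only an illustration under strong assumptions, not what Poppler does: the function name is hypothetical, the table lists just a few pre-base vowel signs, and it assumes each such vowel is immediately followed by a single base consonant, ignoring subscript consonants, medials and two-part vowels.

```python
# Vowel signs drawn to the LEFT of the consonant but encoded AFTER it.
PREBASE_VOWELS = {
    "\u17c1",  # KHMER VOWEL SIGN E
    "\u17c2",  # KHMER VOWEL SIGN AE
    "\u17c3",  # KHMER VOWEL SIGN AI
    "\u1031",  # MYANMAR VOWEL SIGN E
}


def visual_to_logical(glyph_order: str) -> str:
    """Move each pre-base vowel after the character that follows it."""
    chars = list(glyph_order)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in PREBASE_VOWELS:
            # Swap vowel and consonant, then skip past the pair.
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)


# Khmer: the glyph stream shows the vowel first; Unicode wants KA then E.
khmer = visual_to_logical("\u17c1\u1780")
# Thai: SARA E (U+0E40) is already encoded before the consonant.
thai = visual_to_logical("\u0e40\u0e01")
```

Thai passes through unchanged because its leading vowels are not in the table, which matches the observation that Thai extraction works while the other scripts do not.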

    I'm not sure which script you assume for Indonesian (Latin?
    Javanese? Balinese?), but, among Thai, Burmese, Khmer scripts,
    only Thai script is coded in visible order. The other scripts
    have the vowel-then-consonant encoding issue, so it is not easy
    for Poppler to extract the text as correct "Unicode" text.
    Therefore, the result you have (Thai is OK, others are not)
    sounds reasonable.

    I'm unfamiliar with the bleeding-edge technology in the latest
    PDF about how to deal with such complex scripts (I guess PDF
    developers are willing to support such), but the PDFs made
    by old PDF production software may have a similar problem.

    I wish some Adobe experts would comment on the situation in the
    latest PDF for complex scripts :-)

    Regards,
    mpsuzuki

    Rob Hawkins wrote:
     > Greetings all,
     >
     > Can pdftohtml produce output for Burmese, Khmer, Indonesian, Thai and
     > Vietnamese?  I didn't see a language pack for any except Thai, and that one
     > doesn't produce properly formatted characters for my source files.  They're
     > missing the vowel marks.  The other languages fail completely on my setup.
     > I've tried on OS X and Ubuntu 12.
     >
     > My source files are here:
     > https://github.com/robhawkins/drive-taiwan/tree/master/input/pdf
     >
     > Chinese seems to work fine.
     >
     > I found out that PDF.js will produce good output, though I already have
     > code based on pdftohtml output and would rather not switch if not
     > necessary.  I wonder if there is something wrong with my setup.
     >
     > Thanks for any help even if it's just a "nope, that's not possible" kind of
     > reply =)
     >
     > Rob
     >
     >
     >
     >
    ------------------------------------------------------------------------
     >
     > _______________________________________________
     > poppler mailing list
     > [email protected] <mailto:[email protected]>
     > http://lists.freedesktop.org/mailman/listinfo/poppler




_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler

