On 14/9/15 16:40, Rob Hawkins wrote:
Thank you all for these great replies. I find the stuff about the
unicode encoding order really interesting. And I too wish we could find
more information about the as-yet unmapped Asian scripts.
I was mistaken about the output of PDF.js. I thought I had viewed the
HTML source and seen good data, how exciting! Yet now that I double
check, I see it is just the viewer that is correct, and the source text
is garbled just like pdftotext etc.
I'm bummed there is no magic solution here as I thought I had found, but
glad to see people are still interested in this. If I find out how to
implement these languages, I will try.
I think what you're looking for is the ActualText feature in PDF. If
this is present, a viewer or text-extraction tool can use it to provide
the correct text, instead of trying to reconstruct the text from the
stream of glyphs in the PDF -- which, while it often works OK for
European languages and similar "simple" writing systems, is pretty much
doomed to failure for complex South/Southeast Asian scripts, etc.
But this is dependent on the PDF-generating tool or workflow including
the correct ActualText attributes in the first place. In my (very
limited) experience, this is pretty rare.
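
As a rough illustration of what a tool could do when ActualText is
present, here is a minimal Python sketch. It assumes an already
decompressed content stream in which ActualText is written as a hex
string inside a marked-content property list. The function name
actual_texts and the sample stream are made up for illustration. Real
files may also carry ActualText as a literal string or in the structure
tree, which this sketch does not handle.

```python
import re
from typing import List

# Matches "/ActualText <hex digits>" (hex-string form only; an assumption
# of this sketch, literal "(...)" strings are ignored).
ACTUAL_TEXT_HEX = re.compile(rb"/ActualText\s*<([0-9A-Fa-f\s]+)>")


def actual_texts(decoded_stream: bytes) -> List[str]:
    """Return every hex-encoded ActualText value found in the stream."""
    texts = []
    for match in ACTUAL_TEXT_HEX.finditer(decoded_stream):
        # Whitespace is allowed inside PDF hex strings, so drop it first.
        hex_digits = b"".join(match.group(1).split()).decode("ascii")
        if len(hex_digits) % 2:
            hex_digits += "0"  # an odd final digit is padded with 0
        raw = bytes.fromhex(hex_digits)
        if raw.startswith(b"\xfe\xff"):
            # A leading FE FF byte-order mark means UTF-16BE text.
            texts.append(raw[2:].decode("utf-16-be"))
        else:
            # Latin-1 is only an approximation of PDFDocEncoding here.
            texts.append(raw.decode("latin-1"))
    return texts


# Hypothetical stream: the glyph codes <0102> are replaced by ActualText.
sample = b"/Span <</ActualText <FEFF178017C1>>> BDC <0102> Tj EMC"
print(actual_texts(sample))
```

When such a value exists, an extractor can emit it in place of whatever
it would otherwise guess from the glyphs between BDC and EMC.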
JK
> Alternatively, can we band
> together to destroy PDFs everywhere? If we work in concert it may be
> possible. =)
Thanks again,
Rob
On Mon, Sep 14, 2015 at 9:22 PM, suzuki toshiya
<[email protected]> wrote:
Dear Rob,
Poppler extracts the text from a PDF via the series of glyphs.
Therefore, for scripts whose characters Unicode encodes in
visible order, the first step of the text extraction is
possible.
However, some Asian scripts, especially Brahmic-based scripts,
have very complicated layout rules, so, the encoding order
in Unicode text is phonetic and different from the visible
order (e.g. coded characters are in consonant-then-vowel order,
but the displayed characters are in vowel-then-consonant order).
In such cases, the character series extracted via the glyph series
is not correctly ordered coded text.
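
To make the ordering problem concrete, here is a small Python sketch.
It is only an illustration under simplifying assumptions: the helper
name visual_to_logical is hypothetical, the table lists just two
pre-base vowel signs, and consonant clusters, medials and stacked
forms are ignored, so it is nowhere near a real reordering algorithm.

```python
# Vowel signs that are drawn to the LEFT of the consonant but stored
# AFTER it in Unicode text (a deliberately tiny sample).
PREBASE_VOWELS = {
    "\u17C1",  # KHMER VOWEL SIGN E
    "\u1031",  # MYANMAR VOWEL SIGN E
}


def visual_to_logical(glyph_order: str) -> str:
    """Move each pre-base vowel sign behind the character that follows it."""
    out = []
    pending = None  # a pre-base vowel waiting for its consonant
    for ch in glyph_order:
        if ch in PREBASE_VOWELS:
            if pending is not None:
                out.append(pending)  # two vowels in a row: give up on the first
            pending = ch
        else:
            out.append(ch)
            if pending is not None:
                out.append(pending)  # the vowel now follows its consonant
                pending = None
    if pending is not None:
        out.append(pending)  # trailing vowel with no consonant after it
    return "".join(out)


# Khmer: glyphs arrive as vowel E, then KA; Unicode wants KA, then vowel E.
khmer_visual = "\u17C1\u1780"
print(visual_to_logical(khmer_visual) == "\u1780\u17C1")

# Thai: SARA E (U+0E40) is encoded before the consonant, in visible order,
# so the glyph order is already the correct coded order.
thai_visual = "\u0E40\u0E01"
print(visual_to_logical(thai_visual) == thai_visual)
```

Without a step like this, the extracted character series follows the
glyph order, which is fine for Thai and wrong for Khmer and Burmese.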
I'm not sure which script you assume for Indonesian (Latin?
Javanese? Balinese?), but among the Thai, Burmese, and Khmer scripts,
only Thai is coded in visible order. The other scripts
have the vowel-then-consonant ordering issue, so it is not easy
for Poppler to extract the text as correct "Unicode" text.
Therefore, the result you have (Thai is OK, others are not)
sounds reasonable.
I'm unfamiliar with the bleeding-edge technology in the latest
PDF about how to deal with such complex scripts (I guess PDF
developers are willing to support them), but PDFs made
by old PDF production software may have similar problems.
I wish some Adobe experts would comment on the situation in the
latest PDF for complex scripts :-)
Regards,
mpsuzuki
Rob Hawkins wrote:
> Greetings all,
>
> Can pdftohtml produce output for Burmese, Khmer, Indonesian, Thai and
> Vietnamese? I didn't see a language pack for any except Thai, and that one
> doesn't produce properly formatted characters for my source files. They're
> missing the vowel marks. The other languages fail completely on my setup.
> I've tried on OS X and Ubuntu 12.
>
> My source files are here:
> https://github.com/robhawkins/drive-taiwan/tree/master/input/pdf
>
> Chinese seems to work fine.
>
> I found out that PDF.js will produce good output, though I already have
> code based on pdftohtml output and would rather not switch if not
> necessary. I wonder if there is something wrong with my setup.
>
> Thanks for any help even if it's just a "nope, that's not possible" kind of
> reply =)
>
> Rob
>
>
> _______________________________________________
> poppler mailing list
> [email protected]
> http://lists.freedesktop.org/mailman/listinfo/poppler
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler