https://bugs.documentfoundation.org/show_bug.cgi?id=117428

--- Comment #28 from Jonathan Clark <[email protected]> ---
To update this bug, I briefly investigated the current state of text
extraction. I created a trivial Devanagari Writer document containing only
"नित्यानन्दकरी", exported it to PDF using our filter, and tested the result in
several readers:

Adobe Acrobat Reader now extracts the correct text. This is an improvement over
the original report.

Evince also extracts the correct text. The macOS Preview app crashed when I
tried to click on the text to select it, but using the keyboard I was able to
copy and paste the correct text.

Current stable Firefox (pdf.js) and Google Chrome do not seem to handle
ActualText at all. Both programs seem to substitute an index for any glyph that
has no ToUnicode mapping, whether or not ActualText is specified. I also tested
quick-and-dirty hacks that simulate ActualText per word, force ActualText for
every cluster, or use ActualText with no ToUnicode mappings; none of these
hacks improved the situation.
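
For readers unfamiliar with the mechanism, here is a rough sketch of what an
ActualText span looks like in a PDF content stream. This is only an
illustration, not the code in our export filter: the helper names and the glyph
IDs in the usage note are made up. It assumes the usual form, a /Span
marked-content sequence whose ActualText entry is a UTF-16BE hex string with a
leading byte order mark.

```python
def actual_text_hex(text: str) -> str:
    """Encode text as a PDF hex string: UTF-16BE with a leading BOM (FEFF)."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


def wrap_with_actual_text(text: str, show_ops: str) -> str:
    """Wrap text-showing operators in a /Span marked-content sequence.

    show_ops is whatever draws the glyphs (hypothetical here); a reader that
    honors ActualText should report `text` instead of mapping those glyphs.
    """
    return f"/Span <</ActualText {actual_text_hex(text)}>> BDC\n{show_ops}\nEMC"
```

Calling wrap_with_actual_text("नि", "<00A1 00B2> Tj") with those placeholder
glyph IDs yields a span whose ActualText is <FEFF0928093F>. Per-cluster and
per-word ActualText differ only in how much text and how many glyphs go inside
each BDC ... EMC pair.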

As noted above, ActualText per-word could have other benefits. Currently,
however, I don't think it would improve the text extraction situation. The
major blocker seems to be the readers that don't implement any ActualText
support at all, whether it's done per-word or per-cluster.
