https://bugs.documentfoundation.org/show_bug.cgi?id=158329

--- Comment #11 from David Huggins-Daines <[email protected]> ---
(In reply to ⁨خالد حسني⁩ from comment #10)
> instead have 2+
> glyphs mapped to 2+ characters which requires /ActualText which in turn is
> badly supported in PDF readers and lead to this and the duplicate bug.

Hi!  Thank you for tracking down this problem!

In the case of the duplicate bug (#161514) I am not convinced that, as you say,
"The PDF has valid character data".  The problem there is that the character
<02> is not mapped to anything in the ToUnicode CMap:

(content stream)
/Span<</ActualText<FEFF0078030C>>>
BDC
1 0 0 1 128.8 668.1 Tm
/F1 72 Tf[<01>243<02>]TJ
EMC
(ToUnicode CMap)
2 beginbfchar
<01> <0078030C>
<03> <0075>
endbfchar
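
To make the gap concrete, here is a minimal Python sketch that looks up the two character codes from the TJ operator in the two bfchar entries shown above. It is only an illustration: the names TOUNICODE, TJ_CODES and decode are made up for this example and are not part of pdfminer, pdfplumber or LibreOffice, and the table is typed in by hand rather than parsed from a real PDF.

```python
# ToUnicode entries copied from the bfchar block above: code -> UTF-16BE hex.
# (Hypothetical hand-written table, not parsed from a file.)
TOUNICODE = {0x01: "0078030C", 0x03: "0075"}

# Character codes from the TJ operator above; the 243 kerning adjustment
# is not a character code and is left out.
TJ_CODES = [0x01, 0x02]


def decode(code):
    """Return the Unicode string mapped to a character code, or None if unmapped."""
    hex_value = TOUNICODE.get(code)
    if hex_value is None:
        return None
    return bytes.fromhex(hex_value).decode("utf-16-be")


decoded = [decode(code) for code in TJ_CODES]
unmapped = [code for code, text in zip(TJ_CODES, decoded) if text is None]

print(decoded)   # first entry is "x" followed by U+030C, second is None
print(unmapped)  # the codes with no ToUnicode entry
```

Code <01> carries both code points of the cluster, while <02> comes back with nothing, so a reader that relies on the ToUnicode CMap alone has no text to attach to the second glyph.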

While it's true that the PDF 1.7 spec doesn't specifically say that all
character codes in a font have to be defined in the ToUnicode CMap, instead
providing this extremely helpful suggestion:

> If these methods fail to produce a Unicode value, there is no way to
> determine what the character code represents in which case a conforming
> reader may choose a character code of their choosing.

...one would hope that we can do better, given that we do actually know what
the Unicode characters are and *exactly* which characters in the text object
they are mapped to.  I understand that it's necessary for rendering purposes to
group them in grapheme clusters, but this isn't really the purpose of ToUnicode
CMaps.

The problem with /ActualText (aside from not being supported by any PDF readers
except Acrobat...) is that there's no way to tell which characters in the
/ActualText correspond to which characters in the text object, which becomes an
issue for layout analysis and low-level text extraction in libraries like
pdfminer/pdfplumber.  I'm looking at implementing support for it there and this
is a real stumbling block.
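
As a rough illustration of that ambiguity, the sketch below enumerates every way the /ActualText string from the example above could be split, in order, across the two glyphs in the text object. This is an assumption about how an extractor might frame the problem, not how pdfminer or pdfplumber actually work, and candidate_alignments is a hypothetical helper name.

```python
from itertools import combinations_with_replacement


def candidate_alignments(actual_text, glyph_count):
    """Yield every split of actual_text into glyph_count ordered, contiguous,
    possibly empty pieces (one piece per glyph)."""
    n = len(actual_text)
    for cuts in combinations_with_replacement(range(n + 1), glyph_count - 1):
        bounds = (0, *cuts, n)
        yield [actual_text[a:b] for a, b in zip(bounds, bounds[1:])]


# /ActualText from the span above; the leading FEFF is the byte order mark,
# which the "utf-16" codec consumes.
actual_text = bytes.fromhex("FEFF0078030C").decode("utf-16")

# Two glyphs in the text object: <01> and <02>.
alignments = list(candidate_alignments(actual_text, 2))

for pieces in alignments:
    print([[f"U+{ord(ch):04X}" for ch in piece] for piece in pieces])
```

Even for two code points and two glyphs there is more than one candidate, and nothing in the marked-content span says which one is intended. The number of candidates grows quickly for longer spans.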
