https://bugs.documentfoundation.org/show_bug.cgi?id=158329
--- Comment #11 from David Huggins-Daines <[email protected]> ---
(In reply to خالد حسني from comment #10)
> instead have 2+
> glyphs mapped to 2+ characters which requires /ActualText which in turn is
> badly supported in PDF readers and lead to this and the duplicate bug.

Hi! Thank you for tracking down this problem!

In the case of the duplicate bug (#161514) I am not convinced that, as you say, "The PDF has valid character data". The problem there is that the character <02> is not mapped to anything in the ToUnicode CMap:

(content stream)
    /Span<</ActualText<FEFF0078030C>>> BDC
    1 0 0 1 128.8 668.1 Tm
    /F1 72 Tf[<01>243<02>]TJ
    EMC

(ToUnicode CMap)
    2 beginbfchar
    <01> <0078030C>
    <03> <0075>
    endbfchar

While it's true that the PDF 1.7 spec doesn't specifically say that all character codes in a font have to be defined in the ToUnicode CMap, instead providing this extremely helpful suggestion:

> If these methods fail to produce a Unicode value, there is no way to
> determine what the character code
> represents in which case a conforming reader may choose a character code of
> their choosing.

...one would hope that we can do better, given that we do actually know what the Unicode characters are and *exactly* which characters in the text object they are mapped to. I understand that it's necessary for rendering purposes to group them in grapheme clusters, but this isn't really the purpose of ToUnicode CMaps.

The problem with /ActualText (aside from not being supported by any PDF readers except Acrobat...) is that there's no way to tell which characters in the /ActualText correspond to which characters in the text object, which becomes an issue for layout analysis and low-level text extraction in libraries like pdfminer/pdfplumber. I'm looking at implementing support for it there and this is a real stumbling block.
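To make the mismatch concrete, here is a minimal Python sketch (standard library only) that checks the fragments quoted above. It is not how pdfminer, pdfplumber, or LibreOffice handle this. The function names are hypothetical, the regex parsing only covers `bfchar` entries, and it assumes one-byte character codes, as the `<01>` and `<02>` strings suggest.

```python
import re

# Fragments copied from the comment above.
TOUNICODE = "2 beginbfchar <01> <0078030C> <03> <0075> endbfchar"
TJ_OPERAND = "[<01>243<02>]"
ACTUAL_TEXT = "FEFF0078030C"

HEX_STRING = re.compile(r"<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap: str) -> dict[int, str]:
    """Map character codes to Unicode strings from bfchar sections only."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap, re.S):
        pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section)
        for src, dst in pairs:
            # Destination values are UTF-16BE code units.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def codes_in_tj(operand: str, code_bytes: int = 1) -> list[int]:
    """Collect character codes from the hex strings of a TJ array.

    Assumption: fixed-width codes of `code_bytes` bytes (1 here).
    Bare numbers such as 243 are kerning adjustments and are skipped.
    """
    codes = []
    for hexstr in HEX_STRING.findall(operand):
        if len(hexstr) % 2:  # PDF pads an odd-length hex string with 0
            hexstr += "0"
        raw = bytes.fromhex(hexstr)
        for i in range(0, len(raw), code_bytes):
            codes.append(int.from_bytes(raw[i:i + code_bytes], "big"))
    return codes


def unmapped_codes(cmap: str, operand: str) -> list[int]:
    """Codes that are drawn by the TJ operator but absent from the CMap."""
    mapping = parse_bfchar(cmap)
    return [code for code in codes_in_tj(operand) if code not in mapping]


def decode_actual_text(hexstr: str) -> str:
    """Decode an /ActualText hex string that starts with a UTF-16 BOM."""
    return bytes.fromhex(hexstr).decode("utf-16")


if __name__ == "__main__":
    print("codes drawn:", codes_in_tj(TJ_OPERAND))
    print("unmapped:", unmapped_codes(TOUNICODE, TJ_OPERAND))
    print("ActualText code points:",
          [f"U+{ord(ch):04X}" for ch in decode_actual_text(ACTUAL_TEXT)])
```

On these fragments the check reports code 2 as drawn but unmapped, while code 1 alone already maps to the full two-code-point sequence U+0078 U+030C. The /ActualText string decodes to the same two code points but carries no information about which of the two glyphs each one belongs to.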
