dogbokchif opened a new pull request, #470:
URL: https://github.com/apache/pdfbox/pull/470

   ## Problem
   
   When multiple Unicode code points map to the same glyph, text extracted from 
a generated PDF may not match the character originally entered by the author.
   
   This commonly occurs in CJK fonts, where a CJK Unified Ideograph and its 
radical counterpart share a glyph. For example, in Noto Sans CJK, both 食 
(U+98DF) and ⻝ (U+2EDD, CJK RADICAL EAT ONE) map to the same glyph. As a 
result, rendering 食 in a PDF and extracting it with PDFTextStripper may 
incorrectly produce ⻝.
   
   ## Root Cause
   
   PDCIDFontType2Embedder.buildToUnicodeCMap() constructs the /ToUnicode map by 
reverse-looking up a glyph's code point using:
   ```java
   cmapLookup.getCharCodes(gid).get(0)
   ```
   However, getCharCodes() returns code points sorted in ascending order. 
Consequently, the first entry is always the lowest code point associated with 
the glyph. In the example above, U+2EDD is selected instead of U+98DF because 
it has the smaller value.
   
   As a result, the generated /ToUnicode mapping may point to a compatibility 
or radical character rather than the character that was actually used in the 
document. The existing comment ("use the first entry even for ambiguous 
mappings") already acknowledges this limitation.
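
The selection problem can be reproduced without PDFBox. The following is a minimal sketch, assuming only what is stated above: a reverse lookup that returns code points in ascending order. `sortedCharCodes` is a hypothetical stand-in for that lookup, not the FontBox method itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class LowestCodePointDemo {
    // Hypothetical stand-in for a reverse cmap lookup: returns every code point
    // mapped to one glyph, sorted in ascending order.
    static List<Integer> sortedCharCodes(List<Integer> codePointsForGlyph) {
        List<Integer> codes = new ArrayList<>(codePointsForGlyph);
        Collections.sort(codes);
        return codes;
    }

    public static void main(String[] args) {
        // The author typed U+98DF, but the glyph is also reachable from U+2EDD.
        List<Integer> codes = sortedCharCodes(Arrays.asList(0x98DF, 0x2EDD));
        int chosen = codes.get(0); // always the numerically smallest entry
        System.out.printf("chosen: U+%04X%n", chosen); // prints "chosen: U+2EDD"
    }
}
```

Because the choice depends only on numeric order, the result is the same regardless of which character the document contains.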
   
   This issue is not unique to PDFBox and has also been observed in 
[wkhtmltopdf](https://github.com/wkhtmltopdf/wkhtmltopdf/issues/4414)/[Qt](https://github.com/qt/qtbase/blob/389988c42f901f2d8f75a023039d641cf5fba9de/src/gui/text/qfontsubset.cpp#L200-L209),
 [Mozilla Firefox](https://bugzilla.mozilla.org/show_bug.cgi?id=1881196), and 
[Typst](https://github.com/typst/typst/issues/4582).
   
   ## Fix
   
   The code points actually used in the document are already tracked by 
TrueTypeEmbedder.addToSubset(int).
   
   This change builds a glyph → code point mapping from those recorded inputs 
and uses it when generating the /ToUnicode CMap. For glyphs associated with 
multiple code points, the first code point encountered in the document is 
preferred.
   
   The existing reverse cmap lookup is retained only as a fallback for glyphs 
that have no recorded input code point (for example, glyphs drawn directly by 
GID).
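
The lookup order described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this pull request: the font cmap is modelled as a plain `Map`, the glyph IDs are invented, and `gidToUsedCodePoint` and `resolve` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class UsedCodePointSketch {
    // Builds glyph ID -> code point from the code points recorded for the
    // document. putIfAbsent keeps the first code point seen for each glyph.
    static Map<Integer, Integer> gidToUsedCodePoint(Set<Integer> usedCodePoints,
                                                    Map<Integer, Integer> cmap) {
        Map<Integer, Integer> result = new LinkedHashMap<>();
        for (int codePoint : usedCodePoints) {
            Integer gid = cmap.get(codePoint);
            if (gid != null && gid != 0) { // skip unmapped code points and .notdef
                result.putIfAbsent(gid, codePoint);
            }
        }
        return result;
    }

    // Prefers the recorded code point; falls back to the first reverse-lookup
    // entry for glyphs with no recorded input. Returns -1 if nothing is known.
    static int resolve(int gid, Map<Integer, Integer> used,
                       Map<Integer, List<Integer>> reverseLookup) {
        Integer codePoint = used.get(gid);
        if (codePoint != null) {
            return codePoint;
        }
        List<Integer> codes = reverseLookup.get(gid);
        return (codes == null || codes.isEmpty()) ? -1 : codes.get(0);
    }

    public static void main(String[] args) {
        Map<Integer, Integer> cmap = new HashMap<>(); // invented glyph IDs
        cmap.put(0x98DF, 100);
        cmap.put(0x2EDD, 100);
        Map<Integer, List<Integer>> reverse = new HashMap<>();
        reverse.put(100, Arrays.asList(0x2EDD, 0x98DF)); // ascending order

        Set<Integer> used = new LinkedHashSet<>(Arrays.asList(0x98DF));
        Map<Integer, Integer> byGid = gidToUsedCodePoint(used, cmap);
        System.out.printf("U+%04X%n", resolve(100, byGid, reverse)); // prints "U+98DF"
    }
}
```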
   
   To ensure deterministic behavior, subsetCodePoints is changed from a HashSet 
to a LinkedHashSet, preserving insertion order and making the "first occurrence 
wins" rule stable.
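
As a small illustration of why insertion order matters here, the sketch below collects code points into a `LinkedHashSet` and reads them back. `distinctInOrder` is a hypothetical helper, not a PDFBox method. A `LinkedHashSet` iterates in insertion order, whereas `HashSet` makes no ordering guarantee.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class FirstOccurrenceOrder {
    // Returns the distinct code points in the order they were first added.
    static List<Integer> distinctInOrder(int... codePoints) {
        Set<Integer> seen = new LinkedHashSet<>();
        for (int codePoint : codePoints) {
            seen.add(codePoint); // re-adding an element does not change its position
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // The ideograph is used first, then the radical, then the ideograph again.
        for (int codePoint : distinctInOrder(0x98DF, 0x2EDD, 0x98DF)) {
            System.out.printf("U+%04X%n", codePoint);
        }
    }
}
```

With this ordering, a "first occurrence wins" pass over the set sees U+98DF before U+2EDD on every run.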
   
   This follows the same approach adopted by Typst to resolve the identical 
issue (typst/typst#4582, fixed in typst/typst#4585): prefer the code point that 
was actually used rather than reverse-mapping from the font cmap.
   
   ## Tests
   
   Two tests were added to TestFontEmbedding using Noto Sans CJK KR:
   
   #### testToUnicodePrefersUsedCodePoint
   
   Scans the font for any glyph shared by multiple printable code points 
(preferably a CJK ideograph/radical pair) and verifies that each code point 
round-trips through PDF generation and extraction as itself.
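
One way a scan could recognise an ideograph/radical pair is by Unicode block. The sketch below is an assumption about how such a check might look, not the test code from this pull request, and the class and method names are hypothetical. It relies only on `Character.UnicodeBlock` from the JDK.

```java
final class RadicalPairCheck {
    // True for code points in the two Unicode blocks that hold CJK radicals.
    static boolean isRadical(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.CJK_RADICALS_SUPPLEMENT
                || block == Character.UnicodeBlock.KANGXI_RADICALS;
    }

    // True for code points in the main CJK Unified Ideographs block only;
    // extension blocks are deliberately left out of this sketch.
    static boolean isUnifiedIdeograph(int codePoint) {
        return Character.UnicodeBlock.of(codePoint)
                == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    // True when one code point is a radical and the other a unified ideograph.
    static boolean isIdeographRadicalPair(int a, int b) {
        return (isRadical(a) && isUnifiedIdeograph(b))
                || (isUnifiedIdeograph(a) && isRadical(b));
    }

    public static void main(String[] args) {
        System.out.println(isIdeographRadicalPair(0x98DF, 0x2EDD)); // prints "true"
    }
}
```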
   
   #### testToUnicodeCjkAndRadicalLookAlike
   
   Uses the explicit pair 食 (U+98DF) and ⻝ (U+2EDD) and verifies that:
   
   - both characters share the same glyph,
   - the radical is the first entry returned by the cmap reverse lookup,
   - an ideograph entered as 食 is extracted as 食, and
   - an intentionally entered radical ⻝ is preserved as ⻝.
   
   Before this change, the test fails because 食 is extracted as ⻝.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

