[jira] [Commented] (PDFBOX-6210) Incorrect CJK Character Extraction for Shared Glyphs

ASF subversion and git services (Jira) Sat, 13 Jun 2026 04:04:12 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-6210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088714#comment-18088714
 ]


ASF subversion and git services commented on PDFBOX-6210:
---------------------------------------------------------

Commit 1935229 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1935229 ]

PDFBOX-6210: Sonar fix

> Incorrect CJK Character Extraction for Shared Glyphs
> ----------------------------------------------------
>
>                 Key: PDFBOX-6210
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6210
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.36, 3.0.7 PDFBox
>            Reporter: Tilman Hausherr
>            Priority: Major
>             Fix For: 2.0.37, 3.0.8 PDFBox, 4.0.0
>
>
> As described by [~dogbokchif] in attached PR
> ===
> h2. Problem
> When multiple Unicode code points map to the same glyph, text extracted from 
> a generated PDF may not match the character originally entered by the author.
> This commonly occurs in CJK fonts, where a CJK Unified Ideograph and its 
> radical counterpart share a glyph. For example, in Noto Sans CJK, both 食 
> (U+98DF) and ⻝ (U+2EDD, CJK RADICAL EAT ONE) map to the same glyph. As a 
> result, rendering 食 in a PDF and extracting it with PDFTextStripper may 
> incorrectly produce ⻝.
> h2. Root Cause
> PDCIDFontType2Embedder.buildToUnicodeCMap() constructs the /ToUnicode map by 
> reverse-looking up a glyph's code point using:
> cmapLookup.getCharCodes(gid).get(0)
>  
> However, getCharCodes() returns code points sorted in ascending order. 
> Consequently, the first entry is always the lowest code point associated with 
> the glyph. In the example above, U+2EDD is selected instead of U+98DF because 
> it has the smaller value.
> As a result, the generated /ToUnicode mapping may point to a compatibility or 
> radical character rather than the character that was actually used in the 
> document. The existing comment ("use the first entry even for ambiguous 
> mappings") already acknowledges this limitation.
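The selection rule described above can be illustrated without PDFBox. The following is a minimal sketch using plain Java collections as a stand-in for the reverse cmap lookup; the class and method names are hypothetical and are not part of the PDFBox API. It only shows why taking the first entry of an ascending list yields the radical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ReverseLookupSketch {
    // Hypothetical stand-in for a reverse cmap lookup on the shared glyph:
    // both code points map to it, and the result is sorted ascending.
    static List<Integer> charCodesForSharedGlyph() {
        List<Integer> codes = new ArrayList<>();
        codes.add(0x98DF); // CJK Unified Ideograph, as typed by the author
        codes.add(0x2EDD); // CJK RADICAL EAT ONE
        Collections.sort(codes);
        return codes;
    }

    // Mirrors the ".get(0)" selection: always the lowest code point.
    static int firstEntry() {
        return charCodesForSharedGlyph().get(0);
    }

    public static void main(String[] args) {
        System.out.printf("U+%04X%n", firstEntry()); // prints U+2EDD
    }
}
```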
> This issue is not unique to PDFBox and has also been observed in 
> [wkhtmltopdf|https://github.com/wkhtmltopdf/wkhtmltopdf/issues/4414]/[Qt|https://github.com/qt/qtbase/blob/389988c42f901f2d8f75a023039d641cf5fba9de/src/gui/text/qfontsubset.cpp#L200-L209],
>  [Mozilla Firefox|https://bugzilla.mozilla.org/show_bug.cgi?id=1881196], and 
> [Typst|https://github.com/typst/typst/issues/4582].
> h2. Fix
> The code points actually used in the document are already tracked by 
> TrueTypeEmbedder.addToSubset(int).
> This change builds a glyph → code point mapping from those recorded inputs 
> and uses it when generating the /ToUnicode CMap. For glyphs associated with 
> multiple code points, the first code point encountered in the document is 
> preferred.
> The existing reverse cmap lookup is retained only as a fallback for glyphs 
> that have no recorded input code point (for example, glyphs drawn directly by 
> GID).
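The "first used code point wins, reverse lookup as fallback" rule can be sketched as follows. This is an illustration under stated assumptions, not the committed code: the class, the method names, the glyph ID 1234, and the two demo lookups are all hypothetical, and the font is replaced by plain functions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.IntFunction;
import java.util.function.IntUnaryOperator;

public class UsedCodePointSketch {
    static final int SHARED_GID = 1234; // hypothetical glyph ID

    // Hypothetical forward lookup: both code points share one glyph, 0 means "no glyph".
    static int demoGlyphId(int codePoint) {
        return (codePoint == 0x98DF || codePoint == 0x2EDD) ? SHARED_GID : 0;
    }

    // Hypothetical reverse lookup, sorted ascending like the one described above.
    static List<Integer> demoReverseLookup(int gid) {
        return gid == SHARED_GID
                ? Arrays.asList(0x2EDD, 0x98DF)
                : Collections.<Integer>emptyList();
    }

    // Build glyph -> code point from the recorded inputs; the first occurrence wins.
    static Map<Integer, Integer> buildGidToUsedCodePoint(
            Set<Integer> usedCodePoints, IntUnaryOperator glyphIdOf) {
        Map<Integer, Integer> gidToCodePoint = new LinkedHashMap<>();
        for (int codePoint : usedCodePoints) {
            int gid = glyphIdOf.applyAsInt(codePoint);
            if (gid != 0) {
                gidToCodePoint.putIfAbsent(gid, codePoint);
            }
        }
        return gidToCodePoint;
    }

    // Prefer the recorded code point; fall back to the reverse lookup; -1 if neither exists.
    static int toUnicodeFor(int gid, Map<Integer, Integer> used,
                            IntFunction<List<Integer>> reverseLookup) {
        Integer recorded = used.get(gid);
        if (recorded != null) {
            return recorded;
        }
        List<Integer> codes = reverseLookup.apply(gid);
        return codes.isEmpty() ? -1 : codes.get(0);
    }
}
```

With only U+98DF recorded, the shared glyph resolves to U+98DF. With nothing recorded, the same glyph still resolves through the fallback to U+2EDD, which matches the old behavior for glyphs drawn directly by GID.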
> To ensure deterministic behavior, subsetCodePoints is changed from a HashSet 
> to a LinkedHashSet, preserving insertion order and making the "first 
> occurrence wins" rule stable.
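The ordering guarantee that the paragraph above relies on is standard java.util behavior: a LinkedHashSet iterates in insertion order and ignores re-insertions, while a HashSet makes no ordering promise. A small sketch, with a hypothetical class and method name:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class InsertionOrderSketch {
    // Record code points the way a subset might be collected while text is written,
    // then return them in iteration order.
    static List<Integer> recordInOrder(int... codePoints) {
        Set<Integer> subset = new LinkedHashSet<>();
        for (int codePoint : codePoints) {
            subset.add(codePoint); // a repeated code point keeps its original position
        }
        return new ArrayList<>(subset);
    }

    public static void main(String[] args) {
        // The ideograph is used first, then the radical, then the ideograph again.
        for (int codePoint : recordInOrder(0x98DF, 0x2EDD, 0x98DF)) {
            System.out.printf("U+%04X%n", codePoint);
        }
    }
}
```

Because iteration order equals first-use order, a "first occurrence wins" pass over the set gives the same result on every run.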
> This follows the same approach adopted by Typst to resolve the identical 
> issue ([typst/typst#4582|https://github.com/typst/typst/issues/4582], fixed 
> in [typst/typst#4585|https://github.com/typst/typst/pull/4585]): prefer the 
> code point that was actually used rather than reverse-mapping from the font 
> cmap.
> h2. Tests
> Two tests were added to TestFontEmbedding using Noto Sans CJK KR:
> h4. testToUnicodePrefersUsedCodePoint
> Scans the font for any glyph shared by multiple printable code points 
> (preferably a CJK ideograph/radical pair) and verifies that each code point 
> round-trips through PDF generation and extraction as itself.
> h4. testToUnicodeCjkAndRadicalLookAlike
> Uses the explicit pair 食 (U+98DF) and ⻝ (U+2EDD) and verifies that:
> * both characters share the same glyph,
> * the radical is the first entry returned by the cmap reverse lookup,
> * an ideograph entered as 食 is extracted as 食, and
> * an intentionally entered radical ⻝ is preserved as ⻝.
> Before this change, the test fails because 食 is extracted as ⻝.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

