[ https://issues.apache.org/jira/browse/PDFBOX-6210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088714#comment-18088714 ]
ASF subversion and git services commented on PDFBOX-6210:
---------------------------------------------------------
Commit 1935229 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1935229 ]
PDFBOX-6210: Sonar fix
> Incorrect CJK Character Extraction for Shared Glyphs
> ----------------------------------------------------
>
> Key: PDFBOX-6210
> URL: https://issues.apache.org/jira/browse/PDFBOX-6210
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.36, 3.0.7 PDFBox
> Reporter: Tilman Hausherr
> Priority: Major
> Fix For: 2.0.37, 3.0.8 PDFBox, 4.0.0
>
>
> As described by [~dogbokchif] in attached PR
> ===
> h2. Problem
> When multiple Unicode code points map to the same glyph, text extracted from
> a generated PDF may not match the character originally entered by the author.
> This commonly occurs in CJK fonts, where a CJK Unified Ideograph and its
> radical counterpart share a glyph. For example, in Noto Sans CJK, both 食
> (U+98DF) and ⻝ (U+2EDD, CJK RADICAL EAT ONE) map to the same glyph. As a
> result, rendering 食 in a PDF and extracting it with PDFTextStripper may
> incorrectly produce ⻝.
> h2. Root Cause
> PDCIDFontType2Embedder.buildToUnicodeCMap() constructs the /ToUnicode map by
> reverse-looking up a glyph's code point using:
> cmapLookup.getCharCodes(gid).get(0)
>
> However, getCharCodes() returns code points sorted in ascending order.
> Consequently, the first entry is always the lowest code point associated with
> the glyph. In the example above, U+2EDD is selected instead of U+98DF because
> it has the smaller value.
> As a result, the generated /ToUnicode mapping may point to a compatibility or
> radical character rather than the character that was actually used in the
> document. The existing comment ("use the first entry even for ambiguous
> mappings") already acknowledges this limitation.
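To make the selection problem concrete, here is a minimal sketch that does not use PDFBox at all. The class and method names (ReverseLookupSketch, charCodesAscending, pickFirst) are hypothetical stand-ins; the only assumption taken from the description above is that the reverse lookup yields code points in ascending order and that the first one is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ReverseLookupSketch {
    // Hypothetical stand-in for a reverse cmap lookup: all code points
    // that map to one glyph, returned in ascending order.
    static List<Integer> charCodesAscending(List<Integer> codePointsForGlyph) {
        List<Integer> sorted = new ArrayList<>(codePointsForGlyph);
        Collections.sort(sorted);
        return sorted;
    }

    // Mirrors the "use the first entry" rule: always yields the lowest code point.
    static int pickFirst(List<Integer> codePointsForGlyph) {
        return charCodesAscending(codePointsForGlyph).get(0);
    }

    public static void main(String[] args) {
        // The author typed U+98DF, but U+2EDD shares the glyph and sorts first.
        int picked = pickFirst(Arrays.asList(0x98DF, 0x2EDD));
        System.out.printf("U+%04X%n", picked); // prints U+2EDD
    }
}
```

Whatever order the code points were used in the document, the result is the same, which is why the ideograph is lost.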
> This issue is not unique to PDFBox and has also been observed in
> [wkhtmltopdf|https://github.com/wkhtmltopdf/wkhtmltopdf/issues/4414]/[Qt|https://github.com/qt/qtbase/blob/389988c42f901f2d8f75a023039d641cf5fba9de/src/gui/text/qfontsubset.cpp#L200-L209],
> [Mozilla Firefox|https://bugzilla.mozilla.org/show_bug.cgi?id=1881196], and
> [Typst|https://github.com/typst/typst/issues/4582].
> h2. Fix
> The code points actually used in the document are already tracked by
> TrueTypeEmbedder.addToSubset(int).
> This change builds a glyph → code point mapping from those recorded inputs
> and uses it when generating the /ToUnicode CMap. For glyphs associated with
> multiple code points, the first code point encountered in the document is
> preferred.
> The existing reverse cmap lookup is retained only as a fallback for glyphs
> that have no recorded input code point (for example, glyphs drawn directly by
> GID).
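The approach described above can be sketched roughly as follows. This is not the committed PDFBox code: UsedCodePointSketch, glyphToUsedCodePoint and toUnicodeFor are hypothetical names, and the font's cmap is replaced by plain functional interfaces so the example stays self-contained.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.IntFunction;
import java.util.function.IntUnaryOperator;

class UsedCodePointSketch {
    // Builds glyph id -> code point from the code points actually used.
    // The set is assumed to iterate in insertion order, so the first
    // code point seen for a glyph wins.
    static Map<Integer, Integer> glyphToUsedCodePoint(Set<Integer> usedCodePoints,
                                                      IntUnaryOperator glyphIdOf) {
        Map<Integer, Integer> result = new LinkedHashMap<>();
        for (int cp : usedCodePoints) {
            result.putIfAbsent(glyphIdOf.applyAsInt(cp), cp);
        }
        return result;
    }

    // Prefers the recorded code point; falls back to the first entry of the
    // reverse lookup for glyphs with no recorded input. Returns -1 if the
    // glyph has no known code point at all.
    static int toUnicodeFor(int gid, Map<Integer, Integer> used,
                            IntFunction<List<Integer>> reverseLookup) {
        Integer cp = used.get(gid);
        if (cp != null) {
            return cp;
        }
        List<Integer> codes = reverseLookup.apply(gid);
        return (codes == null || codes.isEmpty()) ? -1 : codes.get(0);
    }
}
```

With 食 (U+98DF) recorded before ⻝ (U+2EDD) and both resolving to the same glyph id, toUnicodeFor returns U+98DF for that glyph, while a glyph that was never entered as text still gets the reverse-lookup result.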
> To ensure deterministic behavior, subsetCodePoints is changed from a HashSet
> to a LinkedHashSet, preserving insertion order and making the "first
> occurrence wins" rule stable.
> This follows the same approach adopted by Typst to resolve the identical
> issue ([typst/typst#4582|https://github.com/typst/typst/issues/4582], fixed
> in [typst/typst#4585|https://github.com/typst/typst/pull/4585]): prefer the
> code point that was actually used rather than reverse-mapping from the font
> cmap.
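On the HashSet to LinkedHashSet change mentioned above: the "first occurrence wins" rule only makes sense if iteration follows insertion order, which LinkedHashSet guarantees and HashSet does not. A small illustration, with InsertionOrderSketch and iterationOrder as hypothetical names:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class InsertionOrderSketch {
    // Records code points the way a subset might, then returns them in
    // iteration order. Duplicates are ignored and do not change the
    // position of the first occurrence.
    static List<Integer> iterationOrder(int... codePoints) {
        Set<Integer> subset = new LinkedHashSet<>();
        for (int cp : codePoints) {
            subset.add(cp);
        }
        return new ArrayList<>(subset);
    }

    public static void main(String[] args) {
        for (int cp : iterationOrder(0x98DF, 0x2EDD, 0x98DF)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```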
> h2. Tests
> Two tests were added to TestFontEmbedding using Noto Sans CJK KR:
> h4. testToUnicodePrefersUsedCodePoint
> Scans the font for any glyph shared by multiple printable code points
> (preferably a CJK ideograph/radical pair) and verifies that each code point
> round-trips through PDF generation and extraction as itself.
> h4. testToUnicodeCjkAndRadicalLookAlike
> Uses the explicit pair 食 (U+98DF) and ⻝ (U+2EDD) and verifies that:
> * both characters share the same glyph,
> * the radical is the first entry returned by the cmap reverse lookup,
> * an ideograph entered as 食 is extracted as 食, and
> * an intentionally entered radical ⻝ is preserved as ⻝.
> Before this change, the test fails because 食 is extracted as ⻝.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)