[
https://issues.apache.org/jira/browse/PDFBOX-6210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr updated PDFBOX-6210:
------------------------------------
Description:
As described by @dogbokchif in the attached PR
===
h2. Problem
When multiple Unicode code points map to the same glyph, text extracted from a
generated PDF may not match the character originally entered by the author.
This commonly occurs in CJK fonts, where a CJK Unified Ideograph and its
radical counterpart share a glyph. For example, in Noto Sans CJK, both 食
(U+98DF) and ⻝ (U+2EDD, CJK RADICAL EAT ONE) map to the same glyph. As a
result, rendering 食 in a PDF and extracting it with PDFTextStripper may
incorrectly produce ⻝.
h2. Root Cause
PDCIDFontType2Embedder.buildToUnicodeCMap() constructs the /ToUnicode map by
reverse-looking up a glyph's code point using:
cmapLookup.getCharCodes(gid).get(0)
However, getCharCodes() returns code points sorted in ascending order.
Consequently, the first entry is always the lowest code point associated with
the glyph. In the example above, U+2EDD is selected instead of U+98DF because
it has the smaller value.
As a result, the generated /ToUnicode mapping may point to a compatibility or
radical character rather than the character that was actually used in the
document. The existing comment ("use the first entry even for ambiguous
mappings") already acknowledges this limitation.
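The effect of taking the first entry of an ascending list can be reproduced without PDFBox. The following is a minimal sketch, not the PDFBox implementation: the class name, the helper methods, and glyph ID 1234 are hypothetical, and the free-standing getCharCodes() here only imitates the ascending order described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReverseLookupDemo {
    // Hypothetical stand-in for a font cmap: code point -> glyph ID.
    // GID 1234 is made up; only the sharing of one glyph matters here.
    static Map<Integer, Integer> sampleCmap() {
        Map<Integer, Integer> cmap = new HashMap<>();
        cmap.put(0x98DF, 1234); // CJK Unified Ideograph
        cmap.put(0x2EDD, 1234); // CJK RADICAL EAT ONE, same glyph
        cmap.put(0x0041, 36);   // unrelated glyph with a single code point
        return cmap;
    }

    // Imitates the reverse lookup: all code points for a glyph, ascending.
    static List<Integer> getCharCodes(Map<Integer, Integer> cmap, int gid) {
        List<Integer> codes = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : cmap.entrySet()) {
            if (e.getValue() == gid) {
                codes.add(e.getKey());
            }
        }
        Collections.sort(codes);
        return codes;
    }

    public static void main(String[] args) {
        int chosen = getCharCodes(sampleCmap(), 1234).get(0);
        // The radical wins regardless of which character the author typed.
        System.out.println(String.format("U+%04X", chosen));
    }
}
```

Because the choice depends only on numeric order, no information about the text the author actually typed can influence it.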
h5. This issue is not unique to PDFBox and has also been observed in
[wkhtmltopdf|https://github.com/wkhtmltopdf/wkhtmltopdf/issues/4414]/[Qt|https://github.com/qt/qtbase/blob/389988c42f901f2d8f75a023039d641cf5fba9de/src/gui/text/qfontsubset.cpp#L200-L209],
[Mozilla Firefox|https://bugzilla.mozilla.org/show_bug.cgi?id=1881196], and
[Typst|https://github.com/typst/typst/issues/4582].
h2. Fix
The code points actually used in the document are already tracked by
TrueTypeEmbedder.addToSubset(int).
This change builds a glyph → code point mapping from those recorded inputs and
uses it when generating the /ToUnicode CMap. For glyphs associated with
multiple code points, the first code point encountered in the document is
preferred.
The existing reverse cmap lookup is retained only as a fallback for glyphs that
have no recorded input code point (for example, glyphs drawn directly by GID).
To ensure deterministic behavior, subsetCodePoints is changed from a HashSet to
a LinkedHashSet, preserving insertion order and making the "first occurrence
wins" rule stable.
This follows the same approach adopted by Typst to resolve the identical issue
([typst/typst#4582|https://github.com/typst/typst/issues/4582], fixed in
[typst/typst#4585|https://github.com/typst/typst/pull/4585]): prefer the code
point that was actually used rather than reverse-mapping from the font cmap.
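The "first occurrence wins" rule with a reverse-lookup fallback can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: the class, the method names, and glyph ID 1234 are hypothetical, and a plain Map stands in for the font cmap.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class ToUnicodeSketch {
    // Builds glyph ID -> code point from the code points recorded while
    // writing text. Iteration follows the set's order, so an
    // insertion-ordered set makes the first code point used win.
    static Map<Integer, Integer> buildGidToCodePoint(
            Set<Integer> usedCodePoints, Map<Integer, Integer> cmap) {
        Map<Integer, Integer> gidToCodePoint = new HashMap<>();
        for (int codePoint : usedCodePoints) {
            Integer gid = cmap.get(codePoint);
            if (gid != null) {
                gidToCodePoint.putIfAbsent(gid, codePoint);
            }
        }
        return gidToCodePoint;
    }

    // Prefers the recorded code point; otherwise falls back to the lowest
    // code point the cmap maps to this glyph (-1 if there is none).
    static int toUnicodeFor(int gid, Map<Integer, Integer> recorded,
            Map<Integer, Integer> cmap) {
        Integer codePoint = recorded.get(gid);
        if (codePoint != null) {
            return codePoint;
        }
        int lowest = -1;
        for (Map.Entry<Integer, Integer> e : cmap.entrySet()) {
            if (e.getValue() == gid && (lowest < 0 || e.getKey() < lowest)) {
                lowest = e.getKey();
            }
        }
        return lowest;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> cmap = new HashMap<>();
        cmap.put(0x98DF, 1234); // ideograph, hypothetical GID
        cmap.put(0x2EDD, 1234); // radical, same glyph
        Set<Integer> used = new LinkedHashSet<>();
        used.add(0x98DF); // the author typed the ideograph
        int result = toUnicodeFor(1234, buildGidToCodePoint(used, cmap), cmap);
        System.out.println(String.format("U+%04X", result));
    }
}
```

With a HashSet, the iteration order of the recorded code points is unspecified, so which of two used code points reaches putIfAbsent first could vary; the insertion-ordered set removes that ambiguity.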
h2. Tests
Two tests were added to TestFontEmbedding using Noto Sans CJK KR:
h4. testToUnicodePrefersUsedCodePoint
Scans the font for any glyph shared by multiple printable code points
(preferably a CJK ideograph/radical pair) and verifies that each code point
round-trips through PDF generation and extraction as itself.
h4. testToUnicodeCjkAndRadicalLookAlike
Uses the explicit pair 食 (U+98DF) and ⻝ (U+2EDD) and verifies that:
* both characters share the same glyph,
* the radical is the first entry returned by the cmap reverse lookup,
* an ideograph entered as 食 is extracted as 食, and
* an intentionally entered radical ⻝ is preserved as ⻝.
Before this change, the test fails because 食 is extracted as ⻝.
was:
As described by Chanhyuk Lee in attached PR
> Incorrect CJK Character Extraction for Shared Glyphs
> ----------------------------------------------------
>
> Key: PDFBOX-6210
> URL: https://issues.apache.org/jira/browse/PDFBOX-6210
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.36, 3.0.7 PDFBox
> Reporter: Tilman Hausherr
> Priority: Major
> Fix For: 2.0.37, 3.0.8 PDFBox, 4.0.0