[
https://issues.apache.org/jira/browse/PDFBOX-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081893#comment-18081893
]
Tilman Hausherr edited comment on PDFBOX-4951 at 5/19/26 4:41 AM:
------------------------------------------------------------------
I thought I should make an effort to look into this (PR 436). So you're adding
a processor that returns something independent of whatever library is used to
get the positions (which I like). I then asked copilot about it. Re the
answers, IIRC we don't expect multithreading. The other complaint sounds valid.
===
*Assessment: not ready to approve yet.* The PR is promising and mergeable from
GitHub’s perspective, but I see at least {*}two substantive review concerns{*}:
one likely functional bug in the new glyph-positioning path, and one API/design
risk around mutable shared state.
h2. PR summary
* *PR:* #436 — _GlyphLayoutProcessor for correct glyph layout and support of
DIN 91379 (Issue PDFBOX-4951)_
* *Author:* {{vk-github18}}
* *State:* Open, not draft
* *Mergeability:* {{clean}} / mergeable
* *Scope:* {*}13 files changed{*}, {*}+1084 / -2{*}, *3 commits*
* *Reviews/comments:* no reviews, no inline review comments, no PR comments
* *Checks:* no check runs were reported for the head commit
h3. What it does
This PR introduces a new {{GlyphLayoutProcessor}} pipeline that uses Java AWT
glyph layout to improve positioning of complex Latin glyph sequences,
especially for *DIN 91379* support. It wires that processor into:
* normal content streams
* AcroForm appearance generation
* example code/resources demonstrating usage
Core implementation files:
* {{pdfbox/src/main/java/org/apache/pdfbox/pdmodel/GlyphLayoutProcessor.java}}
* {{pdfbox/src/main/java/org/apache/pdfbox/pdmodel/GlyphLayoutFontLoader.java}}
* {{pdfbox/src/main/java/org/apache/pdfbox/pdmodel/GlyphsAndPositions.java}}
* modifications in {{PDAbstractContentStream}}
* modifications in {{PDAcroForm}}
* modifications in {{AppearanceGeneratorHelper}}
h2. Core changes
# *Adds a glyph-layout abstraction*
*
** Loads a PDFBox {{PDType0Font}} and matching AWT {{Font}}
** Uses {{layoutGlyphVector(...)}} to compute glyph order/positions
** Emits PDF text with explicit positioning adjustments when needed
# *Extends content-stream writing*
*
** {{PDAbstractContentStream.showText(...)}} now conditionally routes through
{{GlyphLayoutProcessor}}
** Adds lower-level helpers for writing glyph IDs and TJ-style positioning
arrays
# *Extends AcroForm handling*
*
** {{PDAcroForm}} can store a {{GlyphLayoutProcessor}}
** appearance generation forwards it into form field rendering
# *Adds examples/resources*
*
** two Java examples
** sample form/pdf/font resources
** license file for bundled font
----
h2. Review findings
h3. 1. *Likely bug: {{showText(String)}} no longer enforces the old
preconditions before delegating*
In {{{}PDAbstractContentStream.showText(String){}}}, the old implementation
always went through {{showTextInternal(text)}} first, which performed the usual
safety checks like text mode and font setup. The new implementation bypasses
that when glyph layout is enabled.
Before, the method effectively always validated state before writing text. Now:
apache / pdfbox / files
{{public void showText(String text) throws IOException
{ PDFont font = fontStack.peek(); if(glyphLayoutProcessor != null &&
glyphLayoutProcessor.supportsFont(font))
{
glyphLayoutProcessor.showText(text);}}
*Problem:* {{fontStack.peek()}} is called before checking whether:
* text mode has begun
* a font has been set
* {{fontStack}} is non-empty
That changes behavior from a controlled {{IllegalStateException}} path into a
likely {{{}EmptyStackException{}}}/state inconsistency path.
*Why this matters:* this is a public core rendering API path. Regressing error
semantics in {{showText()}} is risky and could break callers in surprising ways.
*Suggested fix:* preserve the old precondition flow before branching to
glyph-layout behavior. For example, validate via existing internal checks
first, or explicitly check {{inTextMode}} / {{fontStack.isEmpty()}} before
{{{}peek(){}}}.
----
h3. 2. *Thread-safety / API-state risk in {{GlyphLayoutProcessor}}*
The PR description says this is “Designed for multithreading environment,” but
the processor stores mutable per-write state:
* {{private PDType0Font font;}}
* {{private float fontSize;}}
* {{private PDAbstractContentStream contentStream;}}
and these are mutated through:
* {{setContentStream(...)}}
* {{setFontAndSize(...)}}
That means the processor is not inherently thread-safe if reused across content
streams, pages, or forms concurrently.
Relevant design shape:
apache / pdfbox / files
{{private PDType0Font font;private float fontSize;private
PDAbstractContentStream contentStream;}}
*Why this matters:* the PR explicitly advertises multithreaded suitability, but
the object is stateful and shared by reference through setters. If a caller
reuses one processor instance across threads or streams, one write could
overwrite another’s {{{}font{}}}, {{{}fontSize{}}}, or {{{}contentStream{}}}.
*Suggested fix:* either:
* make the processor stateless for layout operations by passing
stream/font/fontSize as method parameters, or
* clearly document that a processor instance must not be shared concurrently,
or
* scope processor state to the call rather than the object
As written, I would not accept the multithreading claim.
----
h2. Other notable changes
h3. New low-level glyph writing helpers
The PR adds {{{}writeTextPDType0Font(...){}}},
{{{}showTextPDType0Font(...){}}}, and {{showGlyphsWithPositioning(...)}} in
{{{}PDAbstractContentStream{}}}.
This is a reasonable direction, but it increases the surface area of
content-stream internals and duplicates some existing text-writing logic. Worth
checking for consistency with existing subset/encoding assumptions.
h3. AcroForm integration
The AcroForm wiring is straightforward:
* {{PDAcroForm}} stores the processor
* {{AppearanceGeneratorHelper}} and page append streams apply it if present
That part looks conceptually coherent.
h3. Examples/resources
Useful for demonstrating the feature, though the PR is fairly example-heavy
relative to the core API addition.
----
h2. Merge readiness and risk
h3. Readiness
*Not ready for approval* yet.
h3. Risk
*Medium to high* because this modifies a central text-rendering class:
* {{PDAbstractContentStream.showText(...)}} is core behavior
* text encoding/subsetting paths are delicate
* form rendering is affected too
h3. Blockers before merge
# *Fix {{showText()}} state validation regression*
# *Resolve or document the thread-safety contradiction*
# Ideally add tests covering:
** {{showText()}} error behavior with and without glyph layout enabled
** glyph-layout output for combining marks / DIN 91379 sample cases
** AcroForm appearance generation with the processor enabled
----
h2. Possible improvements
* *API naming/docs cleanup:* {{GlyphLayoutProcessor}} is more than a
processor; it also owns font loading and stream state.
* *Typos/comments:* there are a few small comment/doc issues like “lyph layout
processor”, “specofierd”, “embesSubset”.
* *State minimization:* {{newLineAtOffsetBasic(...)}} is added but not used in
the visible patch; if unused, avoid expanding API surface unnecessarily.
* *Resource handling:* {{loadFont(...)}} copies the full font stream into
memory. That may be acceptable, but deserves a comment since it’s done to
initialize both PDFBox and AWT font representations.
----
h2. Recommendation
{*}Request changes{*}, specifically on:
# preserving {{showText()}} precondition behavior
# clarifying/fixing thread-safety of {{GlyphLayoutProcessor}}
If you want, I can next do one of these:
# *Draft exact review comments* you can paste on the PR
# *Deep-dive a specific file* like {{PDAbstractContentStream.java}}
# *Look for missing tests / propose test cases*
# *Summarize the PR in maintainer-friendly approval language once blockers are
fixed*
was (Author: tilman):
I thought I should make an effort to look into this (PR 436). So you're adding
a processor that returns something independent of whatever library is used to
get the positions. I then asked copilot about it. Re the answers, IIRC we don't
expect multithreading. The other complaint sounds valid.
===
*Assessment: not ready to approve yet.* The PR is promising and mergeable from
GitHub’s perspective, but I see at least {*}two substantive review concerns{*}:
one likely functional bug in the new glyph-positioning path, and one API/design
risk around mutable shared state.
h2. PR summary
* *PR:* #436 — _GlyphLayoutProcessor for correct glyph layout and support of
DIN 91379 (Issue PDFBOX-4951)_
* *Author:* {{vk-github18}}
* *State:* Open, not draft
* *Mergeability:* {{clean}} / mergeable
* *Scope:* {*}13 files changed{*}, {*}+1084 / -2{*}, *3 commits*
* *Reviews/comments:* no reviews, no inline review comments, no PR comments
* *Checks:* no check runs were reported for the head commit
h3. What it does
This PR introduces a new {{GlyphLayoutProcessor}} pipeline that uses Java AWT
glyph layout to improve positioning of complex Latin glyph sequences,
especially for *DIN 91379* support. It wires that processor into:
* normal content streams
* AcroForm appearance generation
* example code/resources demonstrating usage
Core implementation files:
* {{pdfbox/src/main/java/org/apache/pdfbox/pdmodel/GlyphLayoutProcessor.java}}
* {{pdfbox/src/main/java/org/apache/pdfbox/pdmodel/GlyphLayoutFontLoader.java}}
* {{pdfbox/src/main/java/org/apache/pdfbox/pdmodel/GlyphsAndPositions.java}}
* modifications in {{PDAbstractContentStream}}
* modifications in {{PDAcroForm}}
* modifications in {{AppearanceGeneratorHelper}}
h2. Core changes
# *Adds a glyph-layout abstraction*
** Loads a PDFBox {{PDType0Font}} and matching AWT {{Font}}
** Uses {{layoutGlyphVector(...)}} to compute glyph order/positions
** Emits PDF text with explicit positioning adjustments when needed
# *Extends content-stream writing*
** {{PDAbstractContentStream.showText(...)}} now conditionally routes through
{{GlyphLayoutProcessor}}
** Adds lower-level helpers for writing glyph IDs and TJ-style positioning
arrays
# *Extends AcroForm handling*
** {{PDAcroForm}} can store a {{GlyphLayoutProcessor}}
** appearance generation forwards it into form field rendering
# *Adds examples/resources*
** two Java examples
** sample form/pdf/font resources
** license file for bundled font
----
h2. Review findings
h3. 1. *Likely bug: {{showText(String)}} no longer enforces the old
preconditions before delegating*
In {{{}PDAbstractContentStream.showText(String){}}}, the old implementation
always went through {{showTextInternal(text)}} first, which performed the usual
safety checks like text mode and font setup. The new implementation bypasses
that when glyph layout is enabled.
Before, the method effectively always validated state before writing text. Now:
apache / pdfbox / files
{{public void showText(String text) throws IOException
{ PDFont font = fontStack.peek(); if(glyphLayoutProcessor != null &&
glyphLayoutProcessor.supportsFont(font))
{
glyphLayoutProcessor.showText(text);}}
*Problem:* {{fontStack.peek()}} is called before checking whether:
* text mode has begun
* a font has been set
* {{fontStack}} is non-empty
That changes behavior from a controlled {{IllegalStateException}} path into a
likely {{{}EmptyStackException{}}}/state inconsistency path.
*Why this matters:* this is a public core rendering API path. Regressing error
semantics in {{showText()}} is risky and could break callers in surprising ways.
*Suggested fix:* preserve the old precondition flow before branching to
glyph-layout behavior. For example, validate via existing internal checks
first, or explicitly check {{inTextMode}} / {{fontStack.isEmpty()}} before
{{{}peek(){}}}.
----
h3. 2. *Thread-safety / API-state risk in {{GlyphLayoutProcessor}}*
The PR description says this is “Designed for multithreading environment,” but
the processor stores mutable per-write state:
* {{private PDType0Font font;}}
* {{private float fontSize;}}
* {{private PDAbstractContentStream contentStream;}}
and these are mutated through:
* {{setContentStream(...)}}
* {{setFontAndSize(...)}}
That means the processor is not inherently thread-safe if reused across content
streams, pages, or forms concurrently.
Relevant design shape:
apache / pdfbox / files
{{private PDType0Font font;private float fontSize;private
PDAbstractContentStream contentStream;}}
*Why this matters:* the PR explicitly advertises multithreaded suitability, but
the object is stateful and shared by reference through setters. If a caller
reuses one processor instance across threads or streams, one write could
overwrite another’s {{{}font{}}}, {{{}fontSize{}}}, or {{{}contentStream{}}}.
*Suggested fix:* either:
* make the processor stateless for layout operations by passing
stream/font/fontSize as method parameters, or
* clearly document that a processor instance must not be shared concurrently,
or
* scope processor state to the call rather than the object
As written, I would not accept the multithreading claim.
----
h2. Other notable changes
h3. New low-level glyph writing helpers
The PR adds {{{}writeTextPDType0Font(...){}}},
{{{}showTextPDType0Font(...){}}}, and {{showGlyphsWithPositioning(...)}} in
{{{}PDAbstractContentStream{}}}.
This is a reasonable direction, but it increases the surface area of
content-stream internals and duplicates some existing text-writing logic. Worth
checking for consistency with existing subset/encoding assumptions.
h3. AcroForm integration
The AcroForm wiring is straightforward:
* {{PDAcroForm}} stores the processor
* {{AppearanceGeneratorHelper}} and page append streams apply it if present
That part looks conceptually coherent.
h3. Examples/resources
Useful for demonstrating the feature, though the PR is fairly example-heavy
relative to the core API addition.
----
h2. Merge readiness and risk
h3. Readiness
*Not ready for approval* yet.
h3. Risk
*Medium to high* because this modifies a central text-rendering class:
* {{PDAbstractContentStream.showText(...)}} is core behavior
* text encoding/subsetting paths are delicate
* form rendering is affected too
h3. Blockers before merge
# *Fix {{showText()}} state validation regression*
# *Resolve or document the thread-safety contradiction*
# Ideally add tests covering:
** {{showText()}} error behavior with and without glyph layout enabled
** glyph-layout output for combining marks / DIN 91379 sample cases
** AcroForm appearance generation with the processor enabled
----
h2. Possible improvements
* *API naming/docs cleanup:* {{GlyphLayoutProcessor}} is more than a
processor; it also owns font loading and stream state.
* *Typos/comments:* there are a few small comment/doc issues like “lyph layout
processor”, “specofierd”, “embesSubset”.
* *State minimization:* {{newLineAtOffsetBasic(...)}} is added but not used in
the visible patch; if unused, avoid expanding API surface unnecessarily.
* *Resource handling:* {{loadFont(...)}} copies the full font stream into
memory. That may be acceptable, but deserves a comment since it’s done to
initialize both PDFBox and AWT font representations.
----
h2. Recommendation
{*}Request changes{*}, specifically on:
# preserving {{showText()}} precondition behavior
# clarifying/fixing thread-safety of {{GlyphLayoutProcessor}}
If you want, I can next do one of these:
# *Draft exact review comments* you can paste on the PR
# *Deep-dive a specific file* like {{PDAbstractContentStream.java}}
# *Look for missing tests / propose test cases*
# *Summarize the PR in maintainer-friendly approval language once blockers are
fixed*
> Sequences of DIN SPEC 91379 with combining letters are rendered incorrectly
> ---------------------------------------------------------------------------
>
> Key: PDFBOX-4951
> URL: https://issues.apache.org/jira/browse/PDFBOX-4951
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.21
> Reporter: Volker Kunert
> Priority: Major
> Attachments: DIN_SPEC_91379_Sequences-aa.pdf,
> DIN_SPEC_91379_Sequences-ab.pdf, DIN_SPEC_91379_Sequences-ac.pdf,
> DIN_SPEC_91379_Sequences.txt, DefaultScriptProcessor.java,
> DoGlyphLayoutDinSpec91379.pdf, DoGlyphLayoutDinSpec91379Form.pdf,
> DoGlyphPositionBengali.pdf, ExamplePdfboxFopPos-By-Tilman.pdf,
> ExamplePdfboxFopPos.java, ExamplePdfboxFopPos.pdf,
> ExamplePdfboxFopPosForm.java, ExamplePdfboxFopPosForm.pdf, TestPdfbox.java,
> TestPdfboxFop2.java, TestPdfboxFop2.pdf, TestPdfboxJava2D.java,
> TestPdfboxJava2D.pdf, patch-2020-10-02.txt, pdfbox.patch, pdfbox.pdf,
> screenshot-1.png
>
>
> Accented Letters composed of Unicode base letter and combining accent are
> rendered wrong. E.g. with 0041 030B LATIN CAPITAL LETTER A WITH COMBINING
> DOUBLE ACUTE ACCENT the accent appears at the right hand side of the letter
> A, not above the letter A.
> The position is wrong for most of the sequences defined in the following spec:
> DIN SPEC 91379: Characters in Unicode for the electronic processing of names
> and data
> exchange in Europe; with digital attachment
> [https://www.xoev.de/downloads-2316#StringLatin]
> [https://www.din.de/de/wdc-beuth:din21:301228458]
>
> The correct rendering should look like the output of hb-view 2.6.8, see files
> DIN_SPEC_91379_Sequences*.pdf.
> The output of PDFBox is appended in pdfbox.pdf, which is created by running
> TestPdfbox.java. The sequences are read from file
> DIN_SPEC_91379_Sequences.txt.
>
> Font used for testing: NotoSansMono-Regular.ttf, see
> [https://www.google.com/get/noto/]
> download:
> [https://noto-website-2.storage.googleapis.com/pkgs/NotoSansMono-hinted.zip]
> See also FOP-2969
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]