[ 
https://issues.apache.org/jira/browse/PDFBOX-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081893#comment-18081893
 ] 

Tilman Hausherr edited comment on PDFBOX-4951 at 5/19/26 4:43 AM:
------------------------------------------------------------------

I thought I should make an effort to look into this (PR 436). So you're adding 
a processor that returns something independent of whatever library is used to 
get the positions (which I like). I then asked copilot about it. Re the 
answers, IIRC we don't expect multithreading. The other complaint sounds valid. 
Besides that, I don't make these opinions my own.

===

*Assessment: not ready to approve yet.* The PR is promising and mergeable from 
GitHub’s perspective, but I see at least {*}two substantive review concerns{*}: 
one likely functional bug in the new glyph-positioning path, and one API/design 
risk around mutable shared state.
h2. PR summary
 * *PR:* #436 — _GlyphLayoutProcessor for correct glyph layout and support of 
DIN 91379 (Issue PDFBOX-4951)_
 * *Author:* {{vk-github18}}
 * *State:* Open, not draft
 * *Mergeability:* {{clean}} / mergeable
 * *Scope:* {*}13 files changed{*}, {*}+1084 / -2{*}, *3 commits*
 * *Reviews/comments:* no reviews, no inline review comments, no PR comments
 * *Checks:* no check runs were reported for the head commit

h3. What it does

This PR introduces a new {{GlyphLayoutProcessor}} pipeline that uses Java AWT 
glyph layout to improve positioning of complex Latin glyph sequences, 
especially for *DIN 91379* support. It wires that processor into:
 * normal content streams
 * AcroForm appearance generation
 * example code/resources demonstrating usage

Core implementation files:
 * {{pdfbox/src/main/java/org/apache/pdfbox/pdmodel/GlyphLayoutProcessor.java}}
 * {{pdfbox/src/main/java/org/apache/pdfbox/pdmodel/GlyphLayoutFontLoader.java}}
 * {{pdfbox/src/main/java/org/apache/pdfbox/pdmodel/GlyphsAndPositions.java}}
 * modifications in {{PDAbstractContentStream}}
 * modifications in {{PDAcroForm}}
 * modifications in {{AppearanceGeneratorHelper}}

h2. Core changes
 # *Adds a glyph-layout abstraction*

 * 
 ** Loads a PDFBox {{PDType0Font}} and matching AWT {{Font}}
 ** Uses {{layoutGlyphVector(...)}} to compute glyph order/positions
 ** Emits PDF text with explicit positioning adjustments when needed

 # *Extends content-stream writing*

 * 
 ** {{PDAbstractContentStream.showText(...)}} now conditionally routes through 
{{GlyphLayoutProcessor}}
 ** Adds lower-level helpers for writing glyph IDs and TJ-style positioning 
arrays

 # *Extends AcroForm handling*

 * 
 ** {{PDAcroForm}} can store a {{GlyphLayoutProcessor}}
 ** appearance generation forwards it into form field rendering

 # *Adds examples/resources*

 * 
 ** two Java examples
 ** sample form/pdf/font resources
 ** license file for bundled font

----
h2. Review findings
h3. 1. *Likely bug: {{showText(String)}} no longer enforces the old 
preconditions before delegating*

In {{{}PDAbstractContentStream.showText(String){}}}, the old implementation 
always went through {{showTextInternal(text)}} first, which performed the usual 
safety checks like text mode and font setup. The new implementation bypasses 
that when glyph layout is enabled.

Before, the method effectively always validated state before writing text. Now:
apache / pdfbox / files
 
 {{public void showText(String text) throws IOException
{ PDFont font = fontStack.peek(); if(glyphLayoutProcessor != null && 
glyphLayoutProcessor.supportsFont(font))
{
glyphLayoutProcessor.showText(text);}}
*Problem:* {{fontStack.peek()}} is called before checking whether:
 * text mode has begun
 * a font has been set
 * {{fontStack}} is non-empty

That changes behavior from a controlled {{IllegalStateException}} path into a 
likely {{{}EmptyStackException{}}}/state inconsistency path.

*Why this matters:* this is a public core rendering API path. Regressing error 
semantics in {{showText()}} is risky and could break callers in surprising ways.

*Suggested fix:* preserve the old precondition flow before branching to 
glyph-layout behavior. For example, validate via existing internal checks 
first, or explicitly check {{inTextMode}} / {{fontStack.isEmpty()}} before 
{{{}peek(){}}}.
----
h3. 2. *Thread-safety / API-state risk in {{GlyphLayoutProcessor}}*

The PR description says this is “Designed for multithreading environment,” but 
the processor stores mutable per-write state:
 * {{private PDType0Font font;}}
 * {{private float fontSize;}}
 * {{private PDAbstractContentStream contentStream;}}

and these are mutated through:
 * {{setContentStream(...)}}
 * {{setFontAndSize(...)}}

That means the processor is not inherently thread-safe if reused across content 
streams, pages, or forms concurrently.

Relevant design shape:
apache / pdfbox / files
 
 {{private PDType0Font font;private float fontSize;private 
PDAbstractContentStream contentStream;}}
*Why this matters:* the PR explicitly advertises multithreaded suitability, but 
the object is stateful and shared by reference through setters. If a caller 
reuses one processor instance across threads or streams, one write could 
overwrite another’s {{{}font{}}}, {{{}fontSize{}}}, or {{{}contentStream{}}}.

*Suggested fix:* either:
 * make the processor stateless for layout operations by passing 
stream/font/fontSize as method parameters, or
 * clearly document that a processor instance must not be shared concurrently, 
or
 * scope processor state to the call rather than the object

As written, I would not accept the multithreading claim.
----
h2. Other notable changes
h3. New low-level glyph writing helpers

The PR adds {{{}writeTextPDType0Font(...){}}}, 
{{{}showTextPDType0Font(...){}}}, and {{showGlyphsWithPositioning(...)}} in 
{{{}PDAbstractContentStream{}}}.

This is a reasonable direction, but it increases the surface area of 
content-stream internals and duplicates some existing text-writing logic. Worth 
checking for consistency with existing subset/encoding assumptions.
h3. AcroForm integration

The AcroForm wiring is straightforward:
 * {{PDAcroForm}} stores the processor
 * {{AppearanceGeneratorHelper}} and page append streams apply it if present

That part looks conceptually coherent.
h3. Examples/resources

Useful for demonstrating the feature, though the PR is fairly example-heavy 
relative to the core API addition.
----
h2. Merge readiness and risk
h3. Readiness

*Not ready for approval* yet.
h3. Risk

*Medium to high* because this modifies a central text-rendering class:
 * {{PDAbstractContentStream.showText(...)}} is core behavior
 * text encoding/subsetting paths are delicate
 * form rendering is affected too

h3. Blockers before merge
 # *Fix {{showText()}} state validation regression*
 # *Resolve or document the thread-safety contradiction*
 # Ideally add tests covering:
 ** {{showText()}} error behavior with and without glyph layout enabled
 ** glyph-layout output for combining marks / DIN 91379 sample cases
 ** AcroForm appearance generation with the processor enabled

----
h2. Possible improvements
 * *API naming/docs cleanup:* {{GlyphLayoutProcessor}} is more than a 
processor; it also owns font loading and stream state.
 * *Typos/comments:* there are a few small comment/doc issues like “lyph layout 
processor”, “specofierd”, “embesSubset”.
 * *State minimization:* {{newLineAtOffsetBasic(...)}} is added but not used in 
the visible patch; if unused, avoid expanding API surface unnecessarily.
 * *Resource handling:* {{loadFont(...)}} copies the full font stream into 
memory. That may be acceptable, but deserves a comment since it’s done to 
initialize both PDFBox and AWT font representations.

----
h2. Recommendation

{*}Request changes{*}, specifically on:
 # preserving {{showText()}} precondition behavior
 # clarifying/fixing thread-safety of {{GlyphLayoutProcessor}}

If you want, I can next do one of these:
 # *Draft exact review comments* you can paste on the PR
 # *Deep-dive a specific file* like {{PDAbstractContentStream.java}}
 # *Look for missing tests / propose test cases*
 # *Summarize the PR in maintainer-friendly approval language once blockers are 
fixed*


was (Author: tilman):
I thought I should make an effort to look into this (PR 436). So you're adding 
a processor that returns something independent of whatever library is used to 
get the positions (which I like). I then asked copilot about it. Re the 
answers, IIRC we don't expect multithreading. The other complaint sounds valid.

===

*Assessment: not ready to approve yet.* The PR is promising and mergeable from 
GitHub’s perspective, but I see at least {*}two substantive review concerns{*}: 
one likely functional bug in the new glyph-positioning path, and one API/design 
risk around mutable shared state.
h2. PR summary
 * *PR:* #436 — _GlyphLayoutProcessor for correct glyph layout and support of 
DIN 91379 (Issue PDFBOX-4951)_
 * *Author:* {{vk-github18}}
 * *State:* Open, not draft
 * *Mergeability:* {{clean}} / mergeable
 * *Scope:* {*}13 files changed{*}, {*}+1084 / -2{*}, *3 commits*
 * *Reviews/comments:* no reviews, no inline review comments, no PR comments
 * *Checks:* no check runs were reported for the head commit

h3. What it does

This PR introduces a new {{GlyphLayoutProcessor}} pipeline that uses Java AWT 
glyph layout to improve positioning of complex Latin glyph sequences, 
especially for *DIN 91379* support. It wires that processor into:
 * normal content streams
 * AcroForm appearance generation
 * example code/resources demonstrating usage

Core implementation files:
 * {{pdfbox/src/main/java/org/apache/pdfbox/pdmodel/GlyphLayoutProcessor.java}}
 * {{pdfbox/src/main/java/org/apache/pdfbox/pdmodel/GlyphLayoutFontLoader.java}}
 * {{pdfbox/src/main/java/org/apache/pdfbox/pdmodel/GlyphsAndPositions.java}}
 * modifications in {{PDAbstractContentStream}}
 * modifications in {{PDAcroForm}}
 * modifications in {{AppearanceGeneratorHelper}}

h2. Core changes
 # *Adds a glyph-layout abstraction*

 * 
 ** Loads a PDFBox {{PDType0Font}} and matching AWT {{Font}}
 ** Uses {{layoutGlyphVector(...)}} to compute glyph order/positions
 ** Emits PDF text with explicit positioning adjustments when needed

 # *Extends content-stream writing*

 * 
 ** {{PDAbstractContentStream.showText(...)}} now conditionally routes through 
{{GlyphLayoutProcessor}}
 ** Adds lower-level helpers for writing glyph IDs and TJ-style positioning 
arrays

 # *Extends AcroForm handling*

 * 
 ** {{PDAcroForm}} can store a {{GlyphLayoutProcessor}}
 ** appearance generation forwards it into form field rendering

 # *Adds examples/resources*

 * 
 ** two Java examples
 ** sample form/pdf/font resources
 ** license file for bundled font

----
h2. Review findings
h3. 1. *Likely bug: {{showText(String)}} no longer enforces the old 
preconditions before delegating*

In {{{}PDAbstractContentStream.showText(String){}}}, the old implementation 
always went through {{showTextInternal(text)}} first, which performed the usual 
safety checks like text mode and font setup. The new implementation bypasses 
that when glyph layout is enabled.

Before, the method effectively always validated state before writing text. Now:
apache / pdfbox / files
 
 {{public void showText(String text) throws IOException
{ PDFont font = fontStack.peek(); if(glyphLayoutProcessor != null && 
glyphLayoutProcessor.supportsFont(font))
{
glyphLayoutProcessor.showText(text);}}
*Problem:* {{fontStack.peek()}} is called before checking whether:
 * text mode has begun
 * a font has been set
 * {{fontStack}} is non-empty

That changes behavior from a controlled {{IllegalStateException}} path into a 
likely {{{}EmptyStackException{}}}/state inconsistency path.

*Why this matters:* this is a public core rendering API path. Regressing error 
semantics in {{showText()}} is risky and could break callers in surprising ways.

*Suggested fix:* preserve the old precondition flow before branching to 
glyph-layout behavior. For example, validate via existing internal checks 
first, or explicitly check {{inTextMode}} / {{fontStack.isEmpty()}} before 
{{{}peek(){}}}.
----
h3. 2. *Thread-safety / API-state risk in {{GlyphLayoutProcessor}}*

The PR description says this is “Designed for multithreading environment,” but 
the processor stores mutable per-write state:
 * {{private PDType0Font font;}}
 * {{private float fontSize;}}
 * {{private PDAbstractContentStream contentStream;}}

and these are mutated through:
 * {{setContentStream(...)}}
 * {{setFontAndSize(...)}}

That means the processor is not inherently thread-safe if reused across content 
streams, pages, or forms concurrently.

Relevant design shape:
apache / pdfbox / files
 
 {{private PDType0Font font;private float fontSize;private 
PDAbstractContentStream contentStream;}}
*Why this matters:* the PR explicitly advertises multithreaded suitability, but 
the object is stateful and shared by reference through setters. If a caller 
reuses one processor instance across threads or streams, one write could 
overwrite another’s {{{}font{}}}, {{{}fontSize{}}}, or {{{}contentStream{}}}.

*Suggested fix:* either:
 * make the processor stateless for layout operations by passing 
stream/font/fontSize as method parameters, or
 * clearly document that a processor instance must not be shared concurrently, 
or
 * scope processor state to the call rather than the object

As written, I would not accept the multithreading claim.
----
h2. Other notable changes
h3. New low-level glyph writing helpers

The PR adds {{{}writeTextPDType0Font(...){}}}, 
{{{}showTextPDType0Font(...){}}}, and {{showGlyphsWithPositioning(...)}} in 
{{{}PDAbstractContentStream{}}}.

This is a reasonable direction, but it increases the surface area of 
content-stream internals and duplicates some existing text-writing logic. Worth 
checking for consistency with existing subset/encoding assumptions.
h3. AcroForm integration

The AcroForm wiring is straightforward:
 * {{PDAcroForm}} stores the processor
 * {{AppearanceGeneratorHelper}} and page append streams apply it if present

That part looks conceptually coherent.
h3. Examples/resources

Useful for demonstrating the feature, though the PR is fairly example-heavy 
relative to the core API addition.
----
h2. Merge readiness and risk
h3. Readiness

*Not ready for approval* yet.
h3. Risk

*Medium to high* because this modifies a central text-rendering class:
 * {{PDAbstractContentStream.showText(...)}} is core behavior
 * text encoding/subsetting paths are delicate
 * form rendering is affected too

h3. Blockers before merge
 # *Fix {{showText()}} state validation regression*
 # *Resolve or document the thread-safety contradiction*
 # Ideally add tests covering:
 ** {{showText()}} error behavior with and without glyph layout enabled
 ** glyph-layout output for combining marks / DIN 91379 sample cases
 ** AcroForm appearance generation with the processor enabled

----
h2. Possible improvements
 * *API naming/docs cleanup:* {{GlyphLayoutProcessor}} is more than a 
processor; it also owns font loading and stream state.
 * *Typos/comments:* there are a few small comment/doc issues like “lyph layout 
processor”, “specofierd”, “embesSubset”.
 * *State minimization:* {{newLineAtOffsetBasic(...)}} is added but not used in 
the visible patch; if unused, avoid expanding API surface unnecessarily.
 * *Resource handling:* {{loadFont(...)}} copies the full font stream into 
memory. That may be acceptable, but deserves a comment since it’s done to 
initialize both PDFBox and AWT font representations.

----
h2. Recommendation

{*}Request changes{*}, specifically on:
 # preserving {{showText()}} precondition behavior
 # clarifying/fixing thread-safety of {{GlyphLayoutProcessor}}

If you want, I can next do one of these:
 # *Draft exact review comments* you can paste on the PR
 # *Deep-dive a specific file* like {{PDAbstractContentStream.java}}
 # *Look for missing tests / propose test cases*
 # *Summarize the PR in maintainer-friendly approval language once blockers are 
fixed*

> Sequences of DIN SPEC 91379 with combining letters are rendered incorrectly
> ---------------------------------------------------------------------------
>
>                 Key: PDFBOX-4951
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4951
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering
>    Affects Versions: 2.0.21
>            Reporter: Volker Kunert
>            Priority: Major
>         Attachments: DIN_SPEC_91379_Sequences-aa.pdf, 
> DIN_SPEC_91379_Sequences-ab.pdf, DIN_SPEC_91379_Sequences-ac.pdf, 
> DIN_SPEC_91379_Sequences.txt, DefaultScriptProcessor.java, 
> DoGlyphLayoutDinSpec91379.pdf, DoGlyphLayoutDinSpec91379Form.pdf, 
> DoGlyphPositionBengali.pdf, ExamplePdfboxFopPos-By-Tilman.pdf, 
> ExamplePdfboxFopPos.java, ExamplePdfboxFopPos.pdf, 
> ExamplePdfboxFopPosForm.java, ExamplePdfboxFopPosForm.pdf, TestPdfbox.java, 
> TestPdfboxFop2.java, TestPdfboxFop2.pdf, TestPdfboxJava2D.java, 
> TestPdfboxJava2D.pdf, patch-2020-10-02.txt, pdfbox.patch, pdfbox.pdf, 
> screenshot-1.png
>
>
> Accented Letters composed of Unicode base letter and combining accent are 
> rendered wrong. E.g. with 0041 030B LATIN CAPITAL LETTER A WITH COMBINING 
> DOUBLE ACUTE ACCENT the accent appears at the right hand side of the letter 
> A, not above the letter A.
> The position is wrong for most of the sequences defined in the following spec:
> DIN SPEC 91379: Characters in Unicode for the electronic processing of names 
> and data 
>  exchange in Europe; with digital attachment
>  [https://www.xoev.de/downloads-2316#StringLatin]
>  [https://www.din.de/de/wdc-beuth:din21:301228458]
>  
> The correct rendering should look like the output of hb-view 2.6.8, see files 
> DIN_SPEC_91379_Sequences*.pdf.
> The output of PDFBox is appended in pdfbox.pdf, which is created by running 
> TestPdfbox.java. The sequences are read from file 
> DIN_SPEC_91379_Sequences.txt.
>  
> Font used for testing: NotoSansMono-Regular.ttf, see 
> [https://www.google.com/get/noto/] 
> download: 
> [https://noto-website-2.storage.googleapis.com/pkgs/NotoSansMono-hinted.zip]
>  See also FOP-2969
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to