[jira] [Commented] (PDFBOX-5230) Zero-width non-joiner characters visible in generated PDF

Daniel Gredler (Jira) Thu, 27 Feb 2025 06:21:19 -0800


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931182#comment-17931182
 ]


Daniel Gredler commented on PDFBOX-5230:
----------------------------------------

Thanks for all the feedback!

> Re the PR, some parts I'll do and some I won't.

Makes sense. I saw that you already incorporated the relevant parts, so I've 
closed the PR on GitHub.

> you're removing the optimization done in PDFBOX-5823

Yes, but that optimization targeted the if-check itself, not the logic it 
guards, no? I think the question is whether the logic inside the if-block is 
significantly faster or more memory-efficient than the logic in the 
else-block. The current logic does avoid instantiating at least 3 objects 
when the word is a space, though (I think).

> I don't think that split() would ever return null, but obviously it was there 
> for a reason

Yeah, it seems impossible to me, but I've been wrong before!

> This special treatment of ZW codes might break use cases

The proposed change doesn't really break anything though, does it? What I mean 
is that today these four zero-width characters, with their original glyphs, 
are already making it into the PDFs generated by PDFBox. The only change is to 
make sure that, when PDFBox is in charge of creating the PDF, the glyphs we 
add for them are indeed zero-width and zero-contour (invisible). Copy/pasting 
the text, or programmatically extracting it, will still include these 
characters, but that's the same behavior as today, and I would argue probably 
the desired behavior. Let me know if that makes sense, or if I misunderstood 
part of the problem!
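
To make the extraction point concrete, here is a minimal sketch. It assumes 
the extracted text is the same string that the test case in the issue passes 
to `showText`. The class and helper names (`ZwnjTextCheck`, `countZwnj`, 
`stripZwnj`) are hypothetical and it does not call PDFBox at all; it only 
shows that U+200C stays in the character data regardless of how the glyph is 
drawn.

```java
public class ZwnjTextCheck {
    /** U+200C ZERO WIDTH NON-JOINER. */
    static final char ZWNJ = '\u200C';

    /** Counts the ZWNJ characters in the given text. */
    static int countZwnj(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == ZWNJ) {
                count++;
            }
        }
        return count;
    }

    /** Returns the text with every ZWNJ removed (hypothetical consumer-side cleanup). */
    static String stripZwnj(String text) {
        return text.replace(String.valueOf(ZWNJ), "");
    }

    public static void main(String[] args) {
        // Assumption: same string as the first showText call in the issue's test case.
        String extracted = "t\u200Ce\u200Cs\u200Ct\u200C \u200C1";
        System.out.println(extracted.length());   // 11 chars: 6 visible + 5 ZWNJ
        System.out.println(countZwnj(extracted)); // 5
        System.out.println(stripZwnj(extracted)); // test 1
    }
}
```

A consumer that does not want the characters can strip them after 
extraction, as `stripZwnj` does; rendering them invisibly does not change 
that.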

Here is my first pass at a fix for this issue: 
https://github.com/apache/pdfbox/pull/203. I tried to add the necessary 
flexibility to `fontbox` without changing any default behavior there, and to 
change the default behavior only in `pdfbox`. The new test in `fontbox` 
passes, but the new test in `pdfbox` doesn't, and I'm not sure why. For some 
reason the widths aren't coming back as zero for U+200C. It seems like there's 
some sort of char / gid / cid mapping issue, so the lookup just returns the 
default font char width (1000). Any suggestions?
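
To illustrate the kind of mapping issue I suspect, here is a toy model. It is 
not PDFBox code, and every name in it (`WidthLookupSketch`, `register`, 
`widthByGid`, `widthForCodePoint`, and glyph id 5) is made up for the example. 
It only shows how a width table keyed by glyph id falls back to the default 
of 1000 when it is queried with the wrong kind of identifier; whether that is 
what actually happens in the PR is still an open question.

```java
import java.util.HashMap;
import java.util.Map;

public class WidthLookupSketch {
    /** Fallback width in 1/1000 em, used when no entry is found. */
    static final int DEFAULT_WIDTH = 1000;

    // Hypothetical tables: code point -> glyph id, and glyph id -> width.
    private final Map<Integer, Integer> codePointToGid = new HashMap<>();
    private final Map<Integer, Integer> widthsByGid = new HashMap<>();

    /** Registers a glyph for a code point together with its width. */
    void register(int codePoint, int gid, int width) {
        codePointToGid.put(codePoint, gid);
        widthsByGid.put(gid, width);
    }

    /** Looks up a width by glyph id, falling back to the default. */
    int widthByGid(int gid) {
        return widthsByGid.getOrDefault(gid, DEFAULT_WIDTH);
    }

    /** Maps the code point to its glyph id first, then looks up the width. */
    int widthForCodePoint(int codePoint) {
        Integer gid = codePointToGid.get(codePoint);
        return gid == null ? DEFAULT_WIDTH : widthByGid(gid);
    }

    public static void main(String[] args) {
        WidthLookupSketch font = new WidthLookupSketch();
        font.register(0x200C, 5, 0); // ZWNJ -> made-up gid 5, width 0

        // Correct path: code point -> gid -> width.
        System.out.println(font.widthForCodePoint(0x200C)); // 0

        // Suspected failure mode: the code point is used as if it were a gid,
        // nothing is found, and the default comes back.
        System.out.println(font.widthByGid(0x200C)); // 1000
    }
}
```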

> Zero-width non-joiner characters visible in generated PDF
> ---------------------------------------------------------
>
>                 Key: PDFBOX-5230
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5230
>             Project: PDFBox
>          Issue Type: Bug
>          Components: FontBox, PDModel, Writing
>    Affects Versions: 2.0.16
>            Reporter: Daniel Gredler
>            Priority: Major
>         Attachments: Af.pdf, zwnj-pdfkit.pdf, zwnj.pdf, zwnj.png
>
>
> I'd like to use the [zero-width 
> non-joiner|https://en.wikipedia.org/wiki/Zero-width_non-joiner] (ZWNJ) 
> character to prevent character shaping in some cases when using Arabic and 
> Indic scripts. This works correctly using some fonts like Arial Unicode 
> (character shaping is prevented and no ZWNJ glyph is visible in the PDF), but 
> does not work correctly when using fonts like Tahoma or Google Noto Sans 
> Regular, where the ZWNJ character is visible in the PDF. The ZWNJ glyph is 
> not visible when using these fonts in other programs, like Microsoft Word.
> I suspect that the `advanceWidth` settings in the `hmtx` table should be 
> taken into account somehow but are not, because the `advanceWidth` for this 
> glyph is 0 in both of these fonts which are erroneously generating visual 
> artifacts for the ZWNJ character (Tahoma and Google Noto Sans Regular).
> Test case generating the attached PDF file:
> {code:java}
> import java.io.File;
> import java.io.IOException;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
> import org.apache.pdfbox.pdmodel.PDPageContentStream;
> import org.apache.pdfbox.pdmodel.common.PDRectangle;
> import org.apache.pdfbox.pdmodel.font.PDFont;
> import org.apache.pdfbox.pdmodel.font.PDType0Font;
>
> public class ZwnjTest {
>     public static void main(String[] args) throws IOException {
>         try (PDDocument document = new PDDocument()) {
>             PDPage page = new PDPage(PDRectangle.LETTER);
>             document.addPage(page);
>             try (PDPageContentStream stream =
>                     new PDPageContentStream(document, page)) {
>                 // Tahoma: ZWNJ glyph is a vertical bar, but advanceWidth in
>                 // hmtx table is 0 -> shown in PDF anyway (unexpected)
>                 PDFont tahoma = PDType0Font.load(document,
>                         new File("C:/Windows/Fonts/tahoma.ttf"));
>                 stream.beginText();
>                 stream.setFont(tahoma, 20);
>                 stream.newLineAtOffset(50, 650);
>                 // U+200C = zero width non-joiner
>                 stream.showText("t\u200Ce\u200Cs\u200Ct\u200C \u200C1");
>                 stream.endText();
>                 // Arial Unicode: ZWNJ glyph contains no outline -> not shown
>                 // in PDF (as expected)
>                 PDFont arialu = PDType0Font.load(document,
>                         new File("C:/Windows/Fonts/ARIALUNI.TTF"));
>                 stream.beginText();
>                 stream.setFont(arialu, 20);
>                 stream.newLineAtOffset(50, 600);
>                 // U+200C = zero width non-joiner
>                 stream.showText("t\u200Ce\u200Cs\u200Ct\u200C \u200C2");
>                 stream.endText();
>                 // Google Noto Sans Regular: ZWNJ glyph is a vertical bar,
>                 // but advanceWidth in hmtx table is 0 -> shown in PDF anyway
>                 // (unexpected)
>                 PDFont gnotos = PDType0Font.load(document,
>                         new File("noto-sans-regular.ttf"));
>                 stream.beginText();
>                 stream.setFont(gnotos, 20);
>                 stream.newLineAtOffset(50, 550);
>                 // U+200C = zero width non-joiner
>                 stream.showText("t\u200Ce\u200Cs\u200Ct\u200C \u200C3");
>                 stream.endText();
>             }
>             document.save("zwnj.pdf");
>         }
>     }
> }
> {code}



