[ https://issues.apache.org/jira/browse/PDFBOX-5230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17929808#comment-17929808 ]
Daniel Gredler edited comment on PDFBOX-5230 at 2/24/25 8:49 PM:
-----------------------------------------------------------------

[~tilman] I wanted to have another look at this one, and perhaps create a fix PR. I think I'll use the "Save as PDF" functionality in LibreOffice Writer and in the online version of Microsoft Office to check their behavior, and if their behavior matches, assume that's "best practice" (or at least a good starting point).

Does this sound like a good foundation for a possible fix for this issue? Also, if you have any other thoughts or advice on this one, I'm all ears.

*EDIT:* LibreOffice and MS Office actually behave slightly differently, unfortunately. So here's the approach I'm thinking about using for ZWNJ (and possibly also ZWSP, ZWJ, WJ and ZWNBSP):

# If the font being used is one of the 14 built-in fonts (the Times Roman, Helvetica and Courier families, Symbol or Zapf Dingbats), it means that only Latin text is supported. For these fonts, remove the zero-width character from the text: it should not be rendered, and it usually serves no layout purpose in Latin text.
# If the font being used is a custom subset font, customize the subsetting process to force the zero-width character glyph to actually be zero-width and zero-contour / zero-path, regardless of whether the source font defines a glyph with a non-zero width and/or contour paths. This way, if e.g. ZWNJ is used to break up an Arabic ligature, the in-PDF text still contains the ZWNJ and behaves correctly, but there is no risk of an extra unwanted glyph being rendered.

Thoughts? I can put together a PR, if you'd like to see what this would look like.

was (Author: sdanig):
[~tilman] I wanted to have another look at this one, and perhaps create a fix PR. I think I'll use the "Save as PDF" functionality in LibreOffice Writer and in the online version of Microsoft Office to check their behavior, and if their behavior matches, assume that's "best practice" (or at least a good starting point).

Does this sound like a good foundation for a possible fix for this issue? Also, if you have any other thoughts or advice on this one, I'm all ears.

> Zero-width non-joiner characters visible in generated PDF
> ---------------------------------------------------------
>
>                 Key: PDFBOX-5230
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5230
>             Project: PDFBox
>          Issue Type: Bug
>          Components: FontBox, PDModel, Writing
>    Affects Versions: 2.0.16
>            Reporter: Daniel Gredler
>            Priority: Major
>         Attachments: Af.pdf, zwnj-pdfkit.pdf, zwnj.pdf, zwnj.png
>
>
> I'd like to use the [zero-width non-joiner|https://en.wikipedia.org/wiki/Zero-width_non-joiner] (ZWNJ)
> character to prevent character shaping in some cases when using Arabic and
> Indic scripts. This works correctly using some fonts like Arial Unicode
> (character shaping is prevented and no ZWNJ glyph is visible in the PDF), but
> does not work correctly when using fonts like Tahoma or Google Noto Sans
> Regular, where the ZWNJ character is visible in the PDF. The ZWNJ glyph is
> not visible when using these fonts in other programs, like Microsoft Word.
>
> I suspect that the `advanceWidth` settings in the `hmtx` table should be
> taken into account somehow but are not, because the `advanceWidth` for this
> glyph is 0 in both of these fonts which are erroneously generating visual
> artifacts for the ZWNJ character (Tahoma and Google Noto Sans Regular).
>
> Test case generating the attached PDF file:
> {code:java}
> import java.io.File;
> import java.io.IOException;
>
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
> import org.apache.pdfbox.pdmodel.PDPageContentStream;
> import org.apache.pdfbox.pdmodel.common.PDRectangle;
> import org.apache.pdfbox.pdmodel.font.PDFont;
> import org.apache.pdfbox.pdmodel.font.PDType0Font;
>
> public class ZwnjTest {
>     public static void main(String[] args) throws IOException {
>         try (PDDocument document = new PDDocument()) {
>             PDPage page = new PDPage(PDRectangle.LETTER);
>             document.addPage(page);
>             try (PDPageContentStream stream = new PDPageContentStream(document, page)) {
>                 // Tahoma: ZWNJ glyph is a vertical bar, but advanceWidth in hmtx table is 0 -> shown in PDF anyway (unexpected)
>                 PDFont tahoma = PDType0Font.load(document, new File("C:/Windows/Fonts/tahoma.ttf"));
>                 stream.beginText();
>                 stream.setFont(tahoma, 20);
>                 stream.newLineAtOffset(50, 650);
>                 stream.showText("t\u200Ce\u200Cs\u200Ct\u200C \u200C1"); // U+200C = zero width non-joiner
>                 stream.endText();
>                 // Arial Unicode: ZWNJ glyph contains no outline -> not shown in PDF (as expected)
>                 PDFont arialu = PDType0Font.load(document, new File("C:/Windows/Fonts/ARIALUNI.TTF"));
>                 stream.beginText();
>                 stream.setFont(arialu, 20);
>                 stream.newLineAtOffset(50, 600);
>                 stream.showText("t\u200Ce\u200Cs\u200Ct\u200C \u200C2"); // U+200C = zero width non-joiner
>                 stream.endText();
>                 // Google Noto Sans Regular: ZWNJ glyph is a vertical bar, but advanceWidth in hmtx table is 0 -> shown in PDF anyway (unexpected)
>                 PDFont gnotos = PDType0Font.load(document, new File("noto-sans-regular.ttf"));
>                 stream.beginText();
>                 stream.setFont(gnotos, 20);
>                 stream.newLineAtOffset(50, 550);
>                 stream.showText("t\u200Ce\u200Cs\u200Ct\u200C \u200C3"); // U+200C = zero width non-joiner
>                 stream.endText();
>             }
>             document.save("zwnj.pdf");
>         }
>     }
> }
> {code}
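
For illustration only, here is a rough sketch of what step 1 of the approach proposed in the comment (dropping zero-width characters before text is shown with one of the 14 built-in fonts) might look like in plain Java. The class name `ZeroWidthStripper` and the method `strip` are hypothetical and are not part of the PDFBox API; the set of code points is simply the five characters named in the comment (ZWSP, ZWNJ, ZWJ, WJ, ZWNBSP). An actual fix may well take a different shape.

```java
import java.util.Set;

// Hypothetical helper, not PDFBox API: removes the zero-width characters
// named in the comment from a string before it is passed to showText().
final class ZeroWidthStripper {

    // U+200B ZWSP, U+200C ZWNJ, U+200D ZWJ, U+2060 WJ, U+FEFF ZWNBSP
    private static final Set<Integer> ZERO_WIDTH =
            Set.of(0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF);

    private ZeroWidthStripper() {
    }

    // Returns a copy of the text with every zero-width code point dropped.
    static String strip(String text) {
        StringBuilder result = new StringBuilder(text.length());
        text.codePoints()
            .filter(cp -> !ZERO_WIDTH.contains(cp))
            .forEach(result::appendCodePoint);
        return result.toString();
    }

    public static void main(String[] args) {
        // Same sample string as the test case in the issue description.
        String input = "t\u200Ce\u200Cs\u200Ct\u200C \u200C1";
        System.out.println(strip(input)); // prints "test 1"
    }
}
```

Under this assumption, a caller would only apply `strip` when the selected font is one of the built-in fonts; for embedded subset fonts (step 2) the text would be left untouched so that ZWNJ can still prevent shaping.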