[ https://issues.apache.org/jira/browse/PDFBOX-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17932934#comment-17932934 ]
Matti Oinas commented on PDFBOX-4728:
-------------------------------------

[https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.7old.pdf]

"As stated above, name objects are treated as atomic symbols within a PDF file. Ordinarily, the bytes making up the name are never treated as text to be presented to a human user or to an application external to a PDF consumer. .... Note: PDF does not prescribe what UTF-8 sequence to choose for representing any given piece of externally specified text as a name object. In some cases, multiple UTF-8 sequences could represent the same logical text. Name objects defined by different sequences of bytes constitute distinct name objects in PDF, even though the UTF-8 sequences might have identical external interpretations."

I read this to mean that a name should be kept as the exact sequence of bytes that was read, and never decoded. Decoding should happen only when the name has to be shown to a human, for example in the debugger app. The name would then stay unchanged between load and save, simply because no decode and encode operations would touch the original bytes.

> Broken PDF after load and save
> ------------------------------
>
>                 Key: PDFBOX-4728
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4728
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing, Writing
>    Affects Versions: 2.0.18, 3.0.0 PDFBox, 3.0.4 PDFBox
>            Reporter: Matti Oinas
>            Priority: Major
>         Attachments: PDFBOX-4728.patch, image-2025-03-06-07-28-20-426.png
>
>
> If the file is read using the WINDOWS-1252 charset and written using
> UTF-8, the resulting PDF is broken after the load and save operations.
> {{PDDocument document = PDDocument.load(sourcePath);}}
> {{document.save(targetPath);}}
> This happens when the source PDF contains an XObject dictionary reference
> whose name isn't encoded in UTF-8, for example:
> /L#f8vetann 16 0 R
> That name is read using the WINDOWS-1252 encoding.
> Now if the write operation uses UTF-8, the resulting name will be
> /L#3Fvetann 16 0 R
> The resulting PDF is then broken and the image is missing.
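
To make the round trip in the report concrete, here is a minimal Java sketch, for illustration only. `NameRoundTripDemo`, `recode`, and `escape` are hypothetical names and not PDFBox API, and the escaping rule is a simplification (only ASCII letters and digits are written literally). It compares writing the raw bytes of the name from the report with decoding them as WINDOWS-1252 and re-encoding them. Which charset PDFBox really uses at write time is not verified here; the US-ASCII case is an assumption, included only because it reproduces the `#3F` shown in the report.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PDFBox.
class NameRoundTripDemo {

    // windows-1252 is assumed to be available in the running JDK.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decode the raw name bytes with one charset, then encode with another.
    // This is the decode/encode step the comment argues should be avoided.
    static byte[] recode(byte[] raw, Charset readAs, Charset writeAs) {
        return new String(raw, readAs).getBytes(writeAs);
    }

    // Simplified name writer: ASCII letters and digits are emitted literally,
    // every other byte as "#XX" (uppercase hex, same byte value as lowercase).
    static String escape(byte[] nameBytes) {
        StringBuilder sb = new StringBuilder("/");
        for (byte b : nameBytes) {
            int v = b & 0xFF;
            boolean literal = (v >= 'A' && v <= 'Z')
                    || (v >= 'a' && v <= 'z')
                    || (v >= '0' && v <= '9');
            if (literal) {
                sb.append((char) v);
            } else {
                sb.append(String.format("#%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Bytes of the name from the report: L, 0xF8, v, e, t, a, n, n
        byte[] raw = {'L', (byte) 0xF8, 'v', 'e', 't', 'a', 'n', 'n'};

        // Bytes kept as read: the name is unchanged.
        System.out.println(escape(raw));                       // /L#F8vetann

        // Decoded as windows-1252, re-encoded as UTF-8: a different name.
        System.out.println(escape(
                recode(raw, CP1252, StandardCharsets.UTF_8))); // /L#C3#B8vetann

        // Decoded as windows-1252, re-encoded as US-ASCII: the character is
        // unmappable and replaced by '?', which is written as #3F.
        System.out.println(escape(
                recode(raw, CP1252, StandardCharsets.US_ASCII))); // /L#3Fvetann
    }
}
```

In both re-encoded cases the bytes of the name differ from the original, so a lookup by the original name no longer matches. Keeping the bytes untouched avoids this regardless of which charsets are involved.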