[ 
https://issues.apache.org/jira/browse/PDFBOX-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17932934#comment-17932934
 ] 

Matti Oinas commented on PDFBOX-4728:
-------------------------------------

[https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.7old.pdf]

"As stated above, name objects are treated as atomic symbols within a PDF file.
Ordinarily, the bytes making up the name are never treated as text to be presented
to a human user or to an application external to a PDF consumer.
....
Note: PDF does not prescribe what UTF-8 sequence to choose for representing any
given piece of externally specified text as a name object. In some cases, multiple
UTF-8 sequences could represent the same logical text. Name objects defined by
different sequences of bytes constitute distinct name objects in PDF, even though
the UTF-8 sequences might have identical external interpretations."

I read this as saying that a name should be kept as the exact sequence of bytes
that was read, with no decoding applied to it. Decoding should happen only when
the name has to be shown to a human, for example in the debugger app. The name
would then stay unchanged between load and save, simply because no decode and
encode operations touch the original bytes.
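
A minimal sketch of that idea, not a patch against PDFBox: {{RawName}} and all of
its methods are hypothetical names made up for illustration. The class keeps the
bytes of a name after resolving #xx escapes, compares names by bytes only, and
decodes to text only when asked to for display. The charset passed to
{{forDisplay}} is an assumption left to the caller.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Arrays;

/** Hypothetical raw-byte name holder (illustration only, not a PDFBox class). */
final class RawName {
    // Name bytes after #xx unescaping; never decoded for storage or comparison.
    private final byte[] bytes;

    private RawName(byte[] bytes) {
        this.bytes = bytes;
    }

    /** Parses the bytes following the leading '/', resolving #xx escapes. */
    static RawName parse(byte[] token) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < token.length; i++) {
            if (token[i] == '#' && i + 2 < token.length) {
                int hi = Character.digit(token[i + 1], 16);
                int lo = Character.digit(token[i + 2], 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 2;
                    continue;
                }
            }
            out.write(token[i]); // regular byte, or a malformed escape kept as-is
        }
        return new RawName(out.toByteArray());
    }

    /** Writes the name back in PDF syntax, escaping from the stored bytes. */
    String toPdfSyntax() {
        StringBuilder sb = new StringBuilder("/");
        for (byte b : bytes) {
            int v = b & 0xFF;
            boolean regular = v > 0x20 && v < 0x7F && "#/()<>[]{}%".indexOf(v) < 0;
            if (regular) {
                sb.append((char) v);
            } else {
                sb.append('#').append(String.format("%02x", v));
            }
        }
        return sb.toString();
    }

    /** Decodes for human display only; the charset is the caller's guess. */
    String forDisplay(Charset charset) {
        return new String(bytes, charset);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof RawName && Arrays.equals(bytes, ((RawName) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }
}
```

With this shape, parsing the bytes of {{L#f8vetann}} and writing the name back
yields {{/L#f8vetann}} again, because the byte 0xF8 is never passed through a
charset on the load and save path.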

> Broken PDF after load and save
> ------------------------------
>
>                 Key: PDFBOX-4728
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4728
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing, Writing
>    Affects Versions: 2.0.18, 3.0.0 PDFBox, 3.0.4 PDFBox
>            Reporter: Matti Oinas
>            Priority: Major
>         Attachments: PDFBOX-4728.patch, image-2025-03-06-07-28-20-426.png
>
>
> If a name is read using the WINDOWS-1252 charset and written using UTF-8,
> the resulting PDF is broken after load and save operations:
> {{PDDocument document = PDDocument.load(sourcePath);}}
> {{document.save(targetPath);}}
> This happens when the source PDF contains an XObject dictionary reference
> whose name isn't encoded in UTF-8, for example:
> /L#f8vetann 16 0 R
> That name is read using WINDOWS-1252 encoding. If the write operation then
> uses UTF-8, the resulting name is
> /L#3Fvetann 16 0 R
> The resulting PDF is broken and the image is missing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
