Tilman Hausherr created PDFBOX-4076:
---------------------------------------
Summary: PDFBox cannot properly handle PDF Name objects containing
bytes with values outside the US_ASCII range
Key: PDFBOX-4076
URL: https://issues.apache.org/jira/browse/PDFBOX-4076
Project: PDFBox
Issue Type: Bug
Reporter: Tilman Hausherr
As reported by ~mkl in a Stack Overflow answer:
{quote}The first error in PDF Name handling is that PDFBox internally
represents them as strings after a mixed UTF-8 / CP-1252 decoding strategy.
This is wrong, according to the PDF specification a name object is an atomic
symbol uniquely defined by a sequence of any characters (8-bit values) except
null (character code 0).
(...)
The second error is, though, that while serializing the PDF it only properly
encodes the characters in the strings representing names which are from
US_ASCII, all else are replaced by '?'{quote}
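To illustrate the byte-oriented view of names described in the quote, here is a minimal sketch (not PDFBox code) of how a name could be serialized from its raw bytes. It assumes the escaping rule for name objects in the PDF specification: bytes outside the printable range '!' to '~', delimiter characters, and '#' itself are written as '#' plus two hex digits. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: writes a PDF name token from raw bytes.
public class NameEscapeSketch {
    // Delimiters and '#' must be escaped even though they are printable ASCII.
    private static final String MUST_ESCAPE = "#()<>[]{}/%";

    static String writeName(byte[] nameBytes) {
        StringBuilder sb = new StringBuilder("/");
        for (byte b : nameBytes) {
            int v = b & 0xFF; // treat the byte as an unsigned 8-bit value
            if (v == 0) {
                throw new IllegalArgumentException("null byte not allowed in a PDF name");
            }
            boolean regular = v >= 0x21 && v <= 0x7E && MUST_ESCAPE.indexOf(v) < 0;
            if (regular) {
                sb.append((char) v);
            } else {
                sb.append('#').append(String.format("%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: the caller chose UTF-8 for the name bytes; the name itself is just bytes.
        byte[] raw = "\u00e4\u00f6\u00fc\u00df".getBytes(StandardCharsets.UTF_8);
        System.out.println(writeName(raw)); // prints /#C3#A4#C3#B6#C3#BC#C3#9F
    }
}
```

With this approach no byte is replaced by '?'; every value except 0 survives a save and reload unchanged, whatever encoding the producer had in mind.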
Sample code:
{code:java}
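// Create a one-page document and add a catalog entry whose key is a non-ASCII name.
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);
document.getDocumentCatalog().getCOSObject().setString(COSName.getPDFName("äöüß"), "äöüß");
// Save to memory, then reload to see how the name survived serialization.
ByteArrayOutputStream baos = new ByteArrayOutputStream();
document.save(baos);
document.close();
document = PDDocument.load(baos.toByteArray());
// Print the catalog keys as parsed back from the saved bytes.
System.out.println(document.getDocumentCatalog().getCOSObject().keySet());
document.close();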
{code}
Output:
{noformat}
[COSName{Type}, COSName{Version}, COSName{Pages}, COSName{????}]
{noformat}
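The '????' in the output matches what the JDK does when a string is encoded with a charset that cannot represent its characters. The following standalone sketch reproduces that effect without PDFBox; it is an illustration of the mechanism, assuming the name is encoded as US_ASCII on save as the quote describes, and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo, not PDFBox code: shows how US_ASCII encoding loses non-ASCII characters.
public class AsciiLossSketch {
    static String roundTripThroughAscii(String name) {
        // String.getBytes(Charset) substitutes the charset's replacement byte ('?')
        // for every character it cannot map.
        byte[] ascii = name.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(roundTripThroughAscii("\u00e4\u00f6\u00fc\u00df")); // prints ????
        System.out.println(roundTripThroughAscii("Pages"));                    // prints Pages
    }
}
```

Because the substitution is silent, distinct non-ASCII names of the same length collapse to the same '?' sequence, so dictionary keys can collide after a save.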