[jira] [Created] (PDFBOX-4076) PDFBox cannot properly handle PDF Name objects containing bytes with values outside the US_ASCII range

Tilman Hausherr (JIRA) Sat, 20 Jan 2018 08:45:48 -0800

Tilman Hausherr created PDFBOX-4076:
---------------------------------------


             Summary: PDFBox cannot properly handle PDF Name objects containing 
bytes with values outside the US_ASCII range
                 Key: PDFBOX-4076
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4076
             Project: PDFBox
          Issue Type: Bug
            Reporter: Tilman Hausherr


As reported by ~mkl in a Stack Overflow answer:

{quote}The first error in PDF Name handling is that PDFBox internally 
represents them as strings after a mixed UTF-8 / CP-1252 decoding strategy. 
This is wrong, according to the PDF specification a name object is an atomic 
symbol uniquely defined by a sequence of any characters (8-bit values) except 
null (character code 0).

(...)

The second error is, though, that while serializing the PDF it only properly 
encodes the characters in the strings representing names which are from 
US_ASCII, all else are replaced by '?'{quote}

Sample code:

{code:java}
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);
document.getDocumentCatalog().getCOSObject().setString(COSName.getPDFName("äöüß"), "äöüß");
ByteArrayOutputStream baos = new ByteArrayOutputStream();
document.save(baos);
document.close();
document = PDDocument.load(baos.toByteArray());
System.out.println(document.getDocumentCatalog().getCOSObject().keySet());
document.close();
{code}
Output:


{noformat}
[COSName{Type}, COSName{Version}, COSName{Pages}, COSName{????}]
{noformat}
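
For comparison, the PDF specification (ISO 32000-1, section 7.3.5) allows any byte other than null in a name when it is written as a two-digit hexadecimal code preceded by '#'. The following is a minimal sketch of such a serialization, not PDFBox code and not necessarily the fix that should be adopted; the class name PdfNameEscaper and the method escapeName are hypothetical and exist only for illustration. It assumes the name is available as its raw bytes (here the UTF-8 bytes of the sample name).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the PDFBox API.
public class PdfNameEscaper {
    // Delimiter characters plus '#', all of which must be hex-escaped inside a name.
    private static final String MUST_ESCAPE = "()<>[]{}/%#";

    /** Writes raw name bytes as a PDF name token, e.g. "/A#20B". */
    public static String escapeName(byte[] nameBytes) {
        StringBuilder sb = new StringBuilder("/");
        for (byte b : nameBytes) {
            int c = b & 0xFF; // treat the byte as an unsigned value
            if (c == 0) {
                throw new IllegalArgumentException("null byte is not allowed in a PDF name");
            }
            // Printable ASCII ('!' to '~') that is neither a delimiter nor '#'
            // is written as itself; every other byte becomes #xx.
            boolean writeAsIs = c >= 0x21 && c <= 0x7E && MUST_ESCAPE.indexOf(c) < 0;
            if (writeAsIs) {
                sb.append((char) c);
            } else {
                sb.append('#').append(String.format("%02X", c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes for "äöüß" so the result does not depend on source file encoding.
        byte[] utf8 = "\u00e4\u00f6\u00fc\u00df".getBytes(StandardCharsets.UTF_8);
        System.out.println(escapeName(utf8)); // prints /#C3#A4#C3#B6#C3#BC#C3#9F
    }
}
```

Written this way, no byte of the name has to be replaced by '?', and a parser that keeps names as byte sequences can recover the original bytes exactly.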




