[ 
https://issues.apache.org/jira/browse/PDFBOX-4076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-4076:
------------------------------------
    Description: 
As reported by [~mkl] in his Stack Overflow answer
{quote}The first error in PDF Name handling is that PDFBox internally 
represents them as strings after a mixed UTF-8 / CP-1252 decoding strategy. 
This is wrong, according to the PDF specification a name object is an atomic 
symbol uniquely defined by a sequence of any characters (8-bit values) except 
null (character code 0).

(...)

The second error is, though, that while serializing the PDF it only properly 
encodes the characters in the strings representing names which are from 
US_ASCII, all else are replaced by '?'
{quote}
sample code
{code:java}
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);
document.getDocumentCatalog().getCOSObject().setString(COSName.getPDFName("äöüß"), "äöüß");
ByteArrayOutputStream baos = new ByteArrayOutputStream();
document.save(baos);
document.close();
document = PDDocument.load(baos.toByteArray());
System.out.println(document.getDocumentCatalog().getCOSObject().keySet());
document.close();
{code}
output:
{noformat}
[COSName{Type}, COSName{Version}, COSName{Pages}, COSName{????}]
{noformat}

  was:
As reported by ~mkl in SO answer

{quote}The first error in PDF Name handling is that PDFBox internally 
represents them as strings after a mixed UTF-8 / CP-1252 decoding strategy. 
This is wrong, according to the PDF specification a name object is an atomic 
symbol uniquely defined by a sequence of any characters (8-bit values) except 
null (character code 0).

(...)

The second error is, though, that while serializing the PDF it only properly 
encodes the characters in the strings representing names which are from 
US_ASCII, all else are replaced by '?'{quote}

sample code

{code:java}
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);
document.getDocumentCatalog().getCOSObject().setString(COSName.getPDFName("äöüß"),
 "äöüß");
ByteArrayOutputStream baos = new ByteArrayOutputStream();
document.save(baos);
document.close();
document = PDDocument.load(baos.toByteArray());
System.out.println(document.getDocumentCatalog().getCOSObject().keySet());
document.close();
{code}
output:


{noformat}
[COSName{Type}, COSName{Version}, COSName{Pages}, COSName{????}]
{noformat}



> PDFBox cannot properly handle PDF Name objects containing bytes with values 
> outside the US_ASCII range
> ------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4076
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4076
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.8
>            Reporter: Tilman Hausherr
>            Priority: Major
>
> As reported by [~mkl] in his Stack Overflow answer
> {quote}The first error in PDF Name handling is that PDFBox internally 
> represents them as strings after a mixed UTF-8 / CP-1252 decoding strategy. 
> This is wrong, according to the PDF specification a name object is an atomic 
> symbol uniquely defined by a sequence of any characters (8-bit values) except 
> null (character code 0).
> (...)
> The second error is, though, that while serializing the PDF it only properly 
> encodes the characters in the strings representing names which are from 
> US_ASCII, all else are replaced by '?'
> {quote}
> sample code
> {code:java}
> PDDocument document = new PDDocument();
> PDPage page = new PDPage();
> document.addPage(page);
> document.getDocumentCatalog().getCOSObject().setString(COSName.getPDFName("äöüß"),
>  "äöüß");
> ByteArrayOutputStream baos = new ByteArrayOutputStream();
> document.save(baos);
> document.close();
> document = PDDocument.load(baos.toByteArray());
> System.out.println(document.getDocumentCatalog().getCOSObject().keySet());
> document.close();
> {code}
> output:
> {noformat}
> [COSName{Type}, COSName{Version}, COSName{Pages}, COSName{????}]
> {noformat}
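
The two points in the quoted answer can be illustrated without PDFBox. The following is only a minimal sketch under stated assumptions, not the PDFBox implementation or a proposed patch: the class PdfNameEscapeSketch and its methods escapeName and unescapeName are hypothetical names invented for this illustration. It assumes the '?' characters in the output above come from encoding the name string as US_ASCII, which in Java replaces every unmappable character with '?'. It then treats the name as a plain byte sequence and writes each byte outside the printable ASCII range (as well as delimiters and '#') as a two-digit #xx hex escape, which is the escape form the PDF specification defines for name objects. The choice of UTF-8 for turning the string into bytes is also an assumption made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; none of these methods are PDFBox API.
public class PdfNameEscapeSketch {

    // Writes the raw bytes of a name as a PDF name token, using #xx escapes
    // for bytes outside '!'..'~', for delimiter characters and for '#'.
    public static String escapeName(byte[] nameBytes) {
        StringBuilder sb = new StringBuilder("/");
        for (byte b : nameBytes) {
            int c = b & 0xFF; // treat the byte as an unsigned value
            if (c == 0) {
                throw new IllegalArgumentException("null byte is not allowed in a PDF name");
            }
            boolean regular = c >= 0x21 && c <= 0x7E && "#/()<>[]{}%".indexOf(c) < 0;
            if (regular) {
                sb.append((char) c);
            } else {
                sb.append('#').append(String.format("%02X", c));
            }
        }
        return sb.toString();
    }

    // Reverses escapeName: turns a name token back into its raw bytes.
    public static byte[] unescapeName(String token) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 1; i < token.length(); i++) { // skip the leading '/'
            char ch = token.charAt(i);
            if (ch == '#') {
                out.write(Integer.parseInt(token.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(ch);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // "äöüß" written with Unicode escapes so the source file encoding does not matter
        String name = "\u00e4\u00f6\u00fc\u00df";

        // Encoding as US_ASCII loses the characters: each one becomes '?'
        byte[] ascii = name.getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // ????

        // Escaping the bytes keeps them intact (UTF-8 is an assumption here)
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
        String token = escapeName(utf8);
        System.out.println(token); // /#C3#A4#C3#B6#C3#BC#C3#9F
        System.out.println(Arrays.equals(utf8, unescapeName(token))); // true
    }
}
```

The sketch keeps the name as bytes from start to finish, so the round trip does not depend on guessing a character encoding when the name is read back.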



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
