[
https://issues.apache.org/jira/browse/PDFBOX-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278628#comment-15278628
]
John Hewson commented on PDFBOX-3347:
-------------------------------------
Note that parsing COSName as UTF-8 is correct. mkl makes good points about how
we compare and write out COSName (these should indeed be byte-based, like
COSString is in 2.0). But that's not the issue here. The SO user is not correct
in saying that the dictionary keys are ISO-8859-1 encoded. ISO-8859-1 is not
used anywhere in the PDF spec.
Looking at the {{Krematorier}} field (26th in the Fields array) we see an
appearance stream (AP > N > 1) with a raw name of {{/R#E5cksta}}. The {{#}} is
an escape character introducing a two-digit hex code, and PDFBox should be
parsing {{#E5}} as UTF-8, where it corresponds to U+00E5, which is {{å}}.
However, that's not happening.
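To make the byte-level question concrete, here is a minimal sketch (not PDFBox's parser; the class and method names are hypothetical) that resolves the {{#xx}} escapes of a raw name into bytes and then tries both decodings. It assumes the raw name is otherwise plain ASCII. One fact worth keeping in mind when reading the output: in UTF-8, U+00E5 is the two-byte sequence C3 A5, so a lone E5 byte followed by an ASCII letter is not a complete UTF-8 sequence, whereas in ISO-8859-1 the single byte E5 maps directly to U+00E5.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of PDFBox.
class NameEscapeSketch {

    // Resolve #xx escapes in a raw PDF name (leading slash already removed).
    // Assumes every character outside an escape is plain ASCII.
    static byte[] unescape(String raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '#' && i + 2 < raw.length()) {
                int hi = Character.digit(raw.charAt(i + 1), 16);
                int lo = Character.digit(raw.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo); // one byte from two hex digits
                    i += 2;                    // skip the two digits just consumed
                    continue;
                }
            }
            out.write(c); // not a valid escape: copy the character as a byte
        }
        return out.toByteArray();
    }

    // Decode as UTF-8, returning null instead of substituting U+FFFD
    // when the bytes are not well-formed UTF-8.
    static String decodeUtf8Strict(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] single = unescape("R#E5cksta");    // bytes: 52 E5 63 6B 73 74 61
        byte[] pair = unescape("R#C3#A5cksta");   // bytes: 52 C3 A5 63 6B 73 74 61

        System.out.println(decodeUtf8Strict(single));                        // null
        System.out.println(new String(single, StandardCharsets.ISO_8859_1)); // R + U+00E5 + cksta
        System.out.println(decodeUtf8Strict(pair));                          // R + U+00E5 + cksta
    }
}
```

The sketch keeps the name as bytes until the very last step, which is also the byte-based handling mkl argues for when comparing and writing names.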
> COSName parsing/writing interprets byte sequences as UTF-8 when parsing
> -----------------------------------------------------------------------
>
> Key: PDFBOX-3347
> URL: https://issues.apache.org/jira/browse/PDFBOX-3347
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing, Writing
> Affects Versions: 1.8.12, 2.0.1, 2.0.2
> Reporter: Maruan Sahyoun
> Priority: Minor
>
> As discussed here
> http://stackoverflow.com/questions/36964496/pdfbox-2-0-overcoming-dictionary-key-encoding/
> a byte sequence making up a COSName is interpreted during parsing and
> writing where it shouldn't be. Details are given in mkl's excellent analysis.