[
https://issues.apache.org/jira/browse/PDFBOX-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243685#comment-14243685
]
John Hewson commented on PDFBOX-1242:
-------------------------------------
OK, so you're effectively doing WinAnsiEncoding plus some automated
transliteration for other languages.
Looking at the code for drawString, one key problem is that COSString should
not be used at all: it is meant for the "text strings" and "byte strings" that
appear in PDF dictionaries, not for the "content strings" that appear in the
content stream. That is why the BOM shows up in the text as "þÿ". The other
problem is that the font's encoding needs to be used; see PDFBOX-922.
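To make the second point concrete, here is a minimal sketch of encoding a content string through a single-byte encoding and rejecting anything that encoding cannot represent. It is not PDFBox code: the class and method names are hypothetical, and a plain java.nio Charset (ISO-8859-1 in the example) only stands in for whatever encoding the font actually declares.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
class ContentStringSketch {

    // Encodes text with the given charset (an assumed stand-in for the font's
    // encoding) and returns it as a PDF hex string followed by the Tj operator.
    // No BOM is written, because this is a content string, not a text string.
    static String showTextOperator(String text, Charset fontCharset) {
        CharsetEncoder encoder = fontCharset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer bytes;
        try {
            bytes = encoder.encode(CharBuffer.wrap(text));
        } catch (CharacterCodingException e) {
            // Fail loudly instead of silently writing wrong glyph codes.
            throw new IllegalArgumentException(
                    "Text cannot be represented in " + fontCharset.name(), e);
        }
        StringBuilder sb = new StringBuilder("<");
        while (bytes.hasRemaining()) {
            sb.append(String.format("%02X", bytes.get() & 0xFF));
        }
        return sb.append("> Tj").toString();
    }

    public static void main(String[] args) {
        // 'A' and U+00E9 both exist in ISO-8859-1, so this encodes cleanly.
        System.out.println(showTextOperator("A\u00e9", StandardCharsets.ISO_8859_1));
    }
}
```

With this shape, a character such as U+20AC, which ISO-8859-1 cannot encode, raises an exception rather than ending up as stray bytes in the content stream.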
> Handle non ISO-8859-1 chars with drawString
> -------------------------------------------
>
> Key: PDFBOX-1242
> URL: https://issues.apache.org/jira/browse/PDFBOX-1242
> Project: PDFBox
> Issue Type: Bug
> Components: Writing
> Affects Versions: 1.5.0, 1.6.0
> Reporter: Peter Andersen
> Assignee: John Hewson
> Fix For: 2.0.0
>
>
> PDPageContentStream.drawString takes a String as its argument and constructs a
> COSString from the input.
> If the input contains chars above 255, the COSString is prefixed with 0xFE, 0xFF
> and the bytes are taken from the
> input encoded as "UTF-16BE".
> Back in the drawString method, this UTF-16 encoded COSString is appended as
> "ISO-8859-1":
> appendRawCommands( new String( buffer.toByteArray(), "ISO-8859-1"));
>
> The result is that a line with UTF-16 chars is shown prefixed with þÿ,
> and with a double space between the other chars.
> The chars above 255 are shown as the two corresponding ISO-8859-1 characters.
> As a side question to this observation, is there an alternative way to use
> PDFBox to support UTF-16?
>
>
>
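The symptom in the quoted description can be reproduced without PDFBox. The sketch below only mimics the two steps the report describes (BOM plus UTF-16BE bytes, then decoding as ISO-8859-1); the class and method names are hypothetical and this is not the library's actual code path.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical reproduction of the reported behaviour, not PDFBox code.
class BomSymptomSketch {

    // Writes the 0xFE 0xFF prefix and the UTF-16BE bytes of the input,
    // then reads the whole buffer back as ISO-8859-1, as the report describes.
    static String misdecode(String input) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        buffer.write(0xFE);
        buffer.write(0xFF);
        byte[] utf16 = input.getBytes(StandardCharsets.UTF_16BE);
        buffer.write(utf16, 0, utf16.length);
        return new String(buffer.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // 'A' followed by U+0142, a char above 255.
        String shown = misdecode("A\u0142");
        // Print the code points so that control characters stay visible.
        shown.chars().forEach(c -> System.out.printf("U+%04X%n", c));
    }
}
```

Every input char becomes two chars in the result. For chars below 256 the first of the pair is U+0000, which would account for the extra spacing between characters, and a char above 255 such as U+0142 turns into the pair U+0001 and U+0042.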