[ 
https://issues.apache.org/jira/browse/PDFBOX-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243706#comment-14243706
 ] 

John Hewson edited comment on PDFBOX-1242 at 12/12/14 4:51 AM:
---------------------------------------------------------------

This rather large commit removes the usage of COSString when handling content 
streams. It also overhauls the internals of COSString, restricting where and 
when encoding or decoding occurs. The PDF spec actually defines three types of 
string, "text strings", "ASCII strings" and "byte strings", all of which 
COSString now handles. The previous code assumed that all strings were "text 
strings", which are encoded in PDFDocEncoding or UTF-16BE, but this is not the 
case.
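
As a rough illustration of why decoding has to be restricted to "text strings", 
the sketch below decodes a byte array the way a text string is read: a leading 
0xFE 0xFF marks UTF-16BE, and anything else is treated as a single-byte 
encoding. The class and method names are hypothetical and this is not the 
COSString implementation. ISO-8859-1 is used only as a stand-in for 
PDFDocEncoding, which differs from it for some byte values.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
final class TextStringSketch {

    // A text string encoded as UTF-16BE starts with the byte order mark FE FF.
    static boolean hasUtf16BeBom(byte[] bytes) {
        return bytes.length >= 2
                && (bytes[0] & 0xFF) == 0xFE
                && (bytes[1] & 0xFF) == 0xFF;
    }

    // Decodes bytes that are known to be a "text string".
    static String decodeTextString(byte[] bytes) {
        if (hasUtf16BeBom(bytes)) {
            // Skip the two BOM bytes and decode the rest as UTF-16BE.
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: ISO-8859-1 approximates PDFDocEncoding for this sketch.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

A "byte string" should not go through a method like this at all, since its 
bytes carry no character encoding of their own.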

However, the modified drawString method still uses ISO-8859-1 and needs to be 
changed to use the font's encoding. For simple encodings this is being 
addressed in PDFBOX-922, and for full Unicode in PDFBOX-2524. I'm going to 
resolve this issue as "Fixed", as work was done here, but the remaining work 
will be done in those other two issues.


> Handle non ISO-8859-1 chars with drawString
> -------------------------------------------
>
>                 Key: PDFBOX-1242
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1242
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Writing
>    Affects Versions: 1.5.0, 1.6.0
>            Reporter: Peter Andersen
>            Assignee: John Hewson
>             Fix For: 2.0.0
>
>
> PDPageContentStream.drawString takes a String as its argument and constructs 
> a COSString from the input.
> If the input contains chars above 255, the COSString is prefixed with 0xFE, 
> 0xFF and the bytes are taken from the input encoded as "UTF-16BE".
> Back in the drawString method, this UTF-16 encoded COSString is appended as 
> "ISO-8859-1":
>       appendRawCommands( new String( buffer.toByteArray(), "ISO-8859-1"));
>  
> The result is that a line with UTF-16 chars is shown prefixed with þÿ, 
> and with a double space between the other chars.
> The chars above 255 are shown as the two corresponding ISO-8859-1 characters.
> As a side question to this observation, is there an alternative way to use 
> PDFBox to support UTF-16?
>  
>  
>  
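
The symptom in the quoted report can be reproduced with plain JDK classes, 
without PDFBox. The sketch below uses hypothetical helper names. It prefixes 
the UTF-16BE bytes of a string with 0xFE 0xFF and then reads the result back 
as ISO-8859-1, which is what the quoted appendRawCommands call does with the 
buffer. The gaps between characters in the report are presumably the 0x00 
high bytes of the UTF-16 code units.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PDFBox.
final class Utf16AsLatin1Demo {

    // Mimics the reported encoding step: BOM FE FF followed by UTF-16BE bytes.
    static byte[] encodeWithBom(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0xFE);
        out.write(0xFF);
        byte[] body = text.getBytes(StandardCharsets.UTF_16BE);
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    // Mimics the reported bug: every byte becomes one ISO-8859-1 character.
    static String misreadAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "A" followed by the euro sign (U+20AC), a char above 255.
        String garbled = misreadAsLatin1(encodeWithBom("A\u20AC"));
        // Prints U+00FE, U+00FF, U+0000, U+0041, U+0020, U+00AC, one per line.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The first two characters are þ and ÿ, and the euro sign comes out as the two 
ISO-8859-1 characters for 0x20 and 0xAC, matching the description above.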



