[jira] [Commented] (PDFBOX-283) Character encoding/appearance issues when filling forms

Marco Primiceri (JIRA) Tue, 08 Jul 2014 02:49:48 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054707#comment-14054707
 ]


Marco Primiceri commented on PDFBOX-283:
----------------------------------------

Hi [~tilman]

Thanks a lot for your quick reply!
I apologise for the inconvenience, but I realised too late that the change I 
suggested would have no effect: any tag representing a newline would be 
converted to a hex string along with the rest of the value, and hence would not 
be emitted as operators.

{code}
    printWriter.println("<" + new COSString(value).getHexString() + "> Tj");
{code}
where *value* contains the multi-line conversion mentioned in my previous post
{code}
    result.append(" > Tj\n0 -13 Td\n<");
{code}
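
To illustrate the problem, here is a minimal, self-contained sketch. It does not use PDFBox: *toHex()* is a hypothetical stand-in for *COSString.getHexString()* and assumes a single-byte ISO-8859-1 encoding, and the class and method names are made up for the example. It shows that a newline tag embedded in *value* ends up as hex digits inside one string, so only a single Tj is written and no Td operator appears.

```java
import java.nio.charset.StandardCharsets;

class HexEncodingDemo {
    // Hypothetical stand-in for new COSString(value).getHexString();
    // assumes every character fits in one ISO-8859-1 byte.
    static String toHex(String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.ISO_8859_1)) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    // Mirrors the println above: the whole value becomes one hex string.
    static String singleShow(String value) {
        return "<" + toHex(value) + "> Tj";
    }

    public static void main(String[] args) {
        // The embedded tag is hex-encoded too, so it never acts as operators.
        System.out.println(singleShow("A > Tj\n0 -13 Td\n<B"));
    }
}
```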


* To solve this issue, the method *insertGeneratedAppearance()* would need to 
tokenize the value on the newline tag applied during the multi-line conversion 
and then print the hex string for each line.
This solution is not ideal, as it splits the value on the newline tag, but I 
have implemented it anyway for completeness.
Please see attached diff file 
[*PDAppearance.diff*|https://docs.google.com/file/d/0B78-Rnr4JQ8AN0xpaE94bkZVakE/edit?pli=1]

* In my opinion a better solution would be to apply the multi-line conversion 
only when printing the value in *insertGeneratedAppearance()* rather than 
storing the converted string.
Basically, I noticed that the multi-line conversion happens at the beginning of 
*setAppearanceValue()*, but it does not really need to be there: the newline 
tag ("> Tj\n0 -13 Td\n<") is only relevant when printing out *value* in 
*insertGeneratedAppearance()*. It might also have a negative impact on the 
font size calculation, which is based on the value's length.
I have written a second, cleaner solution that does the multi-line conversion 
on the fly when printing out *value*: please see attached diff file 
[*PDAppearance_bis.diff*|https://docs.google.com/file/d/0B78-Rnr4JQ8Ac1JiZE1DOUpUTlE/edit?pli=1]
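
The on-the-fly idea can be sketched roughly as follows. This is not the content of the attached diff, only an illustration under assumptions: *toHex()* is again a hypothetical stand-in for *COSString.getHexString()* (ISO-8859-1 only), *showLines()* and the class name are invented for the example, and the line offset is passed in as a parameter, with 13 matching the tag quoted above.

```java
import java.nio.charset.StandardCharsets;

class MultiLineAppearance {
    // Hypothetical stand-in for new COSString(line).getHexString().
    static String toHex(String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.ISO_8859_1)) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    // Splits the raw value on line breaks only at print time and writes
    // one hex string per line, moving down by 'leading' units in between.
    static String showLines(String value, int leading) {
        StringBuilder out = new StringBuilder();
        String[] lines = value.split("\r\n|\r|\n", -1);
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                out.append("0 -").append(leading).append(" Td\n");
            }
            out.append('<').append(toHex(lines[i])).append("> Tj\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(showLines("AB\nC", 13));
    }
}
```

Because the stored value stays unconverted in this sketch, anything computed from its length sees only the user's text and none of the operator characters.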

I have tested a snapshot build of each solution and both work fine.
Please let me know if you need more details about this.

Kind Regards,
Marco

> Character encoding/appearance issues when filling forms
> -------------------------------------------------------
>
>                 Key: PDFBOX-283
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-283
>             Project: PDFBox
>          Issue Type: Bug
>          Components: AcroForm
>         Attachments: PDAppearance.patch
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1735902
> Originally submitted by scop on 2007-06-12 10:23.
> When filling a text field with non-ASCII characters such as in my surname 
> "Skyttä" and saving the document in a UTF-8 environment, something goes 
> wrong with the appearance of the text.
> The value itself seems to be stored correctly, but when opening the doc, the 
> appearance of "ä" is not that, but rather something which happens when UTF-8 
> is mistakenly treated as ISO-8859-1 (two garbage characters).
> PDAppearance uses the platform default encoding in quite a few places which 
> apparently has potential to mess things up.  In particular, 
> insertGeneratedAppearance() generates a PrintWriter from an OutputStream 
> without specifying the encoding.  In fact, if I hack that to use ISO-8859-1, 
> the appearance of my "ä" case is correct, but that won't obviously work with 
> anything else than chars that are valid ISO-8859-1.
> In which char encoding should the value be written to the appearance stream 
> (at end of insertGeneratedAppearance())?
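
The platform-default-encoding effect described in the quoted report can be reproduced without PDFBox. The sketch below is only an illustration: the class and method names are hypothetical, and it does not answer which encoding the appearance stream should use. It writes the character U+00E4 through a PrintWriter with an explicit charset, showing that UTF-8 produces two bytes which read back as two garbage characters under ISO-8859-1.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Writes text through a PrintWriter bound to an explicit charset,
    // instead of relying on the platform default encoding.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        PrintWriter writer = new PrintWriter(new OutputStreamWriter(out, charset));
        writer.print(text);
        writer.flush();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String aUmlaut = "\u00e4";
        byte[] utf8 = write(aUmlaut, StandardCharsets.UTF_8);
        byte[] latin1 = write(aUmlaut, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length + " byte(s) as UTF-8");
        System.out.println(latin1.length + " byte(s) as ISO-8859-1");
        // Reading the UTF-8 bytes as ISO-8859-1 yields two characters.
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1).length());
    }
}
```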



--
This message was sent by Atlassian JIRA
(v6.2#6252)
