[jira] [Commented] (PDFBOX-5358) Add support for UTF-8 in strings

Tilman Hausherr (Jira) Sat, 15 Jan 2022 11:13:06 -0800


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476680#comment-17476680
 ]


Tilman Hausherr commented on PDFBOX-5358:
-----------------------------------------

Proposed change: add this check at the beginning of COSString.toString, so that strings starting with the UTF-8 byte order mark (EF BB BF) are decoded as UTF-8:
{code}
        if (bytes.length >= 3 &&
                (bytes[0] & 0xff) == 0xEF && (bytes[1] & 0xff) == 0xBB && (bytes[2] & 0xff) == 0xBF)
        {
            return new String(bytes, 3, bytes.length - 3, StandardCharsets.UTF_8);
        }
{code}
 !screenshot-1.png! 
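
For anyone who wants to try the check outside of PDFBox, here is a minimal stand-alone sketch of the same idea. The class name Utf8BomSketch and its methods are hypothetical and not part of PDFBox. The ISO-8859-1 fallback is only a placeholder assumption for whatever decoding COSString already does when no UTF-8 byte order mark is present.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone helper for experimenting; not part of PDFBox.
final class Utf8BomSketch
{
    private Utf8BomSketch()
    {
    }

    // Returns true if the array starts with the UTF-8 byte order mark EF BB BF.
    // The "& 0xff" masks are needed because Java bytes are signed.
    static boolean hasUtf8Bom(byte[] bytes)
    {
        return bytes.length >= 3
                && (bytes[0] & 0xff) == 0xEF
                && (bytes[1] & 0xff) == 0xBB
                && (bytes[2] & 0xff) == 0xBF;
    }

    // Decodes the bytes as UTF-8 (skipping the 3-byte mark) when the mark is present.
    static String decode(byte[] bytes)
    {
        if (hasUtf8Bom(bytes))
        {
            return new String(bytes, 3, bytes.length - 3, StandardCharsets.UTF_8);
        }
        // Assumption: ISO-8859-1 is only a placeholder for the existing fallback decoding.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

With this sketch, the bytes EF BB BF C3 A9 should decode to a single "é", while C3 A9 without the mark goes through the placeholder fallback and yields two characters.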

> Add support for UTF-8 in strings
> --------------------------------
>
>                 Key: PDFBOX-5358
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5358
>             Project: PDFBox
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: Screen Shot 2022-01-06 at 9.18.09 AM.png, 
> screenshot-1.png
>
>
> Peter Wyatt recently published an article on UTF-8 strings in PDF 2.0: 
> [https://www.pdfa.org/understanding-utf-8-in-pdf-2-0/]
> The article includes a link to a test file he created: 
> [https://github.com/pdf-association/pdf20examples/blob/master/pdf20-utf8-test.pdf]
>  
> Our debugger shows that we may need to add support for this (see attached).  
> This was with PDFBox 2.0.25.  I didn't have a chance to test with 3.x or the 
> 2.x snapshot.
> I don't think we're necessarily covering all the changes yet in PDF 2.0, but 
> I thought I'd open this issue for at least discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

