Andrea Vacondio created PDFBOX-6168:
---------------------------------------

             Summary: Support UTF-8 encoded strings as specified in PDF 2.0
                 Key: PDFBOX-6168
                 URL: https://issues.apache.org/jira/browse/PDFBOX-6168
             Project: PDFBox
          Issue Type: Improvement
          Components: IO, Text extraction
            Reporter: Andrea Vacondio
         Attachments: utf8.patch

PDF 2.0 allows text strings to be encoded in UTF-8. Such a string is 
identified by a new byte order mark (BOM): it starts with the three-byte 
sequence 239, 187, 191 (0xEF, 0xBB, 0xBF).
There is also a test file in which the outline names and info dictionary 
items are UTF-8 encoded: 
https://github.com/pdf-association/pdf20examples/blob/master/pdf20-utf8-test.pdf
The patch is very simple, but it may be worth adding some tests.
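
For illustration only, here is a minimal sketch of the BOM check described 
above. It is not the attached utf8.patch and not PDFBox code: the class and 
method names are hypothetical, and ISO-8859-1 is used as a rough stand-in for 
PDFDocEncoding, which differs from it for some byte values.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of PDFBox or of utf8.patch. */
final class PdfTextStringSketch {

    private PdfTextStringSketch() {
    }

    /** Returns true if the bytes start with the UTF-8 BOM 0xEF 0xBB 0xBF. */
    static boolean hasUtf8Bom(byte[] bytes) {
        return bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
    }

    /** Decodes a text string by looking at its leading byte order mark. */
    static String decode(byte[] bytes) {
        if (hasUtf8Bom(bytes)) {
            // PDF 2.0: skip the three BOM bytes and decode the rest as UTF-8.
            return new String(bytes, 3, bytes.length - 3, StandardCharsets.UTF_8);
        }
        if (bytes.length >= 2
                && (bytes[0] & 0xFF) == 0xFE
                && (bytes[1] & 0xFF) == 0xFF) {
            // Pre-2.0 behaviour: a 0xFE 0xFF prefix marks UTF-16BE.
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: ISO-8859-1 approximates PDFDocEncoding in this sketch only.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

A unit test along these lines could pass in byte arrays with and without the 
BOM, including arrays shorter than three bytes, and compare the decoded text.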



