Andrea Vacondio created PDFBOX-6168:
---------------------------------------
Summary: Support UTF-8 encoded strings as specified in PDF 2.0
Key: PDFBOX-6168
URL: https://issues.apache.org/jira/browse/PDFBOX-6168
Project: PDFBox
Issue Type: Improvement
Components: IO, Text extraction
Reporter: Andrea Vacondio
Attachments: utf8.patch
PDF 2.0 allows text strings to be encoded as UTF-8. Such a string is
identified by a leading byte order mark (BOM): the three-byte sequence
239, 187, 191 (0xEF, 0xBB, 0xBF).
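
As a rough illustration of the BOM check described above, here is a minimal Java sketch. It is not the attached utf8.patch and not PDFBox's actual code: the class PdfTextStringSketch and the method decodeTextString are hypothetical names, and ISO-8859-1 is used only as a stand-in for PDFDocEncoding, which the JDK does not provide.

```java
import java.nio.charset.StandardCharsets;

public class PdfTextStringSketch {

    // Hypothetical helper: decodes the raw bytes of a PDF text string
    // by inspecting the leading byte order mark, if any.
    static String decodeTextString(byte[] bytes) {
        // UTF-8 BOM introduced by PDF 2.0: 0xEF 0xBB 0xBF
        if (bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF) {
            // Skip the 3-byte BOM and decode the rest as UTF-8.
            return new String(bytes, 3, bytes.length - 3, StandardCharsets.UTF_8);
        }
        // UTF-16BE BOM: 0xFE 0xFF
        if (bytes.length >= 2
                && (bytes[0] & 0xFF) == 0xFE
                && (bytes[1] & 0xFF) == 0xFF) {
            // Skip the 2-byte BOM and decode the rest as UTF-16BE.
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: ISO-8859-1 stands in for PDFDocEncoding here;
        // a real implementation needs the PDFDocEncoding table.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 0xC3, (byte) 0xA9};
        System.out.println(decodeTextString(utf8)); // a single 'é' (U+00E9)
    }
}
```

The UTF-8 check comes first only for readability; the two BOMs cannot collide because they differ in their first byte.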
There is also a test file in which the outline names and info dictionary
items are UTF-8 encoded:
https://github.com/pdf-association/pdf20examples/blob/master/pdf20-utf8-test.pdf
The patch is very simple, but it may be worth adding some tests.