Double byte Unicode Char incorrect when parsed

Fung Cheung Mon, 17 Feb 2020 01:48:09 -0800

Hello,

We have been using FOP for HTML-to-PDF generation, and I recently noticed
that when certain Unicode characters are included, the text extracted from
the PDF output is wrong.


How we use it (FOP 2.4):
- We have an HTML string
- final InputSource source = new InputSource(new ByteArrayInputStream(htmlString.getBytes()));

FopFactoryBuilder builder = new FopFactoryBuilder(
        URI.create(resourceLoader.getResource(resourceBasePath).getURI().toString()),
        new ClasspathResolverURIAdapter());
builder.setConfiguration(configuration);

FopFactory factory = builder.build();
userAgent = factory.newFOUserAgent();
userAgent.setAuthor("Indeed");
userAgent.setCreator("Indeed");
userAgent.setTitle("Indeed");
userAgent.setKeywords("Indeed");

fop = factory.newFop(MimeConstants.MIME_PDF, userAgent, outputStream);

// Set up CSSToXSLFO to transform the XHTML input into XSL-FO
final URL baseUrl = resourceLoader.getResource(resourceBasePath).getURL();
Loggers.debug(LOGGER, "Parsing HTML response using base URL '%s'", baseUrl);
final XMLReader xmlParser = Util.getParser(null, isValidatingParser);
final ProtectEventHandlerFilter eventHandlerFilter =
        new ProtectEventHandlerFilter(true, true, xmlParser);

final XMLReader filter =
        new CSSToXSLFOFilter(
                baseUrl,
                null,
                Collections.EMPTY_MAP,
                eventHandlerFilter,
                cssToXslFoDebugEnabled);

filter.setEntityResolver(classPathEntityResolver);
filter.setContentHandler(fop.getDefaultHandler());
filter.parse(source);


This produces a PDF in which every character is displayed correctly. That
is, it looks correct to a human reader.

We also have a use case of reading the PDF programmatically. We are testing
this by selecting the text in Adobe Reader, then copying and pasting it. The
pasted text matches what parsing tools such as pdftotext and PDFBox extract.

However, when the document contains many Unicode characters, three things
happen when we copy:
1) Some Unicode characters are copied as other, seemingly random characters
e.g. source: 😂😂😂😂😂 🃋🃋🃋🃋🃋 🃋🃋🃋 jack 𝍐𝍐𝍐 3 chars 𝄞𝄞 2 music
majhog : 🀤 🀤 🀤
copy output: 😂😂😂😂😂 33333 333🃋 jack🃋 555   chars🃋 𝍐 2 music🃋
majhog 8 3 3 3

2) Characters show up in the wrong location
e.g. In the example above, the next page of the PDF does not contain 🃋.
However, when copying, "🃋🃋🃋" showed up somewhere on that next page.

3) Some fonts produce corrupted PDF output. We were trying out mathematical
bold letters, e.g. "𝐏𝐫𝐨𝐟𝐢𝐥𝐞"
<https://www.fileformat.info/info/unicode/char/1d40f/fontsupport.htm>.
Embedding the Symbola font with embedding-mode="full" fixed the rendering,
and the PDF now looks correct. However, copying "𝐏𝐫𝐨𝐟𝐢𝐥𝐞" gave
"퐏퐫퐨퐟퐢퐥퐞". Comparing code points, the capital P is U+1D40F
<https://codepoints.net/U+1D40F>, and the Korean character that comes out is
U+D40F <https://codepoints.net/U+D40F>. The leading 1 is missing, i.e. only
the low 16 bits of the code point survive.
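
To make the U+1D40F versus U+D40F comparison reproducible, here is a minimal Java sketch. It only illustrates the symptom; whether FOP actually drops the upper bits this way is an assumption, not something confirmed here. The class and method names (TruncationDemo, truncateTo16Bits, describe) are made up for illustration.

```java
class TruncationDemo {
    // Hypothetical helper: keeps only the low 16 bits of a code point,
    // mimicking a mapping that stores each code point in one 16-bit unit.
    static int truncateTo16Bits(int codePoint) {
        return codePoint & 0xFFFF;
    }

    // Formats every code point of a string as U+XXXX so that source text
    // and copied text can be compared side by side.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // MATHEMATICAL BOLD CAPITAL P, the first character of the test word
        String source = new String(Character.toChars(0x1D40F));
        int truncated = truncateTo16Bits(source.codePointAt(0));
        String copied = new String(Character.toChars(truncated));
        // Prints: U+1D40F -> U+D40F
        System.out.println(describe(source) + " -> " + describe(copied));
    }
}
```

Running describe() on the pasted text from Adobe Reader and on the original HTML string should show whether every wrong character differs from its source by exactly this kind of truncation.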

I have searched extensively without finding an answer. It seems to be
related to how FOP builds the ToUnicode CMap from a font file. As a
cross-check, I produced the same PDF using WeasyPrint (a Python library),
and there all characters copy correctly.
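
For reference, the PDF specification encodes the Unicode values in a ToUnicode CMap as UTF-16BE, so a character above U+FFFF needs a surrogate pair (four bytes) rather than a single 16-bit value. The sketch below computes the hex string one would expect to see for a given code point when inspecting the CMap in the generated PDF. The names ToUnicodeHex and utf16beHex are hypothetical, and this says nothing about how FOP writes the CMap internally.

```java
import java.nio.charset.StandardCharsets;

class ToUnicodeHex {
    // Returns the UTF-16BE bytes of a code point as upper-case hex,
    // i.e. the form used for Unicode values in a ToUnicode CMap entry.
    static String utf16beHex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_16BE);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Supplementary-plane character: surrogate pair, prints <D835DC0F>
        System.out.println("<" + utf16beHex(0x1D40F) + ">");
        // BMP character: single 16-bit unit, prints <D40F>
        System.out.println("<" + utf16beHex(0xD40F) + ">");
    }
}
```

If the CMap in the FOP-generated PDF maps the glyph for "𝐏" to <D40F> instead of <D835DC0F>, that would match the copy-and-paste result described above.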

Are we using FOP incorrectly? Are there tweaks we can do to fix it?

Thanks so much!
