Matthew Self created PDFBOX-3975:
------------------------------------
Summary: ExtractText converts some diacritics to combining forms
that don't get combined
Key: PDFBOX-3975
URL: https://issues.apache.org/jira/browse/PDFBOX-3975
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 2.0.7
Reporter: Matthew Self
When I run ExtractText on the file
http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf,
the "^" character on page 15 is extracted incorrectly.
The extracted text is "special characters ( * ! & } ̂ % and so on ) . )".
Note that the extracted "^" character is U+0302 (COMBINING CIRCUMFLEX ACCENT)
when it ought to be plain old U+005E (CIRCUMFLEX ACCENT).
I believe that what is happening is the original U+005E character is being
converted to U+0302 by the DIACRITICS map in TextPosition.java:
map.put(0x005e, "\u0302");
This probably happens because the character slightly overlaps the preceding
space character and is therefore treated as a diacritic. But the combining
diacritic can't be combined with a space character, so the extracted text
contains the combining character instead of the original.
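To illustrate why the combining form survives, here is a minimal standalone sketch. It does not use PDFBox; the class and method names are hypothetical, and it assumes NFC composition via java.text.Normalizer. A letter followed by U+0302 composes to a single code point, while a space followed by U+0302 does not.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of PDFBox.
final class CombiningDemo {
    /** Returns the NFC form of a base character followed by a combining mark. */
    static String compose(char base, char combiningMark) {
        return Normalizer.normalize("" + base + combiningMark, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'e' + U+0302 composes to U+00EA (LATIN SMALL LETTER E WITH CIRCUMFLEX).
        System.out.println(compose('e', '\u0302').length()); // 1
        // ' ' + U+0302 has no precomposed form, so both characters remain.
        System.out.println(compose(' ', '\u0302').length()); // 2
    }
}
```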
One solution would be to tighten up the detection of overlaps so that
combineDiacritic() is not called in this instance.
Another (perhaps more robust) solution would be to verify in combineDiacritic()
that the call to Normalizer.normalize() actually does combine the combining
form of the diacritic with the previous character. If the result of calling
Normalizer.normalize() has more than one character in it, then the diacritic
must not have been combined with the previous character. In that case, the
diacritic should not be merged.
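A rough sketch of that check, written as a standalone helper rather than a patch to combineDiacritic(). The names combinesWith and mergeOrKeep are hypothetical, and the use of NFC here is my assumption, not necessarily what TextPosition does internally.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox.
final class DiacriticMergeCheck {
    /** True if NFC collapses previous + combiningMark into one code point. */
    static boolean combinesWith(String previous, String combiningMark) {
        String composed = Normalizer.normalize(previous + combiningMark, Normalizer.Form.NFC);
        return composed.codePointCount(0, composed.length()) == 1;
    }

    /**
     * Merges the diacritic only when it really combines; otherwise keeps
     * the original (spacing) character after the previous text.
     */
    static String mergeOrKeep(String previous, String original, String combiningMark) {
        if (combinesWith(previous, combiningMark)) {
            return Normalizer.normalize(previous + combiningMark, Normalizer.Form.NFC);
        }
        return previous + original;
    }

    public static void main(String[] args) {
        System.out.println(mergeOrKeep("e", "^", "\u0302")); // composed U+00EA
        System.out.println(mergeOrKeep(" ", "^", "\u0302")); // space followed by plain "^"
    }
}
```

One caveat with this criterion: a base letter with no precomposed form (for example "x" followed by U+0302) also normalizes to two code points, so it would be left unmerged even though the mark attaches to it visually. Whether that is acceptable is a judgment call.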
The goal would be for the extracted text to never contain combining characters
that failed to combine.
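A regression test for this goal could scan the extracted text for combining marks that have nothing to attach to. The following is only a sketch with hypothetical names; it treats a mark at the start of the text or directly after whitespace as "failed to combine", which is my own simplification.

```java
// Hypothetical checker, not part of PDFBox.
final class StrayCombiningMarks {
    /** True for code points in Unicode general categories Mn, Mc or Me. */
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    /** True if a combining mark starts the text or directly follows whitespace. */
    static boolean hasStrayMark(String text) {
        int previous = -1; // -1 means "no previous code point yet"
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isCombiningMark(cp) && (previous == -1 || Character.isWhitespace(previous))) {
                return true;
            }
            previous = cp;
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasStrayMark("& } \u0302 % and so on")); // true
        System.out.println(hasStrayMark("& } ^ % and so on"));      // false
    }
}
```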
P.S. Thank you for PDFBox. It is a great library!