[jira] [Created] (PDFBOX-3975) ExtractText converts some diacritics to combining forms that don't get combined

Matthew Self (JIRA) Sat, 21 Oct 2017 16:22:41 -0700

Matthew Self created PDFBOX-3975:
------------------------------------

             Summary: ExtractText converts some diacritics to combining forms that don't get combined
                 Key: PDFBOX-3975
                 URL: https://issues.apache.org/jira/browse/PDFBOX-3975
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.7
            Reporter: Matthew Self



When I use ExtractText on the file 
http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf, 
there is an issue with the "^" character on page 15.

The extracted text is "special characters ( * ! & } ̂  % and so on ) . )".

Note that the extracted "^" character is U+0302 (COMBINING CIRCUMFLEX ACCENT) 
when it ought to be plain old U+005E (CIRCUMFLEX ACCENT).

I believe that what is happening is the original U+005E character is being 
converted to U+0302 by the DIACRITICS map in TextPosition.java:

        map.put(0x005e, "\u0302");

This is probably because the character slightly overlaps the preceding space 
character.  However, the combining diacritic cannot be combined with a space 
character, so the extracted text contains the combining character instead of 
the original.
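
The effect can be reproduced outside PDFBox.  The following is a minimal
sketch, assuming NFC normalization (the form PDFBox actually uses is not
checked here); the class and helper names are hypothetical.  It shows that
U+0302 composes with a letter but not with a space:

```java
import java.text.Normalizer;

public class UncombinedCircumflexDemo {
    // Formats each code point of s as U+XXXX, separated by single spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Combining form that the quoted map entry substitutes for U+005E.
        String combining = "\u0302";

        // Letter base: NFC finds a precomposed character.
        String onLetter = Normalizer.normalize("e" + combining, Normalizer.Form.NFC);
        System.out.println(codePoints(onLetter)); // prints "U+00EA"

        // Space base: no precomposed character exists, so the mark is left over.
        String onSpace = Normalizer.normalize(" " + combining, Normalizer.Form.NFC);
        System.out.println(codePoints(onSpace)); // prints "U+0020 U+0302"
    }
}
```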

One solution would be to tighten up the detection of overlaps so that 
combineDiacritic() is not called in this instance.

Another (perhaps more robust) solution would be to verify in combineDiacritic() 
that the call to Normalizer.normalize() actually does combine the combining 
form of the diacritic with the previous character.  If the result of calling 
Normalizer.normalize() has more than one character in it, then the diacritic 
must not have been combined with the previous character.  In that case, the 
diacritic should not be merged.
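
A rough sketch of that check follows.  It is not the PDFBox implementation:
combineOrKeep() and its parameters are hypothetical, NFC is an assumption,
and comparing code point counts is only a heuristic for "did the mark
combine".

```java
import java.text.Normalizer;

public class GuardedCombine {
    /**
     * Hypothetical helper, not part of PDFBox.
     * Returns the composed text if the combining form merged into the base;
     * otherwise returns the base followed by the original character.
     */
    static String combineOrKeep(String base, String original, String combiningForm) {
        String normalizedBase = Normalizer.normalize(base, Normalizer.Form.NFC);
        String composed = Normalizer.normalize(base + combiningForm, Normalizer.Form.NFC);

        // If the mark combined, the result is no longer than the base alone.
        boolean combined = composed.codePointCount(0, composed.length())
                <= normalizedBase.codePointCount(0, normalizedBase.length());

        return combined ? composed : base + original;
    }

    public static void main(String[] args) {
        // "e" + circumflex composes to U+00EA.
        System.out.println(combineOrKeep("e", "^", "\u0302"));
        // Space + circumflex does not compose, so the plain U+005E is kept.
        System.out.println(combineOrKeep(" ", "^", "\u0302"));
    }
}
```

With a guard like this, the page 15 case would keep the plain U+005E after
the space instead of emitting U+0302.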

The goal would be for the extracted text to never contain combining characters 
that failed to combine.
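
One way to test for that goal is to scan the extracted text for combining
marks that have nothing to attach to.  This is only a sketch under my own
assumptions: the class and method names are hypothetical, and "stray" is
taken to mean a mark at the start of the text or directly after whitespace.

```java
public class StrayCombiningMarks {
    // True if cp is in one of the Unicode mark categories (Mn, Mc, Me).
    static boolean isMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // True if any mark appears at the start of the text or right after whitespace.
    static boolean hasStrayCombiningMark(String text) {
        int prev = -1; // -1 means "no previous code point"
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isMark(cp) && (prev == -1 || Character.isWhitespace(prev))) {
                return true;
            }
            prev = cp;
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        // Extracted text from this report: U+0302 follows a space.
        String extracted = "special characters ( * ! & } \u0302  % and so on ) . )";
        System.out.println(hasStrayCombiningMark(extracted)); // prints "true"

        // Expected text with the plain U+005E.
        String expected = "special characters ( * ! & } ^  % and so on ) . )";
        System.out.println(hasStrayCombiningMark(expected)); // prints "false"
    }
}
```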

P.S.  Thank you for PDFBox, it is a great library!



