[jira] [Updated] (PDFBOX-3975) ExtractText converts some diacritics to combining forms that don't get combined

Tilman Hausherr (JIRA) Mon, 23 Oct 2017 10:14:38 -0700

     [ 
https://issues.apache.org/jira/browse/PDFBOX-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tilman Hausherr updated PDFBOX-3975:
------------------------------------
    Labels: diacritics  (was: )

> ExtractText converts some diacritics to combining forms that don't get 
> combined
> -------------------------------------------------------------------------------
>
>                 Key: PDFBOX-3975
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3975
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.7
>            Reporter: Matthew Self
>              Labels: diacritics
>
> When I use ExtractText on the file 
> http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf,
>  there is an issue with the "^" character on page 15.
> The extracted text is "special characters ( * ! & } ̂  % and so on ) . )".
> Note that the extracted "^" character is U+0302 (COMBINING CIRCUMFLEX ACCENT) 
> when it ought to be plain old U+005E (CIRCUMFLEX ACCENT).
> I believe that what is happening is the original U+005E character is being 
> converted to U+0302 by the DIACRITICS map in TextPosition.java:
>         map.put(0x005e, "\u0302");
> This is probably because the character slightly overlaps the preceding space 
> character.  But this combining diacritic cannot be combined with the space 
> character, so the extracted text contains the combining character instead of 
> the original.
> One solution would be to tighten up the detection of overlaps so that 
> combineDiacritic() is not called in this instance.
> Another (perhaps more robust) solution would be to verify in 
> combineDiacritic() that the call to Normalizer.normalize() actually does 
> combine the combining form of the diacritic with the previous character.  If 
> the result of calling Normalizer.normalize() has more than one character in 
> it, then the diacritic must not have been combined with the previous 
> character.  In that case, the diacritic should not be merged.
> The goal would be for the extracted text to never contain combining 
> characters that failed to combine.
> P.S.  Thank you for the great library of PDFBox!
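
The second proposal in the report (check whether normalization really merged the diacritic) can be sketched in plain Java. This is only an illustration under stated assumptions, not the PDFBox implementation: the class DiacriticCheck and the helpers combinesWith() and mergeOrKeep() are hypothetical names, and the choice of Normalizer.Form.NFC is an assumption, since the report does not say which form combineDiacritic() uses. Only java.text.Normalizer from the JDK is used.

```java
import java.text.Normalizer;

// Hypothetical helper class for illustration; not part of PDFBox.
class DiacriticCheck {

    // Returns true if base + combiningMark composes into a single code point
    // under NFC (an assumed normalization form).
    static boolean combinesWith(String base, String combiningMark) {
        String composed = Normalizer.normalize(base + combiningMark, Normalizer.Form.NFC);
        return composed.codePointCount(0, composed.length()) == 1;
    }

    // If the combining form merges with the base, return the composed text;
    // otherwise keep the original spacing character (for example U+005E).
    static String mergeOrKeep(String base, String original, String combiningMark) {
        if (combinesWith(base, combiningMark)) {
            return Normalizer.normalize(base + combiningMark, Normalizer.Form.NFC);
        }
        return base + original;
    }

    public static void main(String[] args) {
        // 'e' followed by U+0302 composes to U+00EA.
        System.out.println(combinesWith("e", "\u0302"));
        // A space followed by U+0302 has no canonical composition.
        System.out.println(combinesWith(" ", "\u0302"));
        // The failed merge falls back to the original U+005E.
        System.out.println(mergeOrKeep(" ", "^", "\u0302").equals(" ^"));
    }
}
```

The sketch treats the previous character and the diacritic as plain strings. In PDFBox itself the fallback would more likely mean leaving the diacritic as its own TextPosition rather than concatenating text, so the string handling here is a simplification.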



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

