[ 
https://issues.apache.org/jira/browse/PDFBOX-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Self updated PDFBOX-3975:
---------------------------------
    Comment: was deleted

(was: Looking more closely at the code, I see that mergeDiacritic() isn't 
actually merging the base character and the diacritic into its NFC form (a 
single character), but rather leaving it in NFD form (the base char followed by 
combining diacritic).

For example, I have a PDF document that contains the name "Krkošek".  In the 
Tj, the "š" consists of "s" followed by U+02C7 (CARON), which will be displayed 
as two characters in a text editor.  The output of ExtractText is "s" followed 
by U+030C (COMBINING CARON).  This is valid Unicode and will display correctly 
in a text editor, but it is in NFD form rather than NFC form.  The desired 
output would be the single character U+0161 (LATIN SMALL LETTER S WITH CARON), 
which is canonically equivalent but in NFC form.
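
To illustrate the difference between the three forms, here is a minimal sketch 
using only java.text.Normalizer from the JDK.  The class name CaronForms and 
the helper toNfc() are made up for this example; nothing here is PDFBox code.

```java
import java.text.Normalizer;

final class CaronForms {
    /** Returns the NFC (composed) form of the given string. */
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String standalone = "s\u02C7"; // "s" + CARON (stand-alone), as in the Tj
        String nfd = "s\u030C";        // "s" + COMBINING CARON, as extracted
        String nfc = "\u0161";         // LATIN SMALL LETTER S WITH CARON

        // The combining form composes into the single precomposed character.
        System.out.println(toNfc(nfd).equals(nfc));      // true
        System.out.println(toNfc(nfd).length());         // 1

        // The stand-alone caron has no canonical decomposition, so NFC
        // leaves the two-character string unchanged.
        System.out.println(toNfc(standalone).length());  // 2
    }
}
```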

My suggestion would be to rework this code so that instead of just converting 
the diacritics from stand-alone form to combining form, it also calls 
Normalizer.normalize() with Normalizer.Form.NFC to combine the base character 
and the diacritic.  If this results in a single character, then the output is 
in the desired NFC form.  If this results in no change to the string, then 
mergeDiacritic() should not merge the characters (even though they appear to 
overlap) and should leave the diacritic character in its original 
(stand-alone) form.

This would fix both issues (unwanted conversion of U+005E to U+0302 and failure 
to produce the NFC form U+0161).)
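
A rough sketch of this suggestion follows.  It works on plain strings rather 
than TextPosition objects, and it assumes the base text is already in NFC.  
The class DiacriticMergeSketch, the COMBINING map and merge() are hypothetical 
names, not PDFBox API; the two map entries are a stand-in for the DIACRITICS 
map, and only Normalizer.normalize() is a real JDK call.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

final class DiacriticMergeSketch {
    // Hypothetical subset of a stand-alone -> combining diacritic map.
    private static final Map<Integer, String> COMBINING = new HashMap<>();
    static {
        COMBINING.put(0x005E, "\u0302"); // CIRCUMFLEX ACCENT -> COMBINING CIRCUMFLEX ACCENT
        COMBINING.put(0x02C7, "\u030C"); // CARON -> COMBINING CARON
    }

    /**
     * Tries to merge a stand-alone diacritic into the preceding base text.
     * If NFC composition changes nothing, the diacritic is kept in its
     * original stand-alone form instead of its combining form.
     */
    static String merge(String base, String diacritic) {
        String combining = COMBINING.get(diacritic.codePointAt(0));
        if (combining == null) {
            return base + diacritic; // not a known diacritic: leave as-is
        }
        String candidate = base + combining;
        String composed = Normalizer.normalize(candidate, Normalizer.Form.NFC);
        if (composed.equals(candidate)) {
            return base + diacritic; // nothing composed: do not merge
        }
        return composed;
    }

    public static void main(String[] args) {
        // "s" + CARON composes into U+0161.
        System.out.println(merge("s", "\u02C7").equals("\u0161")); // true
        // A space cannot take a circumflex, so U+005E is left untouched.
        System.out.println(merge(" ", "^").equals(" ^"));          // true
    }
}
```

With this check, a diacritic that merely overlaps its neighbour never ends up 
in the output as an uncombined combining character.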

> ExtractText converts some diacritics to combining forms that don't get 
> combined
> -------------------------------------------------------------------------------
>
>                 Key: PDFBOX-3975
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3975
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.7
>            Reporter: Matthew Self
>
> When I use ExtractText on the file 
> http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf,
>  there is an issue with the "^" character on page 15.
> The extracted text is "special characters ( * ! & } ̂  % and so on ) . )".
> Note that the extracted "^" character is U+0302 (COMBINING CIRCUMFLEX ACCENT) 
> when it ought to be plain old U+005E (CIRCUMFLEX ACCENT).
> I believe that what is happening is the original U+005E character is being 
> converted to U+0302 by the DIACRITICS map in TextPosition.java:
>         map.put(0x005e, "\u0302");
> This is probably because the character slightly overlaps the preceding space 
> character.  But this combining diacritic can't be combined with the space 
> character, so the extracted text contains the combining character instead of 
> the original.
> One solution would be to tighten up the detection of overlaps so that 
> combineDiacritic() is not called in this instance.
> Another (perhaps more robust) solution would be to verify in 
> combineDiacritic() that the call to Normalizer.normalize() actually does 
> combine the combining form of the diacritic with the previous character.  If 
> the result of calling Normalizer.normalize() has more than one character in 
> it, then the diacritic must not have been combined with the previous 
> character.  In that case, the diacritic should not be merged.
> The goal would be for the extracted text to never contain combining 
> characters that failed to combine.
> P.S.  Thank you for the great PDFBox library!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
