[ 
https://issues.apache.org/jira/browse/PDFBOX-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-570:
-----------------------------------

    Summary: Wingdings font recognition + spacing issue  (was: Windings font 
recognition + spacing issue)

> Wingdings font recognition + spacing issue
> ------------------------------------------
>
>                 Key: PDFBOX-570
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-570
>             Project: PDFBox
>          Issue Type: Wish
>          Components: Text extraction
>    Affects Versions: 0.7.3
>         Environment: Windows XP / Java JDK 1.6.0_15 / Tika 0.4 with 
> PDFbox-0.7.3.jar and fontbox-0.1.0.jar embedded
>            Reporter: MRIT64
>         Attachments: Parsing_Result1.txt, Parsing_Result2.txt, test1.pdf, 
> test2.pdf
>
>
> Wingdings characters issue
> --------------------------
> I first filed this question in Tika's wish list (TIKA-331), but Ken Krugler 
> suggested it was a PDFBox issue.
> I have PDF files that include some characters in the Wingdings font. 
> The Tika parser replaces them with Unicode characters that have nothing to 
> do with the originals and, in some cases, with alphabetic characters. That 
> is expected given these characters' codes inside the Wingdings font, but 
> when pictures of hands are replaced by alphabetic characters like A, B, 
> etc., it disturbs further lexical analysis.
> Would it be possible to improve the parsing and replace these characters 
> with more accurate Unicode characters? 
> (see http://www.alanwood.net/demos/wingdings.html for possible 
> correspondences). 
> Attached files:
> test1.pdf is a PDF file including Wingdings characters. Some are commonly 
> used by people, others less frequently. 
> Parsing_Result1.txt is the text file produced by Tika.
> test2.pdf is another example with the same Word source file converted into 
> PDF with another tool, and Parsing_Result2.txt is the Tika parsing result. 
> Wingdings characters are translated into different Unicode characters than 
> with the previous tool.
> Spacing issue 
> -------------
> Look at lines 10 and 11 in test2.pdf. 
> Look at lines 11 and 12 in the Tika parsing result (Parsing_Result2.txt): 
> ðLocalisation des zones de livraison et de stockage 
> ðLocalisation des zones dangereuses 
> There is no space between ð and Localisation (ð is Tika's translation of 
> the Wingdings "Rightwards white arrow"). 
> If you copy and paste lines 10 and 11 in test2.pdf into a Notepad window, you 
> get: 
> ð Localisation des zones de livraison et de stockage 
> ð Localisation des zones dangereuses 
> ...with a space between ð and Localisation. 
> In my case, the missing space in the Tika parsing result causes 
> "ðLocalisation" to be treated as a single word in subsequent analysis. 
> Regards
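
The two requests above (remapping Wingdings codes to closer Unicode
characters, and keeping the symbol separate from the following word) can be
illustrated with a small post-processing sketch. This is not PDFBox or Tika
code: the class and method names are hypothetical, the table is deliberately
partial, and it assumes the caller already knows which run of text came from
the Wingdings font. Only the 0xF0 entry is named in the report; the two hand
entries are assumed from commonly published Wingdings correspondence tables.
Always inserting a space between the symbol run and the text run is also an
assumption made for this example.

```java
import java.util.Map;

public class WingdingsSketch {
    // Hypothetical, partial table: Wingdings character code -> Unicode code point.
    // 0xF0 is the "Rightwards white arrow" mentioned in the report; the two
    // hand entries are assumptions taken from published correspondence tables.
    private static final Map<Integer, Integer> WINGDINGS_TO_UNICODE = Map.of(
        0xF0, 0x21E8,  // RIGHTWARDS WHITE ARROW
        0x41, 0x270C,  // VICTORY HAND (extracted as 'A')
        0x46, 0x261E   // WHITE RIGHT POINTING INDEX (extracted as 'F')
    );

    // Remaps a run of text that is known to come from the Wingdings font.
    // Codes missing from the table are passed through unchanged.
    static String remapWingdingsRun(String run) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < run.length(); i++) {
            int code = run.charAt(i);
            out.appendCodePoint(WINGDINGS_TO_UNICODE.getOrDefault(code, code));
        }
        return out.toString();
    }

    // Joins a Wingdings run and the text run that follows it, adding one
    // space when neither side already provides whitespace at the boundary.
    static String joinRuns(String symbolRun, String textRun) {
        String symbol = remapWingdingsRun(symbolRun);
        boolean needsSpace = !symbol.isEmpty() && !textRun.isEmpty()
            && !Character.isWhitespace(symbol.charAt(symbol.length() - 1))
            && !Character.isWhitespace(textRun.charAt(0));
        return needsSpace ? symbol + " " + textRun : symbol + textRun;
    }

    public static void main(String[] args) {
        // The extracted "ð" (U+00F0) followed by the text from line 12 of
        // Parsing_Result2.txt.
        System.out.println(joinRuns("\u00F0", "Localisation des zones dangereuses"));
    }
}
```

With this sketch the second example line would come out as the arrow U+21E8,
a space, and then "Localisation des zones dangereuses", so a tokenizer no
longer sees the symbol and the word as one token. The remapping must only be
applied to Wingdings runs; applied to ordinary text it would corrupt genuine
letters such as A and F.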



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
