[ 
https://issues.apache.org/jira/browse/PDFBOX-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-3782:
------------------------------------
    Attachment: PDFBOX-3782-reduced.pdf

> WARNING: No Unicode mapping for CID+0 (0) in font RGOFPX+IPAexMincho
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-3782
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3782
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.4
>         Environment: Java/Tika
>            Reporter: Tony Bray
>            Priority: Minor
>         Attachments: PDFBOX-3782-reduced.pdf, Test doc - Japanese writing 
> system - Kanji Hiragana Katakana.pdf, Test doc - Japanese writing system - 
> Kanji Hiragana Katakana.txt
>
>
> I have a PDF document from which I am extracting content with Tika/PDFBox.  
> In several places the extracted text loses its whitespace, which causes a 
> tokenization problem for indexing/searching.  
> I have attached the original document and the text output.  If you search 
> (Ctrl+F) the text output for "Another example", you will see that there is 
> no space between "is" and the Japanese text.  The same issue shows up in 
> "whichmeans"eraser"" at the end of the sentence:  
> Another example is消しゴム (Rō- maji: keshigomu) whichmeans“eraser”
> I get the warning "WARNING: No Unicode mapping for CID+0 (0) in font 
> RGOFPX+IPAexMincho" during extraction but have been unable to find any 
> information on it.
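
To illustrate the tokenization problem described above, here is a minimal sketch that splits the extracted line on whitespace. It assumes a plain whitespace tokenizer; the reporter's actual indexing pipeline is not stated, and the class and method names are hypothetical. The second string, with the spaces restored, is an assumption about the intended text, not output from PDFBox. Non-ASCII characters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; not part of PDFBox or Tika.
final class WhitespaceTokenDemo {

    // Splits on runs of ASCII whitespace, as a simple whitespace tokenizer would.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // The line as it appears in the attached text output
        // (\u6D88\u3057\u30B4\u30E0 is the Japanese word, R\u014D is "Ro" with a macron).
        String extracted = "Another example is\u6D88\u3057\u30B4\u30E0 "
                + "(R\u014D- maji: keshigomu) whichmeans\u201Ceraser\u201D";

        // Assumed intended text, with the missing spaces restored by hand.
        String restored = "Another example is \u6D88\u3057\u30B4\u30E0 "
                + "(R\u014D- maji: keshigomu) which means \u201Ceraser\u201D";

        System.out.println(tokenize(extracted));
        System.out.println(tokenize(restored));
    }
}
```

With the extracted line, "is" and the Japanese word end up in a single token, as do "whichmeans" and the quoted word, so a search for any of those words on its own would not match under this kind of tokenizer.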



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
