[ https://issues.apache.org/jira/browse/PDFBOX-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr updated PDFBOX-3782:
------------------------------------
Attachment: PDFBOX-3782-reduced.pdf
> WARNING: No Unicode mapping for CID+0 (0) in font RGOFPX+IPAexMincho
> --------------------------------------------------------------------
>
> Key: PDFBOX-3782
> URL: https://issues.apache.org/jira/browse/PDFBOX-3782
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.4
> Environment: Java/Tika
> Reporter: Tony Bray
> Priority: Minor
> Attachments: PDFBOX-3782-reduced.pdf, Test doc - Japanese writing
> system - Kanji Hiragana Katakana.pdf, Test doc - Japanese writing system -
> Kanji Hiragana Katakana.txt
>
>
> I am using Tika/PDFBox to extract the content of a PDF document. In
> several places, the extracted content loses whitespace, which causes a
> tokenization problem for indexing/searching.
> I have attached the original document and the text output. If you search
> (Ctrl+F) the text output for "Another example", you will see that there is
> no space between "is" and the Japanese text. The same issue appears in
> "whichmeans"eraser"" at the end of the sentence:
> Another example is消しゴム (Rō- maji: keshigomu) whichmeans“eraser”
> I get the warning "WARNING: No Unicode mapping for CID+0 (0) in font
> RGOFPX+IPAexMincho" during extraction but have been unable to find any
> information on it.
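
Until the root cause is understood, one possible stop-gap on the indexing side is to re-insert a space wherever Latin text directly touches Japanese text in the extracted output. The sketch below is only an illustration under that assumption: it is not part of PDFBox or Tika, the class and method names (ScriptBoundarySpacer, insertBoundarySpaces) are hypothetical, and it uses only the JDK. It does not recover spaces lost between two Latin words (such as "whichmeans"), and it says nothing about the CID+0 warning.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical post-processing helper, not a PDFBox or Tika API.
public class ScriptBoundarySpacer {

    // True for the scripts used in Japanese text (Kanji, Hiragana, Katakana).
    static boolean isJapanese(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        return script == UnicodeScript.HAN
                || script == UnicodeScript.HIRAGANA
                || script == UnicodeScript.KATAKANA;
    }

    static boolean isLatin(int codePoint) {
        return UnicodeScript.of(codePoint) == UnicodeScript.LATIN;
    }

    // Inserts a single space wherever a Latin letter is immediately
    // followed by a Japanese character, or the other way round.
    // Existing whitespace and punctuation are left untouched.
    static String insertBoundarySpaces(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        int prev = -1;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean boundary = prev != -1
                    && ((isLatin(prev) && isJapanese(cp))
                        || (isJapanese(prev) && isLatin(cp)));
            if (boundary) {
                out.append(' ');
            }
            out.appendCodePoint(cp);
            prev = cp;
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Such a helper would be applied to the extracted string before it reaches the tokenizer. Whether a space really belongs at every such boundary depends on the document, so this is a heuristic rather than a faithful reconstruction of the original layout.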