Tony Bray created PDFBOX-3782:
---------------------------------
Summary: WARNING: No Unicode mapping for CID+0 (0) in font
RGOFPX+IPAexMincho
Key: PDFBOX-3782
URL: https://issues.apache.org/jira/browse/PDFBOX-3782
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 2.0.4
Environment: Java/Tika
Reporter: Tony Bray
Priority: Minor
Attachments: Test doc - Japanese writing system - Kanji Hiragana
Katakana.pdf, Test doc - Japanese writing system - Kanji Hiragana Katakana.txt
I am using Tika/PDFBox to extract the content of a PDF document. In several
places, the extracted text loses whitespace, which causes tokenization
problems for indexing/searching.
I have attached the original document and the text output. If you search
(Ctrl+F) the text output for "Another example", you will see that there is no
space between "is" and the Japanese text. The same issue shows up in
"whichmeans"eraser"" at the end of the sentence:
Another example is消しゴム (Rō- maji: keshigomu) whichmeans“eraser”
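As a rough illustration of the tokenization problem (this is only a sketch,
not my actual Tika/indexing pipeline), the snippet below splits the extracted
line on whitespace and compares it with the spacing I would expect. The class
name WhitespaceTokenDemo, the tokenize helper, and the "intended" string are
assumptions made up for this example.

```java
import java.util.Arrays;
import java.util.List;

public class WhitespaceTokenDemo {

    // Hypothetical helper: split on runs of whitespace, the way a simple
    // whitespace tokenizer would. This is not a Tika or PDFBox API.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // Line as it appears in the extracted text output.
        String extracted =
            "Another example is消しゴム (Rō- maji: keshigomu) whichmeans“eraser”";

        // Assumed intended spacing, with the missing spaces restored.
        String intended =
            "Another example is 消しゴム (Rō- maji: keshigomu) which means “eraser”";

        // "is" and the Japanese word end up fused into one token here.
        System.out.println(tokenize(extracted));

        // With the spaces present, they are separate, searchable tokens.
        System.out.println(tokenize(intended));
    }
}
```

With the extracted line, a search for the token 消しゴム or for "means" finds
nothing, because each is glued to its neighbour.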
I get the warning "WARNING: No Unicode mapping for CID+0 (0) in font
RGOFPX+IPAexMincho" during extraction but have been unable to find any
information on it.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)