[ https://issues.apache.org/jira/browse/PDFBOX-4265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16543411#comment-16543411 ]
Tilman Hausherr commented on PDFBOX-4265:
-----------------------------------------
Sadly, you'll have to contact the creator of that file, or use OCR. I looked at
the file with PDFDebugger, and most of the fonts either have no Unicode values
for their glyphs or have wrong ones. (That's why you got these "No Unicode
mapping for G4ec5 (7) in font TTEEo00" messages.)
That is also why Adobe can't extract the text.
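
To see which fonts are affected, one option is to tally those warnings per font
from the captured console output. The following is only a rough sketch: the
regular expression is inferred from the single message quoted above, not from a
documented log format, and the class name, the second glyph name (G4ec6) and
the second font name (TTEEo01) are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnmappedGlyphTally {
    // Assumed message shape: "No Unicode mapping for <glyph> (<code>) in font <font>".
    // Inferred from the one warning quoted in this thread; adjust if your log differs.
    private static final Pattern WARNING =
            Pattern.compile("No Unicode mapping for (\\S+) \\((\\d+)\\) in font (\\S+)");

    // Counts matching warnings per font name, keeping first-seen order.
    static Map<String, Integer> countByFont(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = WARNING.matcher(line);
            if (m.find()) {
                counts.merge(m.group(3), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines; only the first one appears in this thread.
        List<String> sample = Arrays.asList(
                "No Unicode mapping for G4ec5 (7) in font TTEEo00",
                "No Unicode mapping for G4ec6 (8) in font TTEEo00",
                "No Unicode mapping for G4ec5 (7) in font TTEEo01",
                "some unrelated log line");
        System.out.println(countByFont(sample));
    }
}
```

A font with many such warnings lacks usable Unicode mappings, so no extraction
setting will recover its text; that leaves the file's creator or OCR.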
> Not able to extract text from Japanese PDF
> ------------------------------------------
>
> Key: PDFBOX-4265
> URL: https://issues.apache.org/jira/browse/PDFBOX-4265
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.2
> Environment: Windows 10, Region settings set to Japanese
> Reporter: Viral Valand
> Priority: Critical
> Attachments: CommandLine.txt, jpn.pdf, jpn.txt
>
>
> Not able to extract text from the attached Japanese PDF (jpn.pdf),
> although extraction works well with another Japanese PDF.
>
> Also, is there an overloaded method that accepts an encoding for text
> extraction? If so, please let us know.
>
> Thank you.