[jira] [Commented] (PDFBOX-4265) Not able to extract text from Japanese PDF

Tilman Hausherr (JIRA) Fri, 13 Jul 2018 09:33:15 -0700


    [ https://issues.apache.org/jira/browse/PDFBOX-4265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16543411#comment-16543411 ]


Tilman Hausherr commented on PDFBOX-4265:
-----------------------------------------

Sadly, you'll have to contact the creator of that file, or use OCR. I looked at 
the file with PDFDebugger, and most of the fonts either have no Unicode values 
for their glyphs or have wrong ones. (That's why you got these "No Unicode 
mapping for G4ec5 (7) in font TTEEo00" messages.)

That is why Adobe can't extract text either.
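
To gauge how widespread the problem is in a given file, the warnings can be 
tallied per font. The following is only a minimal Python sketch, not part of 
PDFBox: it assumes the extraction log has been saved as text and that every 
warning follows the wording quoted above. The function name 
missing_mappings_by_font, the sample log, and its second glyph (G65e5, code 12) 
are hypothetical, and the number in parentheses is assumed to be the character 
code.

```python
import re
from collections import Counter

# Pattern modelled on the warning quoted in the comment above:
#   No Unicode mapping for G4ec5 (7) in font TTEEo00
# Groups: glyph name, number in parentheses (assumed: character code), font name.
WARNING = re.compile(r"No Unicode mapping for (\S+) \((\d+)\) in font (\S+)")


def missing_mappings_by_font(log_text):
    """Count distinct unmapped character codes per font in a saved log."""
    seen = set()
    for _glyph, code, font in WARNING.findall(log_text):
        # The same warning repeats for every occurrence of a glyph,
        # so deduplicate on (font, code) before counting.
        seen.add((font, int(code)))
    return Counter(font for font, _code in seen)


# Hypothetical sample: only the first message appears in the original report.
sample_log = """\
WARNING: No Unicode mapping for G4ec5 (7) in font TTEEo00
WARNING: No Unicode mapping for G4ec5 (7) in font TTEEo00
WARNING: No Unicode mapping for G65e5 (12) in font TTEEo00
"""

counts = missing_mappings_by_font(sample_log)
print(counts)
```

A font with many distinct unmapped codes cannot be fixed on the extraction 
side, which is consistent with the advice to contact the creator or use OCR.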

> Not able to extract text from Japanese PDF
> ------------------------------------------
>
>                 Key: PDFBOX-4265
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4265
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.2
>         Environment: Windows 10, Region settings set to Japanese
>            Reporter: Viral Valand
>            Priority: Critical
>         Attachments: CommandLine.txt, jpn.pdf, jpn.txt
>
>
> Not able to extract text from the attached Japanese PDF (jpn.pdf), 
> although it works well with another Japanese PDF.
>  
> Also, is there any overloaded method that accepts an encoding for text 
> extraction? If yes, please let us know.
>  
> Thank you.



