[jira] [Commented] (PDFBOX-4210) Unable to extract the text from a PDF ("No Unicode mapping.." warnings)

Maruan Sahyoun (JIRA) Tue, 08 May 2018 05:06:13 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16467306#comment-16467306 ]


Maruan Sahyoun commented on PDFBOX-4210:
----------------------------------------

So both Reader and PDFBox produce the same text extraction result: there is 
no text to extract. 

But what about the Copy & Paste result? This invisible text is meant for 
interactive search and was generated using OCR, but it is not part of the text 
content of this PDF. We could look at using such text, if it exists, as a 
fallback if, and only if, there is no text to extract. Strictly speaking, 
though, this PDF has no extractable text (according to the PDF spec).

If we do this, I would suggest that it not become part of the 'core' text 
extraction. It should instead be implemented as a utility/application, as this 
is more application-oriented.
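
To make the "if, and only if" rule concrete, here is a minimal sketch in plain 
Java. It deliberately uses no PDFBox API: FallbackTextSketch, chooseText and 
the searchTextSource supplier are hypothetical names used only for 
illustration, and how the OCR-generated search text would actually be read 
from the PDF is left open.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical utility-level helper; not part of PDFBox.
final class FallbackTextSketch {

    private FallbackTextSketch() {
    }

    /**
     * Returns the regular extraction result unless it is null or blank.
     * Only in that case is the (assumed) search-text source consulted,
     * so the fallback never overrides real extractable text.
     */
    static String chooseText(String extracted,
                             Supplier<Optional<String>> searchTextSource) {
        if (extracted != null && !extracted.trim().isEmpty()) {
            return extracted;
        }
        // The supplier is evaluated lazily: no fallback work is done
        // when the PDF has extractable text.
        return searchTextSource.get().orElse("");
    }
}
```

An application would pass the output of its normal text extraction as the 
first argument and wrap whatever reads the search text in the supplier. 
Keeping this outside the core extraction means the core result still reflects 
what the PDF spec considers extractable text.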

[~aputnik] You've written in the other issue that you are using OCR as a 
fallback. Which tool are you using for that? 

> Unable to extract the text from a PDF ("No Unicode mapping.." warnings)
> -----------------------------------------------------------------------
>
>                 Key: PDFBOX-4210
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4210
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.9
>            Reporter: Aleksandar Putnik
>            Priority: Major
>         Attachments: Testdokument.pdf
>
>
> I'm using Tika (v1.18, which means pdfbox 2.0.9) to extract the text from a 
> PDF.
> I have a document from which Acrobat Reader (Adobe Acrobat Reader DC) can 
> extract the text (although not with 100% precision).
> Apart from the warnings "WARNING: No Unicode mapping for ... in font 
> ArialMT", pdfbox 2.0.9 doesn't return anything.
> As you can see from the warning, the font in question is ArialMT. It uses a 
> custom encoding and the PDF doesn't include a ToUnicode mapping. The font 
> type is CID TrueType (this info is provided by "pdffonts").
> "pdftotext" also can't extract anything and only shows an error: `Syntax 
> Error: Unknown character collection 'Adobe-ArialMT'`
> The PDF producer (used by the customer) is VintaSoft PDF .NET Plug-in v5.5.
> I would like to determine whether there is a bug in pdfbox or whether the 
> PDF producer has to adjust and improve the "readability" of the PDF.
>  
>  



