[ 
https://issues.apache.org/jira/browse/PDFBOX-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16467264#comment-16467264
 ] 

Tilman Hausherr commented on PDFBOX-4210:
-----------------------------------------

Copy & paste in Adobe Reader does bring text; the same goes for PDF.js and 
Chrome, but not Edge. There is no ToUnicode stream. There is an encoding CMap, 
but that is something different: the Unicode mapping would have been in a 
"beginbfchar" or "beginbfrange" section, and there is no such thing. So Adobe, 
PDF.js and Chrome must have some fallback logic, and I don't know how it works 
or why.
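
To illustrate the difference between the two kinds of CMap, here is a minimal 
Java sketch. It does not use PDFBox; the class and method names (CMapSketch, 
hasUnicodeMappings, parseBfChars) are made up for this example. It assumes the 
CMap stream has already been decoded to text, and it only handles "bfchar" 
entries (ranges are detected but not expanded).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: inspects the text of a CMap.
final class CMapSketch {
    // One "beginbfchar ... endbfchar" section of a ToUnicode CMap.
    private static final Pattern BFCHAR_SECTION =
            Pattern.compile("beginbfchar(.*?)endbfchar", Pattern.DOTALL);
    // One "<sourceCode> <utf16beHex>" pair inside such a section.
    private static final Pattern PAIR =
            Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>");

    // A ToUnicode CMap carries bfchar/bfrange sections; an encoding CMap
    // (code -> CID) uses cidchar/cidrange sections instead.
    static boolean hasUnicodeMappings(String cmapText) {
        return cmapText.contains("beginbfchar")
                || cmapText.contains("beginbfrange");
    }

    // Collects the character code -> Unicode string pairs of all bfchar
    // sections. bfrange sections are ignored in this sketch.
    static Map<Integer, String> parseBfChars(String cmapText) {
        Map<Integer, String> result = new LinkedHashMap<>();
        Matcher section = BFCHAR_SECTION.matcher(cmapText);
        while (section.find()) {
            Matcher pair = PAIR.matcher(section.group(1));
            while (pair.find()) {
                int code = Integer.parseInt(pair.group(1), 16);
                result.put(code, decodeUtf16Be(pair.group(2)));
            }
        }
        return result;
    }

    // The destination is hex-encoded UTF-16BE, four hex digits per unit.
    private static String decodeUtf16Be(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            sb.append((char) Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return sb.toString();
    }
}
```

For a CMap that contains only "begincidrange" sections, hasUnicodeMappings 
returns false and parseBfChars returns an empty map, which corresponds to the 
situation described above: codes can be mapped to CIDs, but not to Unicode.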

> Unable to extract the text from a PDF ("No Unicode mapping.." warnings)
> -----------------------------------------------------------------------
>
>                 Key: PDFBOX-4210
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4210
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.9
>            Reporter: Aleksandar Putnik
>            Priority: Major
>         Attachments: Testdokument.pdf
>
>
> I'm using Tika (v1.18, which means pdfbox 2.0.9) to extract the text from a 
> PDF.
> I have a document from which Acrobat Reader (Adobe Acrobat Reader DC) can 
> extract the text (although not with 100% precision).
> Apart from the warnings "WARNING: No Unicode mapping for ... in font 
> ArialMT", pdfbox 2.0.9 doesn't return anything.
> As you can see from the warning, the font in question is ArialMT. It uses a 
> custom encoding and the pdf doesn't include a ToUnicode mapping. The font 
> type is CID TrueType (this info is provided by "pdffonts").
> "pdftotext" also can't extract anything and only shows an error `Syntax 
> Error: Unknown character collection 'Adobe-ArialMT'`.
> The pdf producer (used by the customer) is VintaSoft PDF .NET Plug-in v5.5.
> I would like to determine whether there is a bug in pdfbox or whether the 
> pdf producer has to be adjusted to improve the "readability" of the pdf.
>  
>  


