[
https://issues.apache.org/jira/browse/PDFBOX-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aleksandar Putnik updated PDFBOX-4210:
--------------------------------------
Attachment: Testdokument.pdf
> Unable to extract the text from a PDF ("No Unicode mapping.." warnings)
> -----------------------------------------------------------------------
>
> Key: PDFBOX-4210
> URL: https://issues.apache.org/jira/browse/PDFBOX-4210
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.9
> Reporter: Aleksandar Putnik
> Priority: Major
> Attachments: Testdokument.pdf
>
>
> I'm using Tika (v1.18, which uses PDFBox 2.0.9) to extract the text from a
> PDF.
> I have a document from which Acrobat Reader (Adobe Acrobat Reader DC) can
> extract the text (although not with 100% precision).
> Apart from the warnings "WARNING: No Unicode mapping for ... in font ArialMT",
> PDFBox 2.0.9 doesn't return anything.
> As you can see from the warning, the font in question is ArialMT. It uses a
> custom encoding, and the PDF doesn't include a ToUnicode mapping. The font
> type is CID TrueType (this information is provided by "pdffonts").
> "pdftotext" can't extract anything either; it only shows the error `Syntax
> Error: Unknown character collection 'Adobe-ArialMT'`.
> The PDF producer (used by the customer) is VintaSoft PDF .NET Plug-in v5.5.
> I would like to determine whether this is a bug in PDFBox or whether the PDF
> producer needs to be adjusted to improve the "readability" of its PDFs.
>
>
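One plausible reading of the symptoms above is that an extractor has nothing
to map character codes to text with: the font carries no ToUnicode mapping, and
its character collection ('Adobe-ArialMT') is not one with a known table. The
Java sketch below is a simplified model of that lookup order, not PDFBox's
actual code. The class UnicodeLookupSketch, the FontInfo type, the
KNOWN_COLLECTIONS table and the example codes are all hypothetical and exist
only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UnicodeLookupSketch {

    /** Hypothetical stand-in for the relevant font entries; not a PDFBox class. */
    static final class FontInfo {
        final String name;
        final String collection;               // registry-ordering, e.g. "Adobe-ArialMT"
        final Map<Integer, String> toUnicode;  // null when the PDF has no ToUnicode mapping

        FontInfo(String name, String collection, Map<Integer, String> toUnicode) {
            this.name = name;
            this.collection = collection;
            this.toUnicode = toUnicode;
        }
    }

    /**
     * Tables for well-known character collections (for example Adobe-Japan1).
     * Left empty in this sketch; "Adobe-ArialMT" would not be in it either way.
     */
    static final Map<String, Map<Integer, String>> KNOWN_COLLECTIONS = new HashMap<>();

    /** Returns the Unicode text for a character code, or null if nothing maps it. */
    static String resolve(FontInfo font, int code) {
        // 1. Prefer the mapping embedded in the PDF, if there is one.
        if (font.toUnicode != null && font.toUnicode.containsKey(code)) {
            return font.toUnicode.get(code);
        }
        // 2. Fall back to a table for a known character collection.
        Map<Integer, String> table = KNOWN_COLLECTIONS.get(font.collection);
        return table == null ? null : table.get(code);
    }

    /** Extracts text, recording a warning for every code that cannot be mapped. */
    static String extract(FontInfo font, int[] codes, List<String> warnings) {
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            String unicode = resolve(font, code);
            if (unicode == null) {
                warnings.add("No Unicode mapping for " + code + " in font " + font.name);
            } else {
                text.append(unicode);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        int[] codes = {43, 76};  // arbitrary example codes, not taken from Testdokument.pdf

        // Case from the report: custom collection, no ToUnicode mapping.
        List<String> warnings = new ArrayList<>();
        FontInfo bare = new FontInfo("ArialMT", "Adobe-ArialMT", null);
        String text = extract(bare, codes, warnings);
        System.out.println("text='" + text + "' warnings=" + warnings.size());

        // Same font with a ToUnicode mapping (the mapping values are made up).
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(43, "H");
        toUnicode.put(76, "i");
        List<String> noWarnings = new ArrayList<>();
        FontInfo mapped = new FontInfo("ArialMT", "Adobe-ArialMT", toUnicode);
        String mappedText = extract(mapped, codes, noWarnings);
        System.out.println("text='" + mappedText + "' warnings=" + noWarnings.size());
    }
}
```

In this model the first case yields only warnings and an empty string, which
matches the behaviour described in the report, while the second case extracts
text once a ToUnicode mapping is present. If that reading is right, the more
robust fix would be for the producer to write a ToUnicode mapping.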
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)