[jira] [Updated] (PDFBOX-4210) Unable to extract the text from a PDF ("No Unicode mapping.." warnings)

Aleksandar Putnik (JIRA) Tue, 08 May 2018 00:22:20 -0700

     [ https://issues.apache.org/jira/browse/PDFBOX-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Aleksandar Putnik updated PDFBOX-4210:
--------------------------------------
    Attachment: Testdokument.pdf

> Unable to extract the text from a PDF ("No Unicode mapping.." warnings)
> -----------------------------------------------------------------------
>
>                 Key: PDFBOX-4210
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4210
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.9
>            Reporter: Aleksandar Putnik
>            Priority: Major
>         Attachments: Testdokument.pdf
>
>
> I'm using Tika v1.18 (which uses PDFBox 2.0.9) to extract the text from a 
> PDF.
> I have a document from which Acrobat Reader (Adobe Acrobat Reader DC) can 
> extract the text (although not with 100% precision).
> Apart from the warnings "WARNING: No Unicode mapping for ... in font ArialMT", 
> PDFBox 2.0.9 doesn't return anything.
> As the warning shows, the font in question is ArialMT. It uses a custom 
> encoding, and the PDF doesn't include a ToUnicode mapping. The font type is 
> CID TrueType (as reported by "pdffonts").
> "pdftotext" also can't extract anything and only shows the error `Syntax 
> Error: Unknown character collection 'Adobe-ArialMT'`.
> The PDF producer (used by the customer) is VintaSoft PDF .NET Plug-in v5.5.
> I would like to determine whether this is a bug in PDFBox or whether the PDF 
> producer needs to be adjusted to improve the "readability" of the PDF.
>  
>  
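
When triaging a report like the one above, it can help to count the "No Unicode mapping" warnings per font, to confirm whether only ArialMT is affected. The following is a minimal sketch using only the Java standard library. It assumes the warnings have been captured as plain text lines; the class name UnicodeWarningTally, the method tally, and the sample lines are hypothetical, and the glyph identifier in the middle of each warning is treated as opaque because the report elides it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeWarningTally {
    // Matches the warning text quoted in the report. The part between
    // "for" and "in font" is captured but not interpreted (assumption:
    // its exact format is not given in the report).
    private static final Pattern WARNING =
        Pattern.compile("No Unicode mapping for (.+) in font (\\S+)");

    /** Counts "No Unicode mapping" warnings per font name. */
    public static Map<String, Integer> tally(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = WARNING.matcher(line);
            if (m.find()) {
                // group(2) is the font name that follows "in font"
                counts.merge(m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical log lines; "<glyph-a>" and "<glyph-b>" stand in
        // for the identifier elided in the report.
        List<String> sample = Arrays.asList(
            "WARNING: No Unicode mapping for <glyph-a> in font ArialMT",
            "WARNING: No Unicode mapping for <glyph-b> in font ArialMT",
            "INFO: unrelated line");
        System.out.println(tally(sample));
    }
}
```

A tally in which every warning names the same font would be consistent with the description above, where the single font lacks a ToUnicode mapping. It does not by itself show whether the fault lies with PDFBox or with the PDF producer.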



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
