[ 
https://issues.apache.org/jira/browse/PDFBOX-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465619#comment-16465619
 ] 

Aleksandar Putnik commented on PDFBOX-3438:
-------------------------------------------

Hi,

I see that the ticket is closed, but its title matches my problem well.

Unlike the first case, here I have a document from which Acrobat Reader 
(Adobe Acrobat Reader DC) can extract the text (although not with 100% 
accuracy).

PDFBox 1.8.6 returns question marks, while PDFBox 2.0.9 returns nothing (apart 
from the warnings about missing Unicode mappings).

The font in question is ArialMT with a custom encoding, and the PDF doesn't 
include a ToUnicode mapping.
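
To illustrate why a missing ToUnicode mapping matters, here is a toy sketch 
in plain Java. It is not PDFBox code and makes no claim about how PDFBox is 
implemented; the class `ToUnicodeSketch`, the `Fallback` enum and the 
character codes are all hypothetical. It only mirrors the two outputs 
described above: one fallback policy emits a question mark for every unmapped 
code, the other emits nothing.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: all names here are hypothetical, none come from PDFBox.
class ToUnicodeSketch {

    // What to do when a character code has no Unicode mapping.
    enum Fallback { QUESTION_MARK, SKIP }

    // Translate character codes to text using a code-to-Unicode table.
    static String extract(int[] codes, Map<Integer, String> toUnicode, Fallback fallback) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            String text = toUnicode.get(code);
            if (text != null) {
                out.append(text);
            } else if (fallback == Fallback.QUESTION_MARK) {
                out.append('?');
            }
            // Fallback.SKIP: the unmapped code contributes nothing.
        }
        return out.toString();
    }

    public static void main(String[] args) {
        int[] codes = {1, 2, 3}; // made-up codes from a custom encoding

        Map<Integer, String> withMapping = new HashMap<>();
        withMapping.put(1, "P");
        withMapping.put(2, "D");
        withMapping.put(3, "F");
        Map<Integer, String> withoutMapping = new HashMap<>();

        System.out.println(extract(codes, withMapping, Fallback.SKIP));             // PDF
        System.out.println(extract(codes, withoutMapping, Fallback.QUESTION_MARK)); // ???
        System.out.println(extract(codes, withoutMapping, Fallback.SKIP).isEmpty()); // true
    }
}
```

With a custom encoding the codes alone say nothing about which characters 
they stand for, so without such a table an extractor can only guess or give 
up.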

`pdftotext` can't extract anything either; it only shows the error `Syntax Error: 
Unknown character collection 'Adobe-ArialMT'`.

The culprit may also be the PDF producer used by the customer, 
VintaSoft PDF .NET Plug-in v5.5, but I'm really not sure about that.

What should be the next step here?

Thanks

 

> only garbage extracted, lots of warnings "No Unicode mapping..."
> ----------------------------------------------------------------
>
>                 Key: PDFBOX-3438
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3438
>             Project: PDFBox
>          Issue Type: Wish
>          Components: Text extraction
>    Affects Versions: 2.0.2
>            Reporter: Oliver Steinau
>            Priority: Major
>         Attachments: PDFBOX-3438.diff, PDFBOX-3438.txt, test.pdf
>
>
> When I try to extract text from this PDF, I get lots of warnings "No Unicode 
> mapping for ...", and as output I only get garbage.
> PDF file displays fine in Acrobat Reader, and pdftotext.exe will extract the 
> text just fine.
> PDF file seems to have a Type-1 font embedded with a custom encoding.


