[jira] [Commented] (PDFBOX-3438) only garbage extracted, lots of warnings "No Unicode mapping..."

Tilman Hausherr (JIRA) Fri, 13 Jan 2017 22:43:13 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822728#comment-15822728
 ]


Tilman Hausherr commented on PDFBOX-3438:
-----------------------------------------

You can inspect the text with the PDFDebugger command line tool (download the
pdfbox-app jar file and run it). Open Page, Resources, Font, then click on
each font and look at the "Glyph name" column. In the test.pdf file attached
here, F2 is the font whose glyph names start with "C0032".
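
A glyph name such as "C0032" is probably what triggers the "No Unicode
mapping" warnings: it is not a standard glyph name, and it does not follow the
"uniXXXX" or "uXXXX" forms described by the Adobe Glyph List conventions. The
sketch below illustrates that kind of lookup in plain Java. It is not PDFBox
code. The class GlyphNameSketch, its toUnicode method and the three-entry
table are hypothetical stand-ins, and only the single-code-point "uni" form is
handled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; this is NOT how PDFBox is implemented.
class GlyphNameSketch {

    // Tiny stand-in for a glyph list; a real list has thousands of entries.
    private static final Map<String, String> GLYPH_LIST = new HashMap<>();
    static {
        GLYPH_LIST.put("space", " ");
        GLYPH_LIST.put("A", "A");
        GLYPH_LIST.put("eacute", "\u00E9");
    }

    // Returns the Unicode string for a glyph name, or null if none can be derived.
    static String toUnicode(String name) {
        String known = GLYPH_LIST.get(name);
        if (known != null) {
            return known;
        }
        // "uniXXXX": four uppercase hex digits (single code point case only).
        if (name.matches("uni[0-9A-F]{4}")) {
            return fromHex(name.substring(3));
        }
        // "uXXXX" up to "uXXXXXX": four to six uppercase hex digits.
        if (name.matches("u[0-9A-F]{4,6}")) {
            return fromHex(name.substring(1));
        }
        // Names such as "C0032" match none of the rules above.
        return null;
    }

    private static String fromHex(String hex) {
        int codePoint = Integer.parseInt(hex, 16);
        // Reject values outside Unicode and the surrogate range.
        if (codePoint > 0x10FFFF || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
            return null;
        }
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"space", "uni0041", "C0032"}) {
            String unicode = toUnicode(name);
            System.out.println(name + " -> "
                    + (unicode == null ? "no Unicode mapping" : "\"" + unicode + "\""));
        }
    }
}
```

A name that falls through every rule returns null here, which corresponds to
the case where an extractor has nothing to emit for that glyph unless the font
also supplies another mapping.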

> only garbage extracted, lots of warnings "No Unicode mapping..."
> ----------------------------------------------------------------
>
>                 Key: PDFBOX-3438
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3438
>             Project: PDFBox
>          Issue Type: Wish
>          Components: Text extraction
>    Affects Versions: 2.0.2
>            Reporter: Oliver Steinau
>         Attachments: PDFBOX-3438.diff, PDFBOX-3438.txt, test.pdf
>
>
> When I try to extract text from this PDF, I get lots of warnings "No Unicode
> mapping for ...", and the output is only garbage.
> The PDF file displays fine in Acrobat Reader, and pdftotext.exe extracts the
> text without problems.
> The PDF file seems to have an embedded Type 1 font with a custom encoding.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

