[
https://issues.apache.org/jira/browse/PDFBOX-3886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113269#comment-16113269
]
Tilman Hausherr edited comment on PDFBOX-3886 at 8/3/17 6:39 PM:
-----------------------------------------------------------------
Please read this:
https://pdfbox.apache.org/2.0/faq.html#notext
I had a look at the two files - the fonts have no ToUnicode stream, so PDFBox
can't know which characters the glyphs stand for. (That's also why you're
getting so many warning messages.)
You have to understand that what you see is just vector drawings (= glyphs).
They are missing the Unicode assignment. This is possibly done on purpose, to
prevent text extraction. All that can be done is OCR. Apache Tika has an
option for this.
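As a rough first check before opening a file in a PDF debugger, one can scan
the raw bytes for the /ToUnicode key. The sketch below uses only the Java
standard library, not the PDFBox API; the class and method names
(ToUnicodeScan, countToken, mentionsToUnicode) are made up for illustration.
It is only a heuristic: font dictionaries stored in compressed object streams
will not be seen, and a hit does not prove that every font has a mapping.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox: crude byte-level scan of a PDF.
class ToUnicodeScan {

    // Counts non-overlapping occurrences of an ASCII token in the raw bytes.
    static int countToken(byte[] data, String token) {
        byte[] needle = token.getBytes(StandardCharsets.US_ASCII);
        int count = 0;
        int i = 0;
        while (i <= data.length - needle.length) {
            boolean match = true;
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                count++;
                i += needle.length;
            } else {
                i++;
            }
        }
        return count;
    }

    // True if the uncompressed part of the file mentions /ToUnicode at all.
    // Assumption: dictionaries inside compressed object streams are missed.
    static boolean mentionsToUnicode(byte[] pdfBytes) {
        return countToken(pdfBytes, "/ToUnicode") > 0;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java ToUnicodeScan <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("/ToUnicode keys found: " + countToken(bytes, "/ToUnicode"));
    }
}
```

If the count is zero for a file that also yields empty text, a missing
ToUnicode mapping is a likely explanation, but the fonts should still be
inspected properly to confirm it.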
> PdfBox is not able to extract text from the documents attached
> --------------------------------------------------------------
>
> Key: PDFBOX-3886
> URL: https://issues.apache.org/jira/browse/PDFBOX-3886
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing, Text extraction
> Affects Versions: 2.0.5, 2.0.6, 2.0.7
> Environment: Windows 10 64-bit, Ubuntu 14.04 64-bit.
> java version "1.8.0_141"
> Java(TM) SE Runtime Environment (build 1.8.0_141-b15)
> Java HotSpot(TM) 64-Bit Server VM (build 25.141-b15, mixed mode)
> Reporter: Harun Reşit Zafer
> Labels: extraction
> Attachments: non-contract_00099.pdf, non-contract_01346_form.pdf
>
>
> PdfBox returns a few empty lines for the documents attached. This is tested
> with versions 2.0.5, 2.0.6, and 2.0.7.