[ https://issues.apache.org/jira/browse/PDFBOX-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16603449#comment-16603449 ]
Tilman Hausherr commented on PDFBOX-4311:
-----------------------------------------
The "text" you see is an image, so there is no text to extract. You can
verify this by trying to select, copy and paste it in Adobe Reader.
[https://pdfbox.apache.org/2.0/faq.html#text-extraction]
So, sadly, there is nothing we can do this time. Btw the current version is
2.0.11 (but that won't help either). Sorry for not having better news.
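
For anyone who wants to catch such files automatically, here is a minimal
sketch of one possible check. It is plain Java and does not call PDFBox
itself: it assumes you have already collected the extracted text of each page
into a list of strings. The class name ScannedPdfCheck, the method name
hasNoExtractableText and the sample strings are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class ScannedPdfCheck {

    // Returns true when no page yielded any non-whitespace character,
    // which is what extraction from an image-only (scanned) PDF looks like.
    public static boolean hasNoExtractableText(List<String> pageTexts) {
        for (String text : pageTexts) {
            if (text != null && !text.trim().isEmpty()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical per-page extraction results, not taken from the attached file.
        List<String> scanned = Arrays.asList("", "\n", "  \r\n");
        List<String> normal = Arrays.asList("Claim number 123", "");

        System.out.println(hasNoExtractableText(scanned)); // prints "true"
        System.out.println(hasNoExtractableText(normal));  // prints "false"
    }
}
```

With PDFBox 2.0.x the per-page strings could come from a PDFTextStripper,
calling setStartPage and setEndPage with the same page number before getText
on a document opened with PDDocument.load. A true result only suggests that
the file is image-only; getting the text out of such a file would need OCR,
which PDFBox does not provide.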
> Unable to parse some pdf's using pdfbox.
> ----------------------------------------
>
> Key: PDFBOX-4311
> URL: https://issues.apache.org/jira/browse/PDFBOX-4311
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.9
> Environment: Pdfbox -2.0.9
> Pdfbox-tools - 2.0.9
> Java - 1.7
> Scala - 2.10.6
> Reporter: Krishna Dheeraj
> Priority: Major
> Attachments:
> upload_user4024353_claimnr283909709_healthpartners_2018-06-17.pdf
>
>
> When I tried to convert the PDF file into HTML for parsing, the content in the
> body is empty and no errors or exceptions are thrown. This happens for
> only a few files; others are working as expected. I am attaching the
> file which we are unable to parse. Let us know in case any
> resolutions are available.