[jira] [Commented] (PDFBOX-4311) Unable to parse some pdf's using pdfbox.

Tilman Hausherr (JIRA) Tue, 04 Sep 2018 11:57:16 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16603449#comment-16603449
 ]


Tilman Hausherr commented on PDFBOX-4311:
-----------------------------------------

The "text" you see is an image, so there is no text to extract. You can verify 
this by trying to select, copy, and paste it in Adobe Reader.

[https://pdfbox.apache.org/2.0/faq.html#text-extraction]

So, sadly, there is nothing we can do this time. By the way, the current 
version is 2.0.11, but upgrading won't help either. Sorry for not having 
better news.
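
For callers who want to spot such files up front, a minimal sketch of the check 
might look like the following. It assumes the caller has already obtained the 
extracted text (for example from PDFTextStripper.getText) and has counted the 
image XObjects in the page resources (for example PDImageXObject instances). 
The class and method names are hypothetical and are not part of PDFBox, and the 
rule "no visible text plus at least one image" is only a heuristic.

```java
// Hypothetical helper, not part of PDFBox. It only interprets values that the
// caller is assumed to have obtained from PDFBox beforehand.
final class ImageOnlyCheck {

    private ImageOnlyCheck() {
        // static helpers only
    }

    /** True when the extracted text is null or contains only whitespace. */
    static boolean hasNoText(String extractedText) {
        return extractedText == null || extractedText.trim().isEmpty();
    }

    /**
     * Heuristic (an assumption of this sketch): a document with no extractable
     * text but at least one image probably shows its "text" as a picture.
     */
    static boolean isLikelyImageOnly(String extractedText, int imageCount) {
        return hasNoText(extractedText) && imageCount > 0;
    }

    public static void main(String[] args) {
        // Sample values standing in for real PDFBox results.
        String extracted = "\r\n\r\n";
        int images = 1;
        if (isLikelyImageOnly(extracted, images)) {
            System.out.println("No extractable text; the visible text is probably an image.");
        }
    }
}
```

A converter could run this check before producing HTML and log a warning 
instead of silently emitting an empty body.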

> Unable to parse some pdf's using pdfbox.
> ----------------------------------------
>
>                 Key: PDFBOX-4311
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4311
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.9
>         Environment: Pdfbox -2.0.9
> Pdfbox-tools - 2.0.9
> Java - 1.7
> Scala - 2.10.6
>            Reporter: Krishna Dheeraj
>            Priority: Major
>         Attachments: 
> upload_user4024353_claimnr283909709_healthpartners_2018-06-17.pdf
>
>
> When I tried to convert the PDF file into HTML for parsing, the content in the 
> body is empty and there are no errors or exceptions thrown. It is happening 
> for only a few files; others are working as expected. I am attaching the 
> file which we are unable to parse. Let us know in case any 
> resolutions are available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

