[ https://issues.apache.org/jira/browse/PDFBOX-4431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740706#comment-16740706 ]
Tilman Hausherr commented on PDFBOX-4431:
-----------------------------------------
PDFBox can extract all text when the fonts work and the PDF has Unicode
mappings, so I don't understand your question. The attached PDF was fine for
me. The tool you tried may not work fully, but I haven't investigated it.
Maybe it has a bug, or maybe it works only with some PDF structures.

You can convert text back to PDF with the TextToPDF utility, which is also part
of pdfbox-app.
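
One rough way to tell whether the problem is in text extraction or in the
search logic is to run the search on the extracted text directly. The sketch
below assumes the text has already been obtained as a plain String (for
example from PDFTextStripper.getText(document) in PDFBox 2.x, which is left
out here so the example has no dependencies). The class name KeywordFinder,
the method findAll and the sample string are hypothetical and are not taken
from the reporter's code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: finds a search term in text
// that has already been extracted from a PDF.
class KeywordFinder {

    // Returns the start offset of every case-insensitive occurrence of
    // term in text. Overlapping occurrences are reported as well.
    static List<Integer> findAll(String text, String term) {
        List<Integer> offsets = new ArrayList<>();
        if (term.isEmpty()) {
            return offsets; // an empty term would match at every position
        }
        int last = text.length() - term.length();
        for (int i = 0; i <= last; i++) {
            // regionMatches compares in place, so offsets refer to the
            // original string rather than to a lower-cased copy.
            if (text.regionMatches(true, i, term, 0, term.length())) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Assumption: stand-in for text returned by the extraction step.
        String extracted = "Invoice total. The TOTAL is due on receipt.";
        System.out.println(findAll(extracted, "total"));
    }
}
```

If a term is found in the extracted text this way but not by the tool, the
extraction is probably fine and the tool's matching or position handling is
the more likely place to look.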
> PDFBox recognizes only a few words
> ----------------------------------
>
> Key: PDFBOX-4431
> URL: https://issues.apache.org/jira/browse/PDFBOX-4431
> Project: PDFBox
> Issue Type: Bug
> Components: Documentation, Text extraction
> Environment: OS: Windows 10.
> IDE: Oxygen.3a Release (4.7.3a)
> PDF version: Adobe Acrobat Pro DC - 2019.010.20069.49826
> Reporter: Krutheeka Rajkumar
> Priority: Major
> Attachments: RS13170.pdf, RS13170.txt
>
>
> The code I have posted takes 5 arguments, which include the path to a PDF
> document and a search term. The code parses the PDF document and returns all
> matches of the search term, with their locations reported in the format
> given by the last argument.
> For some reason the code recognizes only a few words and errors on other
> words. I am not sure why this is.
> There seems to be no difference between these words in terms of font, size,
> location, etc.