[ https://issues.apache.org/jira/browse/PDFBOX-4431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740706#comment-16740706 ]

Tilman Hausherr commented on PDFBOX-4431:
-----------------------------------------

PDFBox can extract all text when the fonts work and the PDF has Unicode 
mappings, so I don't understand your question. For me, the attached PDF was 
fine. The tool you tried may not work fully, but I haven't investigated that 
one. Maybe it has a bug, or maybe it works only with some PDF structures.
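
One way to narrow this down, as a rough sketch rather than a fix: search the 
already-extracted text directly and see whether the missing words are found. 
The class and method names below (KeywordFinder, normalize, findAll) are made 
up for illustration and are not part of PDFBox. The input is assumed to be text 
that was extracted beforehand, for example with PDFTextStripper. Whitespace is 
collapsed and case is ignored, on the assumption that line breaks or 
capitalization in the extracted text might otherwise hide a match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of PDFBox.
final class KeywordFinder {
    private KeywordFinder() {}

    // Collapse every run of whitespace (spaces, tabs, line breaks) to a
    // single space, trim the ends, and lower-case the result.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    // Return the start offset of every occurrence of the term. Offsets
    // refer to the normalized text, not to the original string, and
    // overlapping occurrences are reported.
    static List<Integer> findAll(String extractedText, String term) {
        String haystack = normalize(extractedText);
        String needle = normalize(term);
        List<Integer> hits = new ArrayList<>();
        if (needle.isEmpty()) {
            return hits; // an empty search term matches nothing here
        }
        int from = 0;
        while (true) {
            int idx = haystack.indexOf(needle, from);
            if (idx < 0) {
                break;
            }
            hits.add(idx);
            from = idx + 1; // step by one so overlapping hits are kept
        }
        return hits;
    }
}
```

If findAll returns offsets for a word that the tool does not report, the 
problem is more likely in the tool's matching or position logic than in the 
text extraction itself.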

You can convert text back to PDF with the TextToPDF utility, which is also 
part of pdfbox-app.

> PDFBox recognizes only a few words
> ----------------------------------
>
>                 Key: PDFBOX-4431
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4431
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Documentation, Text extraction
>         Environment: OS: Windows 10.
> IDE: Oxygen.3a Release (4.7.3a)
> PDF version: Adobe Acrobat Pro DC - 2019.010.20069.49826
>            Reporter: Krutheeka Rajkumar
>            Priority: Major
>         Attachments: RS13170.pdf, RS13170.txt
>
>
> The code I have posted takes 5 arguments, which include the path to a 
> PDF document and a search term. The code is meant to parse the PDF document, 
> find all matches of the keyword, and return their locations in the format 
> given by the last argument.
> For some reason the code recognizes only a few words and fails on other 
> words. I am not sure why this is.
> There seems to be no difference between these words in terms of font, size, 
> location, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)