[jira] [Commented] (PDFBOX-4431) PDFBox recognizes only a few words

Krutheeka Rajkumar (JIRA) Thu, 10 Jan 2019 12:09:05 -0800


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16739732#comment-16739732
 ]


Krutheeka Rajkumar commented on PDFBOX-4431:
--------------------------------------------

[~tilman] The code parses a given PDF (one of the 5 arguments passed in) and 
matches its text against a search term (another of the arguments). For each 
match, the code returns the location (the position on the page). The issue is 
that the code picks up certain words (like "victoria") and returns all 
instances of the word in the PDF with accurate page numbers and text 
positions, but other words (like "the") are not picked up at all. Instead, the 
error "Hmmm, looks like something went wrong." is returned. 
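
For reference, the search described above can be sketched roughly as follows. 
This is not the attached code, only a minimal illustration under stated 
assumptions: TermLocator, Glyph, Hit and glyphsOf are hypothetical names, and 
the list of positioned characters stands in for what a PDFBox text stripper 
would supply (for example, the TextPosition list passed to an overridden 
PDFTextStripper.writeString). The match here is case-insensitive and runs over 
the concatenated characters, so a word is still found if it arrives split 
across several chunks. Whether either of those points is related to the 
failure reported here is not known.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of PDFBox and not the attached code.
class TermLocator {

    /** Stand-in for one positioned character produced by a text extractor. */
    record Glyph(char ch, int page, float x, float y) {}

    /** One match: page and position of the first character of the term. */
    record Hit(int page, float x, float y) {}

    /** Returns every case-insensitive occurrence of term in the glyph stream. */
    static List<Hit> find(List<Glyph> glyphs, String term) {
        List<Hit> hits = new ArrayList<>();
        if (term == null || term.isEmpty()) {
            return hits;
        }
        // Lowercase one char at a time so string index i maps to glyph i.
        StringBuilder sb = new StringBuilder(glyphs.size());
        for (Glyph g : glyphs) {
            sb.append(Character.toLowerCase(g.ch()));
        }
        String text = sb.toString();
        String needle = term.toLowerCase(Locale.ROOT);

        int from = 0;
        while (true) {
            int idx = text.indexOf(needle, from);
            if (idx < 0) {
                break;
            }
            Glyph first = glyphs.get(idx);
            hits.add(new Hit(first.page(), first.x(), first.y()));
            from = idx + 1; // continue after this match start
        }
        return hits;
    }

    /** Test helper: lays a string out on one line, 6 units per character. */
    static List<Glyph> glyphsOf(String s, int page) {
        List<Glyph> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add(new Glyph(s.charAt(i), page, i * 6.0f, 100.0f));
        }
        return out;
    }
}
```

Calling TermLocator.find(TermLocator.glyphsOf("The victoria and the queen", 1), 
"the") returns one Hit per occurrence, including the capitalized one. A term 
with no match gives an empty list rather than an error, which makes it easier 
to tell "no match" apart from a real failure.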

> PDFBox recognizes only a few words
> ----------------------------------
>
>                 Key: PDFBOX-4431
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4431
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Documentation, Text extraction
>         Environment: OS: Windows 10.
> IDE: Oxygen.3a Release (4.7.3a)
> PDF version: Adobe Acrobat Pro DC - 2019.010.20069.49826
>            Reporter: Krutheeka Rajkumar
>            Priority: Major
>         Attachments: RS13170.pdf, RS13170.txt
>
>
> The code I have posted takes 5 arguments, which include the location of a 
> PDF document and a search term. The code is meant to parse the PDF document, 
> find all matches of the keyword, and return their locations in the format 
> given by the last argument.
> For some reason the code recognizes only a few words and errors on other 
> words. I am not sure why this is.
> There seems to be no difference between these words in terms of font, size, 
> location, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

