Kabir Soneja created TIKA-4627:
----------------------------------

             Summary: Tika 3.2.2 text detection is detecting text which is not 
present in a document
                 Key: TIKA-4627
                 URL: https://issues.apache.org/jira/browse/TIKA-4627
             Project: Tika
          Issue Type: Bug
            Reporter: Kabir Soneja
         Attachments: no_word_count_no_page_count.docx

Hi, I am working on migrating from tika-parsers 1.28 to tika-core, 
tika-langdetect-optimaize and tika-parsers-standard-package 3.2.2.
 
During the migration, I noticed some differences in the detected text and 
the word count returned for a document, compared to the older Tika version.
 
For a document (attached in this ticket) with just an image, version 3.2.2 is 
detecting this text *"\nimage2.png\n\n\n\n"* which cannot be seen in the 
document. What could be the reason for this and is this intended? How can I 
avoid/handle such cases?
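
One possible way to handle such cases downstream, shown only as a minimal 
sketch: post-filter the extracted text before counting words. It assumes the 
extracted text is already available as a String and that the unwanted lines 
are bare image file names of the form seen above (image2.png). The class 
ExtractedTextFilter and its methods are hypothetical helpers and not part of 
Tika; the list of file extensions is likewise an assumption.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Tika: cleans text after extraction.
final class ExtractedTextFilter {

    // Assumption: an unwanted line holds nothing but an embedded image
    // file name such as "image2.png" (extension list is a guess).
    private static final Pattern IMAGE_NAME_LINE = Pattern.compile(
            "(?i)^\\s*image\\d+\\.(png|jpe?g|gif|bmp|emf|wmf|tiff?)\\s*$");

    private ExtractedTextFilter() {
    }

    // Drops lines that consist solely of an image file name, then trims
    // leading and trailing whitespace from the remaining text.
    static String stripImageNameLines(String extracted) {
        return Arrays.stream(extracted.split("\\R", -1))
                .filter(line -> !IMAGE_NAME_LINE.matcher(line).matches())
                .collect(Collectors.joining("\n"))
                .strip();
    }

    // Counts whitespace-separated words in the filtered text.
    static int wordCount(String extracted) {
        String cleaned = stripImageNameLines(extracted);
        return cleaned.isEmpty() ? 0 : cleaned.split("\\s+").length;
    }
}
```

With this filter, the string "\nimage2.png\n\n\n\n" reduces to an empty string 
and a word count of 0, while a sentence that merely mentions a file name is 
left untouched. It does not explain why the file name is emitted in the first 
place, so it is a workaround rather than a fix.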
 
Thanks



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
