Kabir Soneja created TIKA-4627:
----------------------------------
Summary: Tika 3.2.2 text detection is detecting text which is not
present in a document
Key: TIKA-4627
URL: https://issues.apache.org/jira/browse/TIKA-4627
Project: Tika
Issue Type: Bug
Reporter: Kabir Soneja
Attachments: no_word_count_no_page_count.docx
Hi, I am migrating from tika-parsers 1.28 to tika-core,
tika-langdetect-optimaize and tika-parsers-standard-package 3.2.2.
During the migration, I noticed differences in the detected text and the
word count returned for a document compared to the older Tika version.
For a document (attached to this ticket) that contains only an image, version
3.2.2 returns the text *"\nimage2.png\n\n\n\n"*, which is not visible anywhere
in the document. What causes this, and is it intended? How can I avoid or
handle such cases?
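
As a stopgap on my side, the returned text could be post-filtered before the word count is computed. The following is only a minimal sketch, assuming the stray text is always a bare image file name on a line of its own. The class name, method names, and extension list are hypothetical and not part of Tika, and the sketch does not address why the text is emitted in the first place:

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ExtractedTextFilter {
    // Assumption: a stray entry is a line holding only an image file name,
    // e.g. "image2.png". The extension list is illustrative, not exhaustive.
    private static final Pattern IMAGE_NAME_LINE =
            Pattern.compile("(?i)^\\s*[\\w.-]+\\.(png|jpe?g|gif|bmp|emf|wmf|tiff?)\\s*$");

    // Drops lines that look like embedded image names, then trims the result.
    public static String stripImageNames(String extracted) {
        return Arrays.stream(extracted.split("\\R"))
                .filter(line -> !IMAGE_NAME_LINE.matcher(line).matches())
                .collect(Collectors.joining("\n"))
                .trim();
    }

    // Counts whitespace-separated tokens; returns 0 for blank text.
    public static int wordCount(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        // The exact string reported above for the attached document.
        String extracted = "\nimage2.png\n\n\n\n";
        String cleaned = stripImageNames(extracted);
        System.out.println(wordCount(cleaned));
    }
}
```

A filter like this would also drop a legitimate line of body text that happens to consist solely of a file name, so I would prefer a supported way to suppress the name at extraction time if one exists.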
Thanks