[ 
https://issues.apache.org/jira/browse/TIKA-4627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18053406#comment-18053406
 ] 

Kabir Soneja commented on TIKA-4627:
------------------------------------

Thanks [~tilman] 

The issue is that the document contains only an image, but Tika parser 3.2.2 
detects the text *"\nimage2.png\n\n\n\n"*, which is not visible in the 
document. With Tika parser 1.28, no text was detected in this same document, 
so the word count was 0.
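
To make the effect on the word count concrete, here is a minimal sketch. It 
assumes the word count is computed by splitting the extracted text on 
whitespace, which this ticket does not state. The class and method names are 
hypothetical and not part of Tika, and the image-name filter is only a 
heuristic workaround on the caller's side. It does not explain why 3.2.2 emits 
the file name.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: counts words in already-extracted text.
class ExtractedTextWordCount {

    // Assumed heuristic: a token that is nothing but a file name
    // with a common image extension (case-insensitive).
    private static final Pattern IMAGE_NAME =
            Pattern.compile("(?i)[^\\s/\\\\]+\\.(png|jpe?g|gif|bmp|tiff?|emf|wmf)");

    // Naive count: every whitespace-separated token is treated as a word.
    static long countWords(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(token -> !token.isEmpty())
                .count();
    }

    // Same count, but tokens that look like image file names are skipped.
    static long countWordsIgnoringImageNames(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(token -> !token.isEmpty())
                .filter(token -> !IMAGE_NAME.matcher(token).matches())
                .count();
    }

    public static void main(String[] args) {
        // The text reported for the attached document in this ticket.
        String extracted = "\nimage2.png\n\n\n\n";
        System.out.println(countWords(extracted));                   // naive count
        System.out.println(countWordsIgnoringImageNames(extracted)); // filtered count
    }
}
```

With the string from this ticket, the naive count is 1 and the filtered count 
is 0. The filter would also drop a file name that an author typed into the 
body text on purpose, so it is a trade-off rather than a fix.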

> Tika 3.2.2 text detection is detecting text which is not present in a document
> ------------------------------------------------------------------------------
>
>                 Key: TIKA-4627
>                 URL: https://issues.apache.org/jira/browse/TIKA-4627
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Kabir Soneja
>            Priority: Major
>         Attachments: no_word_count_no_page_count.docx
>
>
> Hi, I am working on migrating from tika-parser 1.28 to tika-core, 
> tika-langdetect-optimaize and tika-parsers-standard-package 3.2.2.
>  
> During the migration, I am noticing some differences in the text detection 
> and word count returned for the document compared to the older Tika version.
>  
> For a document (attached to this ticket) with just an image, version 3.2.2 
> detects the text *"\nimage2.png\n\n\n\n"*, which cannot be seen in the 
> document. What could be the reason for this, and is it intended? How can I 
> avoid/handle such cases?
>  
> Thanks



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
