[jira] [Comment Edited] (TIKA-4627) Tika 3.2.2 text detection is detecting text which is not present in a document

Tilman Hausherr (Jira) Wed, 21 Jan 2026 13:35:15 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-4627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18053409#comment-18053409
 ]


Tilman Hausherr edited comment on TIKA-4627 at 1/21/26 9:34 PM:
----------------------------------------------------------------

image2.png is the name of the image. I compared the code of the two versions. 
Version 1 has this:
{code:java}
        String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
        if (name != null && name.length() > 0 && outputHtml) {
            handler.startElement(XHTML, "h1", "h1", new AttributesImpl());
            char[] chars = name.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement(XHTML, "h1", "h1");
        }
{code}
Version 3 has this:
{code:java}
        String name = metadata.get(TikaCoreProperties.RESOURCE_NAME_KEY);
        if (writeFileNameToContent && name != null && name.length() > 0 && outputHtml) {
            handler.startElement(XHTML, "h1", "h1", new AttributesImpl());
            char[] chars = name.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement(XHTML, "h1", "h1");
        }
{code}

writeFileNameToContent is true by default. It is configurable and was 
introduced in TIKA-3711.
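
To illustrate the effect of the flag, here is a minimal, self-contained sketch that uses only the JDK's SAX classes rather than Tika itself. The class and method names (FileNameHeaderDemo, TextCollector, writeNameHeader, extractText) are hypothetical and exist only for this illustration. The condition is copied from the version 3 snippet above, and outputHtml is assumed to be true:

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of Tika.
class FileNameHeaderDemo {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Stand-in for a text-only content handler: keeps character events only.
    static final class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Same condition as the version 3 snippet: the <h1> with the file name
    // is emitted only when writeFileNameToContent is true.
    static void writeNameHeader(ContentHandler handler, String name,
                                boolean writeFileNameToContent,
                                boolean outputHtml) throws SAXException {
        if (writeFileNameToContent && name != null && name.length() > 0 && outputHtml) {
            handler.startElement(XHTML, "h1", "h1", new AttributesImpl());
            char[] chars = name.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement(XHTML, "h1", "h1");
        }
    }

    // Returns the text a consumer would see (outputHtml assumed true).
    static String extractText(String name, boolean writeFileNameToContent) {
        TextCollector collector = new TextCollector();
        try {
            writeNameHeader(collector, name, writeFileNameToContent, true);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.text.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + extractText("image2.png", true) + "]");  // prints "[image2.png]"
        System.out.println("[" + extractText("image2.png", false) + "]"); // prints "[]"
    }
}
```

With the flag at its default of true, the file name ends up in the extracted text. With false, nothing is written for it.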

I found an XML config file that sets it to false:
{code:xml}
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
  </parsers>
  <autoDetectParserConfig>
    <spoolToDisk>123450</spoolToDisk>
    <outputThreshold>678900</outputThreshold>
    <embeddedDocumentExtractorFactory class="org.apache.tika.extractor.RUnpackExtractorFactory">
      <writeFileNameToContent>false</writeFileNameToContent>
    </embeddedDocumentExtractorFactory>
  </autoDetectParserConfig>
</properties>

{code}


> Tika 3.2.2 text detection is detecting text which is not present in a document
> ------------------------------------------------------------------------------
>
>                 Key: TIKA-4627
>                 URL: https://issues.apache.org/jira/browse/TIKA-4627
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Kabir Soneja
>            Priority: Major
>         Attachments: no_word_count_no_page_count.docx
>
>
> Hi, I am working on migrating from tika-parsers 1.28 to tika-core, 
> tika-langdetect-optimaize and tika-parsers-standard-package 3.2.2.
>  
> During the migration, I am noticing some differences in the text detection 
> and word count returned for the document compared to the older Tika version.
>  
> For a document (attached in this ticket) with just an image, version 3.2.2 is 
> detecting this text *"\nimage2.png\n\n\n\n"* which cannot be seen in the 
> document. What could be the reason for this and is this intended? How can I 
> avoid/handle such cases?
>  
> Thanks



--
This message was sent by Atlassian Jira
(v8.20.10#820010)