[ https://issues.apache.org/jira/browse/TIKA-4627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18053409#comment-18053409 ]
Tilman Hausherr commented on TIKA-4627:
---------------------------------------
image2.png is the name of the image. I compared the code of the two versions.
Version 1 has this:
{code:java}
if (name != null && name.length() > 0 && outputHtml) {
    handler.startElement(XHTML, "h1", "h1", new AttributesImpl());
    char[] chars = name.toCharArray();
    handler.characters(chars, 0, chars.length);
    handler.endElement(XHTML, "h1", "h1");
}
{code}
Version 3 has this:
{code:java}
String name = metadata.get(TikaCoreProperties.RESOURCE_NAME_KEY);
if (writeFileNameToContent && name != null && name.length() > 0 && outputHtml) {
    handler.startElement(XHTML, "h1", "h1", new AttributesImpl());
    char[] chars = name.toCharArray();
    handler.characters(chars, 0, chars.length);
    handler.endElement(XHTML, "h1", "h1");
}
{code}
writeFileNameToContent is true by default. It is configurable and was
introduced in TIKA-3711.
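
To show what the flag changes, here is a minimal, self-contained sketch that reproduces the guard from the version 3 snippet using only the SAX classes shipped with the JDK. The class FileNameHeadingDemo and its methods are hypothetical stand-ins for illustration, not Tika code, and the sketch assumes outputHtml is true. It does not show how the flag is set in a real Tika configuration.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for illustration; this is not a Tika class.
class FileNameHeadingDemo {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Same guard as the version 3 snippet, with the flag passed in explicitly.
    static void writeName(ContentHandler handler, String name,
                          boolean writeFileNameToContent, boolean outputHtml)
            throws SAXException {
        if (writeFileNameToContent && name != null && name.length() > 0 && outputHtml) {
            handler.startElement(XHTML, "h1", "h1", new AttributesImpl());
            char[] chars = name.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement(XHTML, "h1", "h1");
        }
    }

    // Collects only character events, roughly what a plain-text handler keeps.
    static String extractText(String name, boolean writeFileNameToContent) {
        StringBuilder text = new StringBuilder();
        ContentHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            // Assumption for this sketch: outputHtml is always true.
            writeName(handler, name, writeFileNameToContent, true);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + extractText("image2.png", true) + "]");  // prints [image2.png]
        System.out.println("[" + extractText("image2.png", false) + "]"); // prints []
    }
}
```

With the flag set to true the embedded file name ends up in the extracted text. With it set to false the h1 element is never emitted, so the name does not appear.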
> Tika 3.2.2 text detection is detecting text which is not present in a document
> ------------------------------------------------------------------------------
>
> Key: TIKA-4627
> URL: https://issues.apache.org/jira/browse/TIKA-4627
> Project: Tika
> Issue Type: Bug
> Reporter: Kabir Soneja
> Priority: Major
> Attachments: no_word_count_no_page_count.docx
>
>
> Hi, I am working on migrating from tika-parsers 1.28 to tika-core,
> tika-langdetect-optimaize and tika-parsers-standard-package 3.2.2.
>
> During the migration, I am noticing some differences in the text detection
> and word count returned for the document compared to the older Tika version.
>
> For a document (attached in this ticket) with just an image, version 3.2.2 is
> detecting this text *"\nimage2.png\n\n\n\n"* which cannot be seen in the
> document. What could be the reason for this and is this intended? How can I
> avoid/handle such cases?
>
> Thanks