[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517167#comment-17517167
 ] 

Sam Stephens commented on TIKA-3711:
------------------------------------

Regarding filenames, I don't think they will ever be semantically meaningful. I 
just created a document with Word 365 (uploaded as 
word-doc-with-image-from-word-365.docx), added a picture with the filename 
test-image.png, and the extracted filename is still image1.png. I think Word is 
creating non-interesting filenames.

As for breaking other users: I'm raising this bug because this *is* a change 
in behavior, and it has broken my application. I rely on the Tika 2.2.0 
behavior, where images are not part of the text.

Looking at the XHTML actually generated via ToXMLContentHandler, I think the 
markup is also going to surprise most users.

{{<p><img src="embedded:image.png" alt="" /></p>}}
{{<div class="package-entry"><h1>image.png</h1>}}
{{</div>}}

My image actually has alt text, but it is not included. I also think that 
including the image file name as a heading in the markup will surprise almost 
every user. It certainly doesn't match the source document (which has no 
headings, or visible text of any kind).
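
To illustrate what "text without the image entry" could look like, here is a 
minimal sketch that uses only the JDK's SAX parser to collect text from the 
XHTML while skipping anything inside a div with class "package-entry", as in 
the markup above. The class and method names are hypothetical and not part of 
Tika, and it assumes the XHTML is well-formed XML.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not a Tika class: collects character data but
// ignores everything nested inside <div class="package-entry">.
final class PackageEntryTextFilter extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // > 0 while inside a package-entry div

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (skipDepth > 0) {
            skipDepth++; // nested element inside a skipped div
        } else if ("div".equals(localName) && "package-entry".equals(atts.getValue("class"))) {
            skipDepth = 1; // start skipping at the package-entry div
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--; // reaches 0 when the package-entry div closes
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    static String visibleText(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so localName is populated
        PackageEntryTextFilter handler = new PackageEntryTextFilter();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<body><p>Hello</p>"
                + "<p><img src=\"embedded:image.png\" alt=\"\" /></p>"
                + "<div class=\"package-entry\"><h1>image.png</h1></div></body>";
        System.out.println(visibleText(xhtml)); // prints "Hello"
    }
}
```

This is only a post-processing workaround on the caller's side; it does not 
change what the parser itself emits.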

As an end user, what I'd like is for the XHTML to be:

{{<p><img src="embedded:image.png" alt="Test Alt Text" /></p>}}

I'd also like the text from BodyContentHandler not to include the image at 
all. That way the text is just the text, and if I'm interested in image alt 
text, I can operate on the XHTML.
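
As a rough sketch of what "operating on the XHTML" could mean, the following 
uses only the JDK's SAX parser to pull the alt attribute from every img 
element. It assumes the XHTML carries the alt text as requested above; the 
class and method names are hypothetical and not part of Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not a Tika class.
final class ImgAltExtractor {

    /** Returns the non-empty alt attribute of every img element, in document order. */
    static List<String> altTexts(String xhtml) throws Exception {
        List<String> alts = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so localName is populated
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("img".equals(localName)) {
                    String alt = atts.getValue("alt");
                    if (alt != null && !alt.isEmpty()) {
                        alts.add(alt);
                    }
                }
            }
        });
        return alts;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<p><img src=\"embedded:image.png\" alt=\"Test Alt Text\" /></p>";
        System.out.println(altTexts(xhtml)); // prints [Test Alt Text]
    }
}
```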

If you want to offer an option that provides text for an image, I don't think 
filenames from Word will ever be useful; alt text is semantically the right 
place to look for a textual representation of an image.

> Image file names included in parsed Word Document text
> ------------------------------------------------------
>
>                 Key: TIKA-3711
>                 URL: https://issues.apache.org/jira/browse/TIKA-3711
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.3.0
>            Reporter: Sam Stephens
>            Priority: Minor
>         Attachments: word-doc-with-image-from-word-365.docx, 
> word-doc-with-image.docx
>
>
> The attached Word document includes nothing but a single image. Running it 
> through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it 
> through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
