[ https://issues.apache.org/jira/browse/TIKA-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538293#comment-17538293 ]
Natalia Escalera commented on TIKA-3765:
----------------------------------------
Ah, that's what's going on. Thanks for the explanation! Configuring the
extractor to skip writing embedded file names to the content allows us to use
the latest version.
Would it make sense to change the default value for this option to false?

> setExtractAnnotationText setting ignored after v2.3.0
> -----------------------------------------------------
>
> Key: TIKA-3765
> URL: https://issues.apache.org/jira/browse/TIKA-3765
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.3.0, 2.4.0
> Reporter: Natalia Escalera
> Priority: Minor
> Attachments: TIKA-3765.pptx
>
>
> Prior to version 2.3.0, setting setExtractAnnotationText to false on the
> PDFParserConfig kept image annotations in a PowerPoint document out of the
> extracted content. This is no longer the case in Tika >= 2.3.0:
>
> {code:java}
> content == "Test Test 1 "
> |       |
> |       false
> |       11 differences (52% similarity)
> |       Test Test 1 (image1.jpg )
> |       Test Test 1 (-----------)
> Test Test 1 image1.jpg
> {code}
>
> This issue can be easily reproduced by creating a pptx document with an image
> and calling Tika with setExtractAnnotationText(false).
>
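For readers puzzled by the "11 differences (52% similarity)" line in the failed
assertion, the sketch below reproduces both figures from the two strings in the
report. It assumes the differences are a plain Levenshtein edit distance and
that similarity is (longer length - distance) / longer length, truncated to a
whole percent. That formula is an assumption about the test framework rather
than something the report states, and the class and method names (DiffStats,
distance, similarityPercent) are hypothetical.

```java
// Sketch only: assumes the diff figures are a Levenshtein distance and
// similarity = (longer length - distance) / longer length.
final class DiffStats {

    /** Two-row dynamic-programming Levenshtein distance. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity as a whole percentage, truncated toward zero. */
    static int similarityPercent(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 100; // two empty strings are identical
        }
        return (longest - distance(a, b)) * 100 / longest;
    }

    public static void main(String[] args) {
        String expected = "Test Test 1 ";          // 12 characters
        String actual = "Test Test 1 image1.jpg "; // 23 characters
        // prints "11 differences (52% similarity)"
        System.out.println(distance(actual, expected) + " differences ("
                + similarityPercent(actual, expected) + "% similarity)");
    }
}
```

Under those assumptions the whole mismatch is the 11 extra characters of
"image1.jpg " (file name plus trailing space), which is the embedded file name
being written into the content.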
--
This message was sent by Atlassian Jira
(v8.20.7#820007)