[ https://issues.apache.org/jira/browse/TIKA-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537844#comment-17537844 ]
Tim Allison commented on TIKA-3765:
-----------------------------------
Thank you for opening this issue. Can you share an example file? I'm having
trouble understanding how a PDF parser config setting has an effect on a PPT.
I trust you're seeing what you're seeing. Something may have changed in the
underlying POI or in our parser around it.
> setExtractAnnotationText setting ignored after v2.3.0
> -----------------------------------------------------
>
> Key: TIKA-3765
> URL: https://issues.apache.org/jira/browse/TIKA-3765
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.3.0, 2.4.0
> Reporter: Natalia Escalera
> Priority: Minor
>
> Prior to version 2.3.0, calling setExtractAnnotationText(false) on the
> PDFParserConfig excluded image annotations from the text extracted from a
> PowerPoint document. This is no longer the case in Tika >= 2.3.0:
>
> {code:java}
> content == "Test Test 1 "
> |       |
> |       false
> |       11 differences (52% similarity)
> |       Test Test 1 (image1.jpg )
> |       Test Test 1 (-----------)
> Test Test 1 image1.jpg
> {code}
>
> This issue can be easily reproduced by creating a PPTX document that contains
> an image and parsing it with Tika after calling setExtractAnnotationText(false).
>