[jira] [Commented] (TIKA-3765) setExtractAnnotationText setting ignored after v2.3.0

Natalia Escalera (Jira) Tue, 17 May 2022 08:50:04 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538293#comment-17538293
 ]


Natalia Escalera commented on TIKA-3765:
----------------------------------------

Ah, that's what's going on. Thanks for the explanation! Configuring the 
extractor to skip writing embedded file names to the content allows us to use 
the latest version. 

Would it make sense to change the default value for this option to false?  

> setExtractAnnotationText setting ignored after v2.3.0
> -----------------------------------------------------
>
>                 Key: TIKA-3765
>                 URL: https://issues.apache.org/jira/browse/TIKA-3765
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.3.0, 2.4.0
>            Reporter: Natalia Escalera
>            Priority: Minor
>         Attachments: TIKA-3765.pptx
>
>
> Prior to version 2.3.0, setting PDFParserConfig's setExtractAnnotationText 
> to false caused image annotations in a PowerPoint document to be ignored.
> This is no longer the case in Tika >= 2.3.0:
>  
> {code:java}
> content == "Test Test 1 "
> |       |
> |       false
> |       11 differences (52% similarity)
> |       Test Test 1 (image1.jpg )
> |       Test Test 1 (-----------)
> Test Test 1 image1.jpg
> {code}
>  
> This issue can be easily reproduced by creating a pptx document with an image 
> and calling Tika with setExtractAnnotationText(false).
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
