[jira] [Commented] (TIKA-3765) setExtractAnnotationText setting ignored after v2.3.0

Natalia Escalera (Jira) Tue, 17 May 2022 06:27:06 -0700


    [ https://issues.apache.org/jira/browse/TIKA-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538187#comment-17538187 ]


Natalia Escalera commented on TIKA-3765:
----------------------------------------

Hey Tim,

Thanks for the quick reply! Here is a test file: [^TIKA-3765.pptx]

I'm using AutoDetectParser in the method that extracts the document's text, 
and the only configuration passed to the ParseContext is the 
PDFParserConfig:

 
{code:java}
public PDFParserConfig pdfConfig() {
  PDFParserConfig config = new PDFParserConfig();
  config.setExtractAnnotationText(false);
  return config;
}
{code}

{code:java}
...
AutoDetectParser parser = new AutoDetectParser();
ParseContext parseContext = new ParseContext();
parseContext.set(Parser.class, parser);
parseContext.set(PDFParserConfig.class, pdfConfig());
Metadata tikaMetadata = new Metadata();
try {
  parser.parse(objectData, handler, tikaMetadata, parseContext);
...
{code}
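
For anyone following along: ParseContext stores objects keyed by class, so a config registered under PDFParserConfig.class is only seen by code that looks it up under that same class. The sketch below is a stdlib-only illustration of that lookup pattern, not Tika's implementation and not a diagnosis of this issue. TypedContextSketch, Context, and PdfConfig are hypothetical stand-ins invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

public class TypedContextSketch {

    /** Hypothetical stand-in for a parser-specific config object. */
    public static class PdfConfig {
        public boolean extractAnnotationText = true;
    }

    /** Hypothetical stand-in for a class-keyed parse context. */
    public static class Context {
        private final Map<String, Object> entries = new HashMap<>();

        /** Registers a value under the name of its class. */
        public <T> void set(Class<T> key, T value) {
            entries.put(key.getName(), value);
        }

        /** Returns the registered value, or the default if none was set. */
        public <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key.getName());
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /** Returns the flag a consumer sees when it asks the context for PdfConfig. */
    public static boolean annotationTextEnabled(Context context) {
        PdfConfig fallback = new PdfConfig(); // defaults apply when nothing is registered
        return context.get(PdfConfig.class, fallback).extractAnnotationText;
    }

    public static void main(String[] args) {
        Context context = new Context();
        System.out.println(annotationTextEnabled(context)); // prints "true"

        PdfConfig config = new PdfConfig();
        config.extractAnnotationText = false;
        context.set(PdfConfig.class, config);
        System.out.println(annotationTextEnabled(context)); // prints "false"
    }
}
```

In this pattern a consumer that never asks the context for PdfConfig is unaffected by the flag, which is why it matters which parser ends up handling a given file.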
 

Let me know if you need further details from me.

 

> setExtractAnnotationText setting ignored after v2.3.0
> -----------------------------------------------------
>
>                 Key: TIKA-3765
>                 URL: https://issues.apache.org/jira/browse/TIKA-3765
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.3.0, 2.4.0
>            Reporter: Natalia Escalera
>            Priority: Minor
>         Attachments: TIKA-3765.pptx
>
>
> Prior to version 2.3.0, setting the PDFParserConfig setExtractAnnotationText 
> to false caused the image annotations in a PowerPoint document to be ignored.
> This is no longer the case in Tika >= 2.3.0:
>  
> {code:java}
> content == "Test Test 1 "
> |       |
> |       false
> |       11 differences (52% similarity)
> |       Test Test 1 (image1.jpg )
> |       Test Test 1 (-----------)
> Test Test 1 image1.jpg {code}
>  
> This issue can be easily reproduced by creating a pptx document with an image 
> and calling Tika with setExtractAnnotationText(false).
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
