[ 
https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382125#comment-16382125
 ] 

Tim Allison edited comment on TIKA-2593 at 3/1/18 3:09 PM:
-----------------------------------------------------------

bq. I think I did figure it out. I need to set 
officeParserConfig.setUseSAXDocxExtractor(true);

Sorry for not responding sooner... IIRC, we can't yet remove deleted content 
with our regular DOM parser, so you do have to use the SAXDocx parser.

bq. But still doesn't work for 
officeParserConfig.setIncludeShapeBasedContent(false);

If {{setIncludeShapeBasedContent}} is set to false, are you saying that deleted 
content comes through?!
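
For anyone following along, here is a rough, self-contained sketch of why a streaming pass makes dropping deletions straightforward. It is not Tika's implementation: it uses only the JDK's SAX parser, and the class name {{TrackedChangeSketch}}, the {{includeDeleted}} flag and the hand-written sample paragraph are made up for illustration. The only thing it relies on from the file format is that WordprocessingML stores tracked deletions in {{w:delText}} elements and ordinary run text in {{w:t}} elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: a hypothetical stand-in, not Tika's SAX docx extractor.
class TrackedChangeSketch {

    static final String W =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written paragraph mirroring the example in the issue description:
    // "This part will" is a tracked deletion, "This is inserted." an insertion.
    static final String SAMPLE =
            "<w:p xmlns:w=\"" + W + "\">"
          + "<w:r><w:t xml:space=\"preserve\">This is a sample text. </w:t></w:r>"
          + "<w:del><w:r><w:delText>This part will</w:delText></w:r></w:del>"
          + "<w:r><w:t xml:space=\"preserve\"> be deleted. </w:t></w:r>"
          + "<w:ins><w:r><w:t>This is inserted.</w:t></w:r></w:ins>"
          + "</w:p>";

    /** Collects run text; w:delText is kept only when includeDeleted is true. */
    static class RunTextHandler extends DefaultHandler {
        private final boolean includeDeleted;
        private final StringBuilder out = new StringBuilder();
        private boolean capture = false; // true while inside text we want to keep

        RunTextHandler(boolean includeDeleted) {
            this.includeDeleted = includeDeleted;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            if (!W.equals(uri)) {
                return;
            }
            if ("t".equals(localName)) {
                capture = true;            // normal or inserted text
            } else if ("delText".equals(localName)) {
                capture = includeDeleted;  // deleted text, kept only on request
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (W.equals(uri)
                    && ("t".equals(localName) || "delText".equals(localName))) {
                capture = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (capture) {
                out.append(ch, start, length);
            }
        }
    }

    /** Parses the given XML and returns the collected text. */
    static String extract(String xml, boolean includeDeleted) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so localName/uri are populated
            RunTextHandler handler = new RunTextHandler(includeDeleted);
            factory.newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(SAMPLE, true));
        System.out.println(extract(SAMPLE, false));
    }
}
```

With {{includeDeleted}} set to false, the sample yields the "Desired output" line from the issue description; with it set to true, it yields the output the reporter is currently seeing.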



was (Author: talli...@mitre.org):
bq. I think I did figure it out. I need to set 
officeParserConfig.setUseSAXDocxExtractor(true);

Sorry for not responding...IIRC, we can't yet remove deleted contents with our 
regular DOM parser, so you do have to use the SAXDocx parser.

bq. But still doesn't work for 
officeParserConfig.setIncludeShapeBasedContent(false);

If {{setIncludeShapeBasedContent}} is set to false, are you saying that deleted 
content comes through?!


> docx with track change producing incorrect output
> -------------------------------------------------
>
>                 Key: TIKA-2593
>                 URL: https://issues.apache.org/jira/browse/TIKA-2593
>             Project: Tika
>          Issue Type: Bug
>          Components: core, handler
>    Affects Versions: 1.17
>            Reporter: Md
>            Priority: Major
>         Attachments: sample.docx
>
>
> I am using the following code to extract text from a docx file:
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> ContentHandler contentHandler = new BodyContentHandler();
> InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> Metadata metadata = new Metadata();
> // Ask the office parsers to leave out text deleted under track changes.
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setIncludeDeletedContent(false);
> // The snippet as posted never declared parseContext.
> ParseContext parseContext = new ParseContext();
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> parser.parse(inputStream, contentHandler, metadata, parseContext);
> System.out.println(contentHandler.toString());
> {code}
> When I send files with tracked changes, the output contains the deleted text 
> along with the unchanged and inserted text. Is there a way to tell the parser 
> to exclude the deleted text?
> Here is an example:
> Input text: This is a sample text. -This part will- be deleted. +This is 
> inserted.+
> Actual output: This is a sample text. This part will be deleted. This is 
> inserted.
> Desired output: This is a sample text.  be deleted. This is inserted.



