[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

Tim Allison (JIRA) Thu, 01 Mar 2018 07:58:38 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382207#comment-16382207
 ]


Tim Allison commented on TIKA-2593:
-----------------------------------

bq. Which shapes are being extracted, are you able to share an example?  
TIKA-1945
Ok, that's diagram data, and you're right, "ShapeBasedContent" so far only 
means text boxes in docx.  Fellow devs, any problem if we call diagram data 
"ShapeBasedContent"?

bq. And comments were coming out like "Comment by <name of the commenter>: 
<comment>", but the problem with that parser is that it can't extract from a zip 
file. Is it possible to extract from a zip file using RecursiveParserWrapper? 
Or is there a way I can use BasicContentHandlerFactory.HANDLER_TYPE.TEXT with 
AutoDetectParser so that I can get the comments?

I'm confused.  Which parser can't extract from a zip file?  Yes, the 
RecursiveParserWrapper should handle zip files. If it can't, please open a 
separate issue!  I'm not sure what zip files and comments have to do with each 
other.


> docx with track change producing incorrect output
> -------------------------------------------------
>
>                 Key: TIKA-2593
>                 URL: https://issues.apache.org/jira/browse/TIKA-2593
>             Project: Tika
>          Issue Type: Bug
>          Components: core, handler
>    Affects Versions: 1.17
>            Reporter: Md
>            Priority: Major
>         Attachments: sample.docx
>
>
> I am using the following code to extract text from a docx file:
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> ContentHandler contentHandler = new BodyContentHandler();
> InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> Metadata metadata = new Metadata();
> // Ask the Office parsers to leave out content marked as deleted
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setIncludeDeletedContent(false);
> // Hand the config to the parser through the ParseContext
> ParseContext parseContext = new ParseContext();
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> parser.parse(inputStream, contentHandler, metadata, parseContext);
> System.out.println(contentHandler.toString());
> {code}
> When I send files with tracked changes, the output contains the deleted text 
> along with the unchanged and inserted text. Is there a way to tell the parser 
> to exclude the deleted text?
> Here is an example:
> Input text: This is a sample text. -This part will- be deleted. +This is 
> inserted.+
> Output text: This is a sample text. This part will be deleted. This is 
> inserted.
> Desired output: This is a sample text.  be deleted. This is inserted.
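
To make the difference between the reported output and the desired output concrete, here is a toy sketch that is independent of Tika. It assumes the -...- and +...+ markers from the example stand in for deleted and inserted runs. The class name TrackChangeSketch and the method render are hypothetical and exist only for illustration; a real docx stores tracked changes as w:del and w:ins elements in the document XML, not as text markers.

```java
// Toy model only: "-...-" marks a deleted run and "+...+" an inserted run,
// as in the example from the issue description. This is not Tika code.
class TrackChangeSketch {

    // Hypothetical helper: renders marked-up text with or without deleted runs.
    static String render(String marked, boolean includeDeleted) {
        // Deleted runs: keep the inner text, or drop the whole run.
        String step = marked.replaceAll("-([^-]*)-", includeDeleted ? "$1" : "");
        // Inserted runs are always kept; only the markers are removed.
        return step.replaceAll("\\+([^+]*)\\+", "$1");
    }

    public static void main(String[] args) {
        String input = "This is a sample text. -This part will- be deleted. +This is inserted.+";
        // Reported behaviour: deleted text is still present.
        System.out.println(render(input, true));
        // Desired behaviour: deleted text is gone, inserted text stays.
        System.out.println(render(input, false));
    }
}
```

With includeDeleted set to false, the sketch yields the desired output from the report, including the double space left where the deleted run used to be. The marker syntax is deliberately naive (a hyphenated word would confuse it), so it should be read as an illustration of the expected result rather than as a parsing approach.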



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
