[ https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382207#comment-16382207 ]
Tim Allison commented on TIKA-2593:
-----------------------------------

bq. Which shapes are being extracted, are you able to share an example? TIKA-1945

Ok, that's diagram data, and you're right, "ShapeBasedContent" so far only means text boxes in docx. Fellow devs, any problem if we call diagram data "ShapeBasedContent"?

bq. And comments were coming like "Comment by <name of the commentor>: <comment>" but the problem of that parser is that it can't extract from zip file. Is it possible to extract from zip file by using RecursiveParserWrapper? or is there a way I can use BasicContentHandlerFactory.HANDLER_TYPE.TEXT with auto detect parser so that I can get the comment ?

I'm confused. Which parser can't extract from zip files? Yes, the RecursiveParserWrapper should handle zip files... if it can't, please open a separate issue! I'm not sure what zip files and comments have to do with each other.

> docx with track change producing incorrect output
> -------------------------------------------------
>
>                 Key: TIKA-2593
>                 URL: https://issues.apache.org/jira/browse/TIKA-2593
>             Project: Tika
>          Issue Type: Bug
>          Components: core, handler
>    Affects Versions: 1.17
>            Reporter: Md
>            Priority: Major
>         Attachments: sample.docx
>
>
> I am using the following code to extract text from a docx file:
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> ContentHandler contentHandler = new BodyContentHandler();
> InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> Metadata metadata = new Metadata();
> // Ask the Office parsers to leave out text marked as deleted
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setIncludeDeletedContent(false);
> // The ParseContext was never declared in the original snippet
> ParseContext parseContext = new ParseContext();
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> parser.parse(inputStream, contentHandler, metadata, parseContext);
> System.out.println(contentHandler.toString());
> {code}
> When I send files with tracked changes, the output contains the deleted
> text along with the actual text and the inserted text. Is there a way to
> tell the parser to exclude the deleted text?
> Here is an example:
> Input text: This is a sample text. -This part will- be deleted. +This is
> inserted.+
> Output text: This is a sample text. This part will be deleted. This is
> inserted.
> Desired output: This is a sample text. be deleted. This is inserted.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
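
To pin down the difference between the reported output and the desired output in the quoted example, here is a minimal, self-contained sketch. It does not use Tika at all: the class and method names are hypothetical, and it simply treats the `-...-` and `+...+` notation from the report as markers for deleted and inserted runs. It is only a way to state the expectation as a check, assuming the marked text contains no other hyphens or plus signs.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: models the example from the report.
class TrackChangeExpectation {
    // "-...-" marks deleted text and "+...+" marks inserted text (report notation only).
    private static final Pattern DELETED = Pattern.compile("-([^-]*)-");
    private static final Pattern INSERTED = Pattern.compile("\\+([^+]*)\\+");

    // Collapse the whitespace left behind once markers or runs are removed.
    private static String tidy(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    // Reported behaviour: deleted runs are kept, only the markers disappear.
    static String withDeleted(String marked) {
        String s = DELETED.matcher(marked).replaceAll("$1");
        return tidy(INSERTED.matcher(s).replaceAll("$1"));
    }

    // Desired behaviour: deleted runs are dropped, inserted runs are kept.
    static String withoutDeleted(String marked) {
        String s = DELETED.matcher(marked).replaceAll("");
        return tidy(INSERTED.matcher(s).replaceAll("$1"));
    }

    public static void main(String[] args) {
        String input = "This is a sample text. -This part will- be deleted. +This is inserted.+";
        System.out.println(withDeleted(input));
        System.out.println(withoutDeleted(input));
    }
}
```

A check like this could sit next to a real extraction test: the string returned by `withoutDeleted` is what the reporter expects the parser to produce for sample.docx when deleted content is excluded.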