[ 
https://issues.apache.org/jira/browse/TIKA-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15709719#comment-15709719
 ] 

Tim Allison commented on TIKA-2036:
-----------------------------------

As part of TIKA-1321, I added a new experimental SAXParser that processes the 
document.xml inside .docx files directly.  With this new parser, you can turn 
off extraction of deleted content.  You can also turn on extraction of 
"moveFrom" runs.

This new parser only works with .docx, not .doc.  Please give it a try and let 
us know how it is working for you.
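
For readers unfamiliar with how tracked deletions appear in document.xml, 
here is a minimal sketch of the general idea using only the JDK's SAX API.  
It is not Tika's implementation: the class name DocxTextSketch, the extract 
method, and its two boolean flags are hypothetical names chosen for 
illustration.  It assumes the usual WordprocessingML layout, where live text 
sits in w:t elements, deleted text sits in w:delText elements, and the source 
of a tracked move is wrapped in w:moveFrom.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; this is not Tika's parser.
class DocxTextSketch {
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    static final String W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of a document.xml string, optionally keeping
    // deleted text (w:delText) and text inside w:moveFrom.
    static String extract(String xml, boolean includeDeleted,
                          boolean includeMoveFrom) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();
        StringBuilder out = new StringBuilder();

        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            int moveFromDepth = 0;   // > 0 while inside w:moveFrom
            boolean capture = false; // true while inside a text element we keep

            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                if (!W.equals(uri)) return;
                if (local.equals("moveFrom")) {
                    moveFromDepth++;
                } else if (local.equals("t")) {
                    capture = moveFromDepth == 0 || includeMoveFrom;
                } else if (local.equals("delText")) {
                    capture = includeDeleted;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (!W.equals(uri)) return;
                if (local.equals("moveFrom")) {
                    moveFromDepth--;
                } else if (local.equals("t") || local.equals("delText")) {
                    capture = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (capture) out.append(ch, start, length);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:p xmlns:w=\"" + W + "\">"
            + "<w:r><w:t>9. </w:t></w:r>"
            + "<w:del><w:r><w:delText>Old clause. </w:delText></w:r></w:del>"
            + "<w:r><w:t>(Intentionally omitted.)</w:t></w:r></w:p>";
        // Deleted content and moveFrom runs both turned off.
        System.out.println(extract(xml, false, false));
    }
}
```

Because the deleted run carries its text in w:delText rather than w:t, a 
streaming handler can drop it without tracking the enclosing w:del element.  
An insertion that was later deleted is skipped by the same rule, under the 
same layout assumption.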

> Deleted Text from Word File Shows Up in Extract
> -----------------------------------------------
>
>                 Key: TIKA-2036
>                 URL: https://issues.apache.org/jira/browse/TIKA-2036
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.13
>         Environment: Windows, under TikaOnDotNet
>            Reporter: Steve Gullion
>              Labels: word
>
> A .docx file, with "track changes" on, includes deleted text. In this case, 
> there are two overlapping deletions:
> 9.    [DELETED:This Agreement shall be governed by and construed in 
> accordance with [INSERTED, THEN DELETED:Arizona] New York law] (Intentionally 
> omitted.)
> The text should only include "9. (Intentionally omitted)". However, the 
> output is "9. This Agreement shall be governed and construed in accordance 
> with New York law." So it recognizes "Arizona" as deleted, but not the rest 
> of it.
> Edit: this is worse than I originally thought. ALL deleted text is showing up 
> in text exported from other Word docs. I saw this reported in 2011, and there 
> was supposedly a patch, but apparently it doesn't work, or something else was 
> changed. Is there an option somewhere that provides for the exclusion of 
> deleted text generally?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
