[jira] [Commented] (TIKA-207) MS word doc containing tracked changes produces incorrect text

Md (JIRA) Wed, 28 Feb 2018 09:56:19 -0800

    [ https://issues.apache.org/jira/browse/TIKA-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16380752#comment-16380752 ]


Md commented on TIKA-207:
-------------------------

I am using Tika 1.17, but it still extracts deleted text from files with 
tracked revisions. Is there a way to exclude deleted text from such files?

 

> MS word doc containing tracked changes produces incorrect text
> --------------------------------------------------------------
>
>                 Key: TIKA-207
>                 URL: https://issues.apache.org/jira/browse/TIKA-207
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.3
>         Environment: tika-0.3-standalone.jar
>            Reporter: Michael McCandless
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.10
>
>         Attachments: TIKA-207.patch, TIKA-207.patch
>
>
> Spinoff from this discussion:
>   
> http://n2.nabble.com/getting-text-from-MS-Word-docs-with-tracked-changes...-td2463811.html
> When extracting text from an MS Word doc (2003 format) that has
> unapproved pending changes, the text from both old and new is glommed
> together.
> E.g., I had a doc that contained the text "Field.Index.TOKENIZED", and I
> changed TOKENIZED to ANALYZED with track changes enabled, and
> then when I extract text (using TikaCLI) it produces this:
>   Field.Index.TOKENIZEDANALYZED
> So, first, it'd be nice to at least get whitespace inserted between
> old & new text.
> And, second, it'd be great to have an option to control whether it's
> old or new text that's indexed (or at least an option to only see
> "new" text, i.e. the current document).
> From the discussion above, it seems like POI may expose the
> fine-grained APIs to allow Tika to do this; it's just that Tika's not
> leveraging these APIs  for MS Word docs.
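
For illustration only, the behaviour requested above can be sketched in plain
Java without Tika or POI on the classpath. The Run class and the method names
below are hypothetical stand-ins, not Tika or POI API. They assume the parser
can see, for each run of text, whether it is marked as deleted or inserted;
POI's HWPF CharacterRun exposes isMarkedDeleted() and isMarkedInserted() for
this purpose, but how Tika would wire them in is not shown here.

```java
import java.util.Arrays;
import java.util.List;

class TrackedChangesSketch {

    /** Hypothetical stand-in for one run of text plus its revision flags. */
    static final class Run {
        final String text;
        final boolean deleted;   // run was removed by a pending tracked change
        final boolean inserted;  // run was added by a pending tracked change

        Run(String text, boolean deleted, boolean inserted) {
            this.text = text;
            this.deleted = deleted;
            this.inserted = inserted;
        }
    }

    /** Naive extraction: concatenates every run, old and new alike. */
    static String allText(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) {
            sb.append(r.text);
        }
        return sb.toString();
    }

    /** "New" text (the current document): skips runs marked as deleted. */
    static String currentText(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) {
            if (!r.deleted) {
                sb.append(r.text);
            }
        }
        return sb.toString();
    }

    /** "Old" text (before the pending changes): skips runs marked as inserted. */
    static String originalText(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) {
            if (!r.inserted) {
                sb.append(r.text);
            }
        }
        return sb.toString();
    }

    /** The example from the issue description, modelled as three runs. */
    static List<Run> example() {
        return Arrays.asList(
                new Run("Field.Index.", false, false),
                new Run("TOKENIZED", true, false),
                new Run("ANALYZED", false, true));
    }

    public static void main(String[] args) {
        List<Run> runs = example();
        System.out.println(allText(runs));      // Field.Index.TOKENIZEDANALYZED
        System.out.println(currentText(runs));  // Field.Index.ANALYZED
        System.out.println(originalText(runs)); // Field.Index.TOKENIZED
    }
}
```

The first line of output reproduces the glommed text from the report; the
other two show what an "only new" or "only old" option would yield for the
same runs.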



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
