[jira] [Commented] (TIKA-3970) Certain OneNote documents produce duplicate text

Tim Allison (Jira) Tue, 21 Feb 2023 10:36:07 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691745#comment-17691745
 ]


Tim Allison commented on TIKA-3970:
-----------------------------------

The first page is repeated because it is included in both 
ElementChildNodesOfVersionHistory -- a list of the child elements for a page -- 
AND StructureElementChildNodes -- a reference to the title node of a page.

The current code walks the entire tree and processes every element it finds. 
To fix this, we'd need to add a bit more processing to avoid extracting text 
from the StructureElementChildNodes if the single element in there has already 
been processed (or is going to be processed).
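
As a rough illustration of the idea above, the sketch below walks a toy tree 
and remembers which nodes it has already emitted, so a node that is reachable 
both as a regular child and through a title reference is extracted only once. 
Everything here is hypothetical: the Node class, its children and titleRefs 
lists, and the extract/walk methods are stand-ins invented for this example 
and are not Tika's actual OneNote parser classes or API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DedupWalkSketch {

    /** Hypothetical stand-in for a OneNote tree node; not a real Tika class. */
    static final class Node {
        final String id;    // assumed unique per node (e.g. a GUID-like key)
        final String text;  // may be null for purely structural nodes
        // Models a child list such as ElementChildNodesOfVersionHistory.
        final List<Node> children = new ArrayList<>();
        // Models a reference list such as StructureElementChildNodes.
        final List<Node> titleRefs = new ArrayList<>();

        Node(String id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    /** Collects the text of every node reachable from root, once per node. */
    static List<String> extract(Node root) {
        List<String> out = new ArrayList<>();
        walk(root, new HashSet<>(), out);
        return out;
    }

    private static void walk(Node node, Set<String> seen, List<String> out) {
        // Set.add returns false if the id was already present: skip repeats.
        if (!seen.add(node.id)) {
            return;
        }
        if (node.text != null) {
            out.add(node.text);
        }
        // Regular children first, so text appears at its "real" position.
        for (Node child : node.children) {
            walk(child, seen, out);
        }
        // References are followed last and are ignored if already emitted.
        for (Node ref : node.titleRefs) {
            walk(ref, seen, out);
        }
    }

    public static void main(String[] args) {
        Node title = new Node("title-1", "Sunday Morning");
        Node page = new Node("page-1", null);
        page.children.add(title);   // listed as a child element ...
        page.titleRefs.add(title);  // ... and referenced as the title node
        System.out.println(extract(page)); // prints [Sunday Morning]
    }
}
```

Visiting the child list before the reference list covers the "is going to be 
processed" case as well: the reference is only followed after the children, 
by which time the shared node is already in the seen set.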

This does not fix the reversed order of the pages, but this does get us closer.

> Certain OneNote documents produce duplicate text
> ------------------------------------------------
>
>                 Key: TIKA-3970
>                 URL: https://issues.apache.org/jira/browse/TIKA-3970
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 2.7.0
>            Reporter: David Avant
>            Priority: Minor
>         Attachments: lyrics-crawlAllFileNodesFromRoot-false.txt, lyrics.docx, 
> lyrics.one, lyrics.txt
>
>
> Extracting text from certain OneNote documents produces more text than is 
> actually in the document. In this case, the OneNote document was created 
> by opening a Word document and "printing" it to OneNote.
> To reproduce the issue, open the attached "lyrics.one" using the Tika App 
> version 2.7.0 and view the plain text. Look for the phrase "Sunday 
> Morning" and observe that there are 14 occurrences. However, in the actual 
> displayed text, it occurs only once.
> The original text in this document is only about 12K characters, but the 
> extracted text from Tika is over 300K.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
