[jira] [Commented] (TIKA-3970) Certain OneNote documents produce duplicate text

Tim Allison (Jira) Thu, 23 Feb 2023 06:24:07 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692705#comment-17692705
 ]


Tim Allison commented on TIKA-3970:
-----------------------------------

Thank you, [~ndipiazza]!  Seriously, there is no warranty! :D

Should we reverse the iteration order of the pages?  I notice that we're 
getting page2 then page1 in one of our existing tests.  That ordering might be 
a feature, or it might be something we're missing in our implementation.  I 
couldn't find anything in the spec about this.

Related: I noticed a "page number" property for each of the page nodes in the 
attached file.  Maybe we could use that info to order the pages when it exists?

This would require walking the tree and caching the page order.  I'm happy to 
give it a try.
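
A minimal sketch of that idea, under stated assumptions: PageOrderSketch, 
PageNode, and orderPages are hypothetical names for illustration and are not 
part of Tika's OneNote parser.  It assumes the tree walk has already collected 
the page nodes in encounter order, and that the "page number" property may be 
absent.  Pages are sorted by page number only when every page has one; 
otherwise the encounter order is left alone.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.OptionalInt;

// Hypothetical sketch; none of these types exist in Tika.
class PageOrderSketch {

    /** Minimal stand-in for a page node collected while walking the tree. */
    static final class PageNode {
        final String title;
        final OptionalInt pageNumber; // empty when the property is absent

        PageNode(String title, OptionalInt pageNumber) {
            this.title = title;
            this.pageNumber = pageNumber;
        }
    }

    /**
     * Returns a copy of the pages sorted by page number when every page
     * has one; otherwise returns a copy in the original encounter order.
     */
    static List<PageNode> orderPages(List<PageNode> encountered) {
        List<PageNode> ordered = new ArrayList<>(encountered);
        boolean allNumbered =
                ordered.stream().allMatch(p -> p.pageNumber.isPresent());
        if (allNumbered) {
            // List.sort is stable, so pages that share a number keep
            // their encounter order relative to each other.
            ordered.sort(Comparator.comparingInt(
                    (PageNode p) -> p.pageNumber.getAsInt()));
        }
        return ordered;
    }
}
```

Falling back to encounter order when any number is missing is a design choice 
for the sketch, not something the spec is known to require.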

Side note: I'm still really frustrated that I can't open a bunch of these files 
in OneNote even after I set up my Microsoft account and save the files in 
OneDrive.


> Certain OneNote documents produce duplicate text
> ------------------------------------------------
>
>                 Key: TIKA-3970
>                 URL: https://issues.apache.org/jira/browse/TIKA-3970
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 2.7.0
>            Reporter: David Avant
>            Priority: Minor
>         Attachments: Screenshot 2023-02-21 at 3.43.08 PM.png, 
> lyrics-crawlAllFileNodesFromRoot-false.txt, lyrics.docx, lyrics.one, 
> lyrics.txt
>
>
> Extracting text from certain OneNote documents produces more text than is 
> actually in the document.  In this case, the OneNote document was created 
> by opening a Word document and "printing" it to OneNote.
> To reproduce the issue, open the attached "lyrics.one" using the Tika App 
> version 2.7.0 and view the plain text.  Look for the phrase "Sunday 
> Morning" and observe that there are 14 occurrences.  However, in the actual 
> displayed text, it occurs only once.
> The original text in this document is only about 12K characters, but the 
> extracted text from Tika is over 300K.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
