[jira] [Commented] (TIKA-3970) Certain OneNote documents produce duplicate text

Nicholas DiPiazza (Jira) Thu, 23 Feb 2023 19:03:04 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692984#comment-17692984
 ]


Nicholas DiPiazza commented on TIKA-3970:
-----------------------------------------

> Should we reverse the iteration order of the pages? I notice that we're 
> getting page2 then page1 in one of our existing tests. So this might be a 
> feature or something we're missing in our implementation? I couldn't find 
> anything in the spec about this. Related: I noticed a "page number" property 
> for each of the page nodes in the attached file. Maybe we could use that info 
> to order the pages when it exists?

Yeah sure! That sounds like a really good idea.

> This would require some walking the tree and caching page order. I'm happy to 
> give it a try.

Yeah! That's what I spent a few hours doing in the PR above. I probably need to 
spend some more time on it; so far I've only gotten this Jira's test case to work.
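
A minimal sketch of the ordering idea discussed above, assuming each page node 
may or may not carry a page number. PageNode, orderPages and titles are 
hypothetical names for illustration, not Tika classes and not the code in the 
PR. Putting unnumbered pages after numbered ones is also an assumption; the 
spec question raised above is still open.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.OptionalInt;

public class PageOrderSketch {

    // Hypothetical stand-in for a parsed page node; not a Tika class.
    public static final class PageNode {
        final String title;
        // Empty when the file carries no "page number" property for this page.
        final OptionalInt pageNumber;

        public PageNode(String title, OptionalInt pageNumber) {
            this.title = title;
            this.pageNumber = pageNumber;
        }
    }

    // Returns a copy of the pages sorted by page number.
    // Assumption: pages without a page number go after the numbered ones.
    public static List<PageNode> orderPages(List<PageNode> encountered) {
        List<PageNode> ordered = new ArrayList<>(encountered);
        // List.sort is stable, so ties and unnumbered pages keep the order
        // in which they were encountered while walking the tree.
        ordered.sort(Comparator.comparingInt(
                (PageNode p) -> p.pageNumber.orElse(Integer.MAX_VALUE)));
        return ordered;
    }

    // Convenience helper to inspect the resulting order.
    public static List<String> titles(List<PageNode> pages) {
        List<String> out = new ArrayList<>();
        for (PageNode p : pages) {
            out.add(p.title);
        }
        return out;
    }
}
```

With pages encountered as page2 then page1 (as in the existing test mentioned 
above), orderPages would hand them back as page1 then page2, provided both 
carry a page number.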

> Side note: I'm still really frustrated that I can't open a bunch of these 
> files in OneNote even after I set up my Microsoft account and save the files 
> in OneDrive.

Yeah, so there are two types of OneNote files: the ones that follow the 
MS-ONESTORE spec, and the ones that use the alternative packaging, MS-FSSHTTPD.

If you open a file from OneNote in Office 365, it will use the alternative 
packaging. 

If you open a file from OneNote in a local Microsoft Office 365 install, it 
will use the MS-ONESTORE spec.

So I think you might need to grab a copy of MS Office: 
[https://support.microsoft.com/en-us/office/use-the-office-offline-installer-f0a85fe7-118f-41cb-a791-d59cef96ad1c]
 Then you could work with this.

> Certain OneNote documents produce duplicate text
> ------------------------------------------------
>
>                 Key: TIKA-3970
>                 URL: https://issues.apache.org/jira/browse/TIKA-3970
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 2.7.0
>            Reporter: David Avant
>            Priority: Minor
>         Attachments: Screenshot 2023-02-21 at 3.43.08 PM.png, 
> lyrics-crawlAllFileNodesFromRoot-false.txt, lyrics.docx, lyrics.one, 
> lyrics.txt
>
>
> Extracting text from certain OneNote documents produces more text than is 
> actually in the document.     In this case, the OneNote document was created 
> by opening a Word document and "printing" it to the OneNote.    
> To reproduce the issue, open the attached "lyrics.one" using the Tika App 
> version 2.7.0 and view the plain text.     Look for the phrase "Sunday 
> Morning" and observe that there are 14 occurrences.    However in the actual 
> displayed text, it occurs only once.      
> The original text in this document is only about 12K characters, but the 
> extracted text from tika is over 300K.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
