[ https://issues.apache.org/jira/browse/TIKA-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691782#comment-17691782 ]

Tim Allison commented on TIKA-3970:
-----------------------------------

Finally logged into OneNote, and it isn't reading anything in the file... :P

Didier Stevens' onedump.py at least reports some embedded files:

 1: 0x000183e8 .PNG 89504e47 0x0002215c 4525bb925975a4f06f112e7227b46d6a
 2: 0x0003a698 .PNG 89504e47 0x0001b547 3dcab45de349fbc41e1dacfef4b8b96e
 3: 0x00055c18 .PNG 89504e47 0x00027e68 d44b03d8bde57ded30b6f8865a38451a
 4: 0x0007dab8 .PNG 89504e47 0x00021730 a6d0ea1d5fb8697cb0115cd63ca81d51
 5: 0x0009f220 PK.. 504b0304 0x0011b4ba 6d4875afce179ff0b10c49fc953fc071
 6: 0x001ba710 .PNG 89504e47 0x00027ae9 f9aa16b19d7dbd3acd234d283165297e
 7: 0x001e2230 .PNG 89504e47 0x00024a87 ff620b9361bbda292b6da4a1a3f807f0
 8: 0x00206cf0 .PNG 89504e47 0x00027d97 1c00ea316059ed139d09dfd8f652b407
 9: 0x0022eac0 .PNG 89504e47 0x00025c02 d668cb3b977f23a4a9d3c267df8b22ed
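
For anyone who wants to work with that listing programmatically, here is a rough sketch of a parser. It assumes the columns are index, file offset, ASCII magic, hex magic, size, and MD5, which is my reading of the output rather than anything documented here. The names `parse_listing` and `MAGIC` are made up for illustration and are not part of onedump.py. The two magic numbers are the standard PNG and ZIP signatures.

```python
import re

# Hex magic prefixes seen in the listing, mapped to a format name.
# 89504e47 is the PNG signature, 504b0304 is a ZIP local file header.
MAGIC = {"89504e47": "PNG", "504b0304": "ZIP"}

# Assumed column layout: index, offset, ASCII magic, hex magic, size, MD5.
LINE = re.compile(
    r"^\s*(\d+):\s+0x([0-9a-f]+)\s+(\S+)\s+([0-9a-f]{8})"
    r"\s+0x([0-9a-f]+)\s+([0-9a-f]{32})\s*$"
)


def parse_listing(text):
    """Return one dict per listing line that matches the assumed layout."""
    entries = []
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # skip blank lines or anything in a different shape
        index, offset, _ascii_magic, magic, size, md5 = m.groups()
        entries.append({
            "index": int(index),
            "offset": int(offset, 16),
            "kind": MAGIC.get(magic, "unknown"),
            "size": int(size, 16),
            "md5": md5,
        })
    return entries


if __name__ == "__main__":
    sample = (
        " 1: 0x000183e8 .PNG 89504e47 0x0002215c 4525bb925975a4f06f112e7227b46d6a\n"
        " 5: 0x0009f220 PK.. 504b0304 0x0011b4ba 6d4875afce179ff0b10c49fc953fc071\n"
    )
    for entry in parse_listing(sample):
        print(entry["index"], entry["kind"], entry["offset"], entry["size"])
```

On the listing above this would flag entry 5 as the only ZIP container among eight PNG images.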

> Certain OneNote documents produce duplicate text
> ------------------------------------------------
>
>                 Key: TIKA-3970
>                 URL: https://issues.apache.org/jira/browse/TIKA-3970
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 2.7.0
>            Reporter: David Avant
>            Priority: Minor
>         Attachments: Screenshot 2023-02-21 at 3.43.08 PM.png, 
> lyrics-crawlAllFileNodesFromRoot-false.txt, lyrics.docx, lyrics.one, 
> lyrics.txt
>
>
> Extracting text from certain OneNote documents produces more text than is
> actually in the document. In this case, the OneNote document was created
> by opening a Word document and "printing" it to OneNote.
> To reproduce the issue, open the attached "lyrics.one" using the Tika App
> version 2.7.0 and view the plain text. Look for the phrase "Sunday
> Morning" and observe that there are 14 occurrences. However, in the actual
> displayed text, it occurs only once.
> The original text in this document is only about 12K characters, but the
> extracted text from Tika is over 300K.
>  
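
The duplicate check in the report can be scripted once the plain text has been saved to a file. This is only a sketch: `count_phrase` is a hypothetical helper, the matching is case-insensitive by my own choice, and the file path in the usage note is a placeholder for wherever the extracted text was written.

```python
from pathlib import Path


def count_phrase(text, phrase):
    """Count non-overlapping, case-insensitive occurrences of phrase in text."""
    return text.lower().count(phrase.lower())


def summarize(path, phrase):
    """Return (character count, phrase count) for a saved text extract."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return len(text), count_phrase(text, phrase)
```

Calling `summarize("extracted.txt", "Sunday Morning")` on the saved output gives both numbers the report compares: the total character count and how often the phrase appears.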



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
