[ https://issues.apache.org/jira/browse/TIKA-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691782#comment-17691782 ]
Tim Allison commented on TIKA-3970:
-----------------------------------
Finally logged into OneNote, and it isn't reading anything in the file... :P
Didier Stevens' onedump.py at least reports some embedded files:
1: 0x000183e8 .PNG 89504e47 0x0002215c 4525bb925975a4f06f112e7227b46d6a
2: 0x0003a698 .PNG 89504e47 0x0001b547 3dcab45de349fbc41e1dacfef4b8b96e
3: 0x00055c18 .PNG 89504e47 0x00027e68 d44b03d8bde57ded30b6f8865a38451a
4: 0x0007dab8 .PNG 89504e47 0x00021730 a6d0ea1d5fb8697cb0115cd63ca81d51
5: 0x0009f220 PK.. 504b0304 0x0011b4ba 6d4875afce179ff0b10c49fc953fc071
6: 0x001ba710 .PNG 89504e47 0x00027ae9 f9aa16b19d7dbd3acd234d283165297e
7: 0x001e2230 .PNG 89504e47 0x00024a87 ff620b9361bbda292b6da4a1a3f807f0
8: 0x00206cf0 .PNG 89504e47 0x00027d97 1c00ea316059ed139d09dfd8f652b407
9: 0x0022eac0 .PNG 89504e47 0x00025c02 d668cb3b977f23a4a9d3c267df8b22ed
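For anyone who wants to work with that listing programmatically, here is a minimal sketch that parses lines of this shape and tallies them by magic number. The column meanings are my assumption from the layout (index, offset, magic as ASCII, magic as hex, size, 32-hex-digit digest), not something taken from onedump.py's documentation, and the names `Embedded`, `parse_listing` and `summarize` are made up for illustration.

```python
import re
from collections import Counter
from typing import List, NamedTuple


class Embedded(NamedTuple):
    index: int
    offset: int      # assumed: byte offset of the embedded file
    magic_hex: str   # first four bytes, as hex
    size: int        # assumed: length in bytes
    digest: str      # assumed: MD5 of the embedded file


# One listing line, e.g. "1: 0x000183e8 .PNG 89504e47 0x0002215c 4525...6d6a"
LINE = re.compile(
    r"^\s*(\d+):\s+(0x[0-9a-fA-F]+)\s+\S+\s+([0-9a-fA-F]{8})"
    r"\s+(0x[0-9a-fA-F]+)\s+([0-9a-fA-F]{32})\s*$"
)

# Well-known file signatures: PNG starts with 89 50 4E 47,
# a ZIP local file header starts with 50 4B 03 04.
MAGIC_NAMES = {"89504e47": "PNG", "504b0304": "ZIP"}


def parse_listing(text: str) -> List[Embedded]:
    """Return one Embedded record per line that matches the listing shape."""
    entries = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            entries.append(
                Embedded(
                    index=int(m.group(1)),
                    offset=int(m.group(2), 16),
                    magic_hex=m.group(3).lower(),
                    size=int(m.group(4), 16),
                    digest=m.group(5).lower(),
                )
            )
    return entries


def summarize(entries: List[Embedded]) -> Counter:
    """Count entries per recognised file type."""
    return Counter(MAGIC_NAMES.get(e.magic_hex, "unknown") for e in entries)


SAMPLE = """\
1: 0x000183e8 .PNG 89504e47 0x0002215c 4525bb925975a4f06f112e7227b46d6a
5: 0x0009f220 PK.. 504b0304 0x0011b4ba 6d4875afce179ff0b10c49fc953fc071
"""

if __name__ == "__main__":
    for kind, count in summarize(parse_listing(SAMPLE)).items():
        print(kind, count)
```

Run over the full listing above, this should group the entries into PNG and ZIP; the single ZIP-signature entry is also much larger than any of the PNGs.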
> Certain OneNote documents produce duplicate text
> ------------------------------------------------
>
> Key: TIKA-3970
> URL: https://issues.apache.org/jira/browse/TIKA-3970
> Project: Tika
> Issue Type: Bug
> Components: app
> Affects Versions: 2.7.0
> Reporter: David Avant
> Priority: Minor
> Attachments: Screenshot 2023-02-21 at 3.43.08 PM.png,
> lyrics-crawlAllFileNodesFromRoot-false.txt, lyrics.docx, lyrics.one,
> lyrics.txt
>
>
> Extracting text from certain OneNote documents produces more text than is
> actually in the document. In this case, the OneNote document was created
> by opening a Word document and "printing" it to OneNote.
> To reproduce the issue, open the attached "lyrics.one" using the Tika App
> version 2.7.0 and view the plain text. Look for the phrase "Sunday
> Morning" and observe that there are 14 occurrences. However, in the actual
> displayed text, it occurs only once.
> The original text in this document is only about 12K characters, but the
> extracted text from Tika is over 300K.
>
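The check described in the report can be scripted once the extracted plain text has been saved to a file. This is only a rough sketch: `extracted.txt` is a placeholder file name, `phrase_report` and `repeated_lines` are hypothetical helpers, and the 20-character minimum for a "repeated line" is an arbitrary choice to skip short lines that repeat naturally.

```python
from collections import Counter
from pathlib import Path
from typing import Dict


def phrase_report(text: str, phrase: str) -> Dict[str, int]:
    """Total character count plus non-overlapping, case-sensitive phrase hits."""
    return {"chars": len(text), "occurrences": text.count(phrase)}


def repeated_lines(text: str, min_len: int = 20) -> Dict[str, int]:
    """Map each stripped line of at least min_len chars seen more than once to its count."""
    counts = Counter(
        line.strip() for line in text.splitlines() if len(line.strip()) >= min_len
    )
    return {line: n for line, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Placeholder path: point this at wherever the extracted text was saved.
    path = Path("extracted.txt")
    if path.exists():
        text = path.read_text(encoding="utf-8", errors="replace")
        print(phrase_report(text, "Sunday Morning"))
        for line, n in sorted(repeated_lines(text).items(), key=lambda kv: -kv[1]):
            print(n, line[:60])
```

Comparing the two reports for the extracted text and for text exported from the original Word document would show whether the extra volume comes from whole passages being repeated.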