Gordon Vidal created TIKA-3828:
----------------------------------
Summary: OneNote Parser - Parsed Files are Missing Parts of the Content
Key: TIKA-3828
URL: https://issues.apache.org/jira/browse/TIKA-3828
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.28.4, 2.4.1
Reporter: Gordon Vidal
Attachments: TestSection1 (1).one, TikaParserErrorScreenshot.png
OneNote files that I receive from SharePoint Online are not parsed correctly:
parts of the content are missing from the parsed output. See the attached
screenshot and OneNote section file.
I have been able to reproduce this issue consistently with the following steps:
* Create a OneNote document with multiple sections.
* Edit the OneNote document using the option "Open in Desktop App" and make
changes in different sections, saving between edits. I have used both OneNote
2016 (Version 1808) and OneNote 2021 (Version 2108).
* Download a section of the OneNote document using the SharePoint Online REST
API.
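
For anyone trying to reproduce the last step, the download can be sketched
roughly as below. This is a minimal illustration, not the exact code used for
the attached file. It assumes the section is fetched through the
GetFileByServerRelativeUrl(...)/$value endpoint and that an OAuth bearer token
has already been obtained elsewhere. The class name, the SP_TOKEN environment
variable, and the example site and paths are all hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

// Hypothetical helper: downloads one .one section file from SharePoint Online.
class SectionDownloadSketch {

    // Builds <site>/_api/web/GetFileByServerRelativeUrl('<path>')/$value.
    // Single quotes in the path are doubled (OData string literal rule);
    // the multi-argument URI constructor percent-encodes spaces and other
    // characters that are illegal in a URI path.
    static URI buildFileValueUri(String siteUrl, String serverRelativePath) {
        URI site = URI.create(siteUrl);
        String sitePath = site.getPath() == null ? "" : site.getPath();
        if (sitePath.endsWith("/")) {
            sitePath = sitePath.substring(0, sitePath.length() - 1);
        }
        String quoted = serverRelativePath.replace("'", "''");
        String path = sitePath
                + "/_api/web/GetFileByServerRelativeUrl('" + quoted + "')/$value";
        try {
            return new URI(site.getScheme(), site.getAuthority(), path, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build URI for " + path, e);
        }
    }

    // Sends an authenticated GET and streams the response body to 'target'.
    // Assumption: bearer-token authentication; other auth schemes need changes.
    static Path download(URI uri, String bearerToken, Path target)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("Authorization", "Bearer " + bearerToken)
                .GET()
                .build();
        HttpResponse<Path> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofFile(target));
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.err.println("usage: <siteUrl> <serverRelativePath> <outputFile>");
            return;
        }
        // SP_TOKEN is a hypothetical environment variable holding the token.
        String token = System.getenv("SP_TOKEN");
        URI uri = buildFileValueUri(args[0], args[1]);
        System.out.println("Saved to " + download(uri, token, Path.of(args[2])));
    }
}
```

The saved file can then be passed to Tika in the usual way to compare the
extracted text against what OneNote itself shows for the section.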
I will be investigating this issue myself as well. The Tika codebase is quite
new to me, so any information about the status of this bug, its potential
cause, and any plans to fix it would be very welcome.