David Avant created TIKA-3970:
---------------------------------
Summary: Certain OneNote documents produce duplicate text
Key: TIKA-3970
URL: https://issues.apache.org/jira/browse/TIKA-3970
Project: Tika
Issue Type: Bug
Components: app
Affects Versions: 2.7.0
Reporter: David Avant
Extracting text from certain OneNote documents produces more text than the
document actually contains. In this case, the OneNote document was created by
opening a Word document and "printing" it to OneNote.
To reproduce the issue, open the attached "lyrics.one" in the Tika App
version 2.7.0 and view the plain text. Search for the phrase "Sunday Morning"
and observe that there are 14 occurrences. However, in the text as actually
displayed, it occurs only once.
The original text in this document is only about 12K characters, but the
text extracted by Tika is over 300K characters.
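
The two observations above (phrase count and total size) can also be checked
outside the GUI. The following is only a minimal sketch, assuming the plain
text has already been saved to a file, for example with something like
`java -jar tika-app-2.7.0.jar --text lyrics.one > lyrics.txt`. The script name
`count_phrase.py`, the output file name `lyrics.txt`, and the helper
`summarize` are hypothetical and not part of Tika.

```python
import sys
from pathlib import Path
from typing import Tuple


def summarize(text: str, phrase: str) -> Tuple[int, int]:
    """Return (non-overlapping, case-sensitive occurrences of phrase, total characters)."""
    return text.count(phrase), len(text)


if __name__ == "__main__":
    # Assumed usage: python count_phrase.py lyrics.txt "Sunday Morning"
    path, phrase = sys.argv[1], sys.argv[2]
    # errors="replace" keeps the script going if the dump is not clean UTF-8.
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    hits, size = summarize(text, phrase)
    print(f"{phrase!r}: {hits} occurrence(s) in {size} characters")
```

If the behavior described in this report is reproduced, the count for
"Sunday Morning" should be well above 1 and the character total far larger
than the roughly 12K characters of the original text.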
--
This message was sent by Atlassian Jira
(v8.20.10#820010)