David Avant created TIKA-3970:
---------------------------------

             Summary: Certain OneNote documents produce duplicate text
                 Key: TIKA-3970
                 URL: https://issues.apache.org/jira/browse/TIKA-3970
             Project: Tika
          Issue Type: Bug
          Components: app
    Affects Versions: 2.7.0
            Reporter: David Avant


Extracting text from certain OneNote documents produces more text than is
actually in the document. In this case, the OneNote document was created by
opening a Word document and "printing" it to OneNote.

To reproduce the issue, open the attached "lyrics.one" using Tika App
version 2.7.0 and view the plain text. Look for the phrase "Sunday Morning"
and observe that there are 14 occurrences. However, in the text as actually
displayed in OneNote, it occurs only once.

The original text in this document is only about 12K characters, but the
text extracted by Tika is over 300K characters.
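
For anyone who wants to check the numbers above without counting by hand, here
is a minimal sketch in Python. It assumes the plain text has already been saved
to a file, for example with something like
"java -jar tika-app-2.7.0.jar --text lyrics.one > extracted.txt" (the jar name
and output file name are assumptions, adjust to your setup). The function names
are hypothetical helpers, not part of Tika.

```python
def count_phrase(text: str, phrase: str) -> int:
    """Count non-overlapping, case-sensitive occurrences of phrase in text."""
    return text.count(phrase)


def summarize(text: str, phrase: str) -> dict:
    """Return the total character count and the phrase count for text."""
    return {"characters": len(text), "occurrences": count_phrase(text, phrase)}


def summarize_file(path: str, phrase: str) -> dict:
    """Read previously extracted text from path (hypothetical file) and summarize it."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return summarize(handle.read(), phrase)
```

Calling summarize_file("extracted.txt", "Sunday Morning") on the output for the
attached file should show whether the phrase count and overall length match
what is reported here.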

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
