[ https://issues.apache.org/jira/browse/TIKA-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692772#comment-17692772 ]
Hudson commented on TIKA-3970:
------------------------------
SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #1036 (See
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/1036/])
fix TIKA-3970 (#975) (github:
[https://github.com/apache/tika/commit/dd72f6b68ae14e3cf85543807d6fb9c14777ddcd])
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/test-tika-3970-dupetext.one
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/onenote/PropertyValue.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/onenote/OneNoteParserTest.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/onenote/OneNoteTreeWalker.java
> Certain OneNote documents produce duplicate text
> ------------------------------------------------
>
> Key: TIKA-3970
> URL: https://issues.apache.org/jira/browse/TIKA-3970
> Project: Tika
> Issue Type: Bug
> Components: app
> Affects Versions: 2.7.0
> Reporter: David Avant
> Priority: Minor
> Attachments: Screenshot 2023-02-21 at 3.43.08 PM.png,
> lyrics-crawlAllFileNodesFromRoot-false.txt, lyrics.docx, lyrics.one,
> lyrics.txt
>
>
> Extracting text from certain OneNote documents produces more text than is
> actually in the document. In this case, the OneNote document was created
> by opening a Word document and "printing" it to OneNote.
> To reproduce the issue, open the attached "lyrics.one" using the Tika App
> version 2.7.0 and view the plain text. Look for the phrase "Sunday
> Morning" and observe that there are 14 occurrences. However, in the
> document as displayed, the phrase occurs only once.
> The original text in this document is only about 12K characters, but the
> text extracted by Tika is over 300K characters.
>
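The duplicate count described above can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the extracted plain text has already been saved to a file whose path is passed as the first argument; the class name PhraseCounter and the method countOccurrences are hypothetical names introduced here for illustration and are not part of Tika. It uses only the JDK standard library.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika: counts how often a phrase
// appears in a plain-text file and reports the total character count.
class PhraseCounter {

    // Counts non-overlapping occurrences of phrase in text.
    static int countOccurrences(String text, String phrase) {
        if (phrase.isEmpty()) {
            return 0; // an empty phrase would otherwise match everywhere
        }
        int count = 0;
        int idx = text.indexOf(phrase);
        while (idx >= 0) {
            count++;
            // Resume searching just past the match found.
            idx = text.indexOf(phrase, idx + phrase.length());
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: PhraseCounter <text-file> [phrase]");
            return;
        }
        // Assumption: the saved text is UTF-8 encoded.
        String text = new String(
                Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        // Defaults to the phrase named in the issue description.
        String phrase = args.length > 1 ? args[1] : "Sunday Morning";
        System.out.println("characters: " + text.length());
        System.out.println("occurrences of \"" + phrase + "\": "
                + countOccurrences(text, phrase));
    }
}
```

Running it against the text saved before and after the fix should make any remaining duplication visible as a difference in both the character count and the occurrence count.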