[ https://issues.apache.org/jira/browse/TIKA-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16638585#comment-16638585 ]
Tim Allison commented on TIKA-2735:
-----------------------------------
On the master branch, I just added configuration options that let the user
turn off extraction from notes sections and from master sections. There are
three types of masters: master slide, master notes, and master handout. I think
a single option should control all three.
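
To illustrate the "one variable for all masters" idea, here is a minimal, self-contained sketch. Every name in it (SectionFilterSketch, SectionType, Config, includeNotes, includeMasters) is hypothetical and is not the Tika or POI API; it also assumes, as one possible reading, that master notes follow the masters switch rather than the notes switch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names are Tika or POI APIs.
final class SectionFilterSketch {

    /** The kinds of text-bearing sections discussed in the comment. */
    enum SectionType {
        SLIDE, NOTES, MASTER_SLIDE, MASTER_NOTES, MASTER_HANDOUT;

        /** True for all three master types. */
        boolean isMaster() {
            return this == MASTER_SLIDE || this == MASTER_NOTES || this == MASTER_HANDOUT;
        }
    }

    /** Two switches: one for notes, one shared by all three master types. */
    static final class Config {
        final boolean includeNotes;
        final boolean includeMasters;

        Config(boolean includeNotes, boolean includeMasters) {
            this.includeNotes = includeNotes;
            this.includeMasters = includeMasters;
        }
    }

    /** Decides whether text from the given section type should be extracted. */
    static boolean shouldExtract(SectionType type, Config config) {
        if (type.isMaster()) {
            // Assumption: master notes are governed by the masters switch.
            return config.includeMasters;
        }
        if (type == SectionType.NOTES) {
            return config.includeNotes;
        }
        return true; // slide body text is always extracted in this sketch
    }

    /** Lists the section types that survive the filter, in declaration order. */
    static List<SectionType> selectSections(Config config) {
        List<SectionType> kept = new ArrayList<>();
        for (SectionType type : SectionType.values()) {
            if (shouldExtract(type, config)) {
                kept.add(type);
            }
        }
        return kept;
    }
}
```

With both switches off, only SLIDE remains. With only includeMasters on, the three master types are kept alongside SLIDE while NOTES is still skipped.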
Some of the new unit tests aren't passing, and I can't figure out whether
that is due to user error, a bug in POI, or an artifact of how the documents
were generated.
I also cleaned up and, I think, improved extraction from the notes section in
ppt.
IMHO, these changes are too big to make it into 1.19.1, but they should be OK
(after large-scale regression tests) to go into 1.20.
> notes and footer contents are duplicated in extracting text from power point slides
> -----------------------------------------------------------------------------------
>
> Key: TIKA-2735
> URL: https://issues.apache.org/jira/browse/TIKA-2735
> Project: Tika
> Issue Type: Bug
> Components: handler
> Affects Versions: 1.18
> Reporter: feng ye
> Priority: Major
> Attachments: Oneslide.ppt, pptTextResults.txt
>
>
> Notes and footer contents are duplicated at the end when extracting text from
> ppt slides (like the one in the attachment). Both the input file and the text
> results are attached.
> Is there a configuration option that can be used to suppress this kind of
> duplication?