[ https://issues.apache.org/jira/browse/TIKA-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-1755:
------------------------------
Attachment: TIKA-1755.patch
Initial patch
> Make ppt and pptx paragraph/div breaks more consistent
> ------------------------------------------------------
>
> Key: TIKA-1755
> URL: https://issues.apache.org/jira/browse/TIKA-1755
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Attachments: TIKA-1755.patch
>
>
> While working on [~kiwiwings]'s patch for the new handling of PPT/PPTX, I
> found that our PPT and PPTX parsers emit <p> and <div> breaks very
> differently, especially now that we've applied the upgrades from TIKA-1707.
> I propose adding more <p> elements to the PPTX output to capture the
> sentence/bullet-level breaks, as we now do for PPT.
> There are a handful of other things that we could clean up as well, such as
> table handling.
> Some of these changes may be relevant to this
> [discussion|http://mail-archives.apache.org/mod_mbox/tika-dev/201306.mbox/%3ccal8pwky96_gkjmps6zxuoe7h7-byvpxjbktbuy1goku3skz...@mail.gmail.com%3E].
> [~shaie], any input?
> Patch and example output to follow.
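
One way to make the <p>/<div> difference described above concrete is to count
element start tags in the XHTML that each parser emits. The sketch below is not
part of the patch and does not call Tika: it uses only the JDK's SAX parser, the
class name BreakCounter is hypothetical, and the two sample strings are made-up
stand-ins for parser output, not real PPT or PPTX results.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: tallies element names in an XHTML string.
public class BreakCounter {

    /** Counts start tags by element name in a well-formed XHTML string. */
    public static Map<String, Integer> countElements(String xhtml) throws Exception {
        Map<String, Integer> counts = new TreeMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // The default factory is not namespace-aware, so qName is the raw tag name.
                counts.merge(qName, 1, Integer::sum);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Made-up samples for illustration only; real parser output will differ.
        String withParagraphs =
                "<body><div><p>First bullet</p><p>Second bullet</p></div></body>";
        String withoutParagraphs =
                "<body><div>First bullet Second bullet</div></body>";

        System.out.println("with <p>:    " + countElements(withParagraphs));
        System.out.println("without <p>: " + countElements(withoutParagraphs));
    }
}
```

Running the same count over the XHTML produced for a .ppt file and its .pptx
equivalent would show at a glance whether the two outputs break text at the
same level.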
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)