[ https://issues.apache.org/jira/browse/TIKA-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-1755.
-------------------------------
    Resolution: Fixed

r1707432

> Make ppt and pptx paragraph/div breaks more consistent
> ------------------------------------------------------
>
> Key: TIKA-1755
> URL: https://issues.apache.org/jira/browse/TIKA-1755
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Attachments: TIKA-1755.patch
>
>
> In working on [~kiwiwings]'s patch for the new handling of PPT/X, I found
> that our PPT/PPTX parsers behave very differently with <p> and <div> breaks,
> especially now that we've applied the upgrades from TIKA-1707.
> I propose adding quite a few more <p> to capture the sentence/bullet level
> breaks in PPTX as we're now doing for PPT.
> There are a handful of other things that we could clean up (table handling)
> as well.
> Some of these changes may be relevant to this
> [discussion|http://mail-archives.apache.org/mod_mbox/tika-dev/201306.mbox/%3ccal8pwky96_gkjmps6zxuoe7h7-byvpxjbktbuy1goku3skz...@mail.gmail.com%3E].
> [~shaie], any input?
> Patch and example output to follow.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)