Tim Allison created TIKA-1755:
---------------------------------

             Summary: Make ppt and pptx paragraph/div breaks more consistent
                 Key: TIKA-1755
                 URL: https://issues.apache.org/jira/browse/TIKA-1755
             Project: Tika
          Issue Type: Improvement
            Reporter: Tim Allison
            Priority: Minor


While working on [~kiwiwings]'s patch for the new handling of PPT/X, I found 
that our PPT and PPTX parsers emit <p> and <div> breaks very differently, 
especially now that we've applied the upgrades from TIKA-1707.

I propose adding quite a few more <p> elements to capture the 
sentence/bullet-level breaks in PPTX, as we're now doing for PPT.
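
One rough way to see the difference is to count the <p> and <div> start tags 
in the XHTML each parser emits. The sketch below uses only the Python standard 
library. The two sample strings and the count_breaks helper are invented for 
illustration; they are not real Tika output and not part of the patch.

```python
from collections import Counter
from html.parser import HTMLParser


class BreakCounter(HTMLParser):
    """Counts <p> and <div> start tags in an XHTML string."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook.
        if tag in ("p", "div"):
            self.counts[tag] += 1


def count_breaks(xhtml):
    """Return a dict mapping 'p' and 'div' to how often each one opens."""
    parser = BreakCounter()
    parser.feed(xhtml)
    parser.close()
    return dict(parser.counts)


# Hypothetical samples, made up for this sketch (NOT real Tika output).
ppt_xhtml = (
    '<div class="slide-content">'
    "<p>First bullet</p><p>Second bullet</p>"
    "</div>"
)
pptx_xhtml = (
    '<div class="slide-content">'
    "<p>First bullet Second bullet</p>"
    "</div>"
)

print("ppt: ", count_breaks(ppt_xhtml))
print("pptx:", count_breaks(pptx_xhtml))
```

Running the same count over the real XHTML of a matching .ppt/.pptx pair, 
before and after a change, would give a quick check on whether the two parsers 
are converging.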

There are a handful of other things that we could clean up (table handling) as 
well.

Some of these changes may be relevant to this 
[discussion|http://mail-archives.apache.org/mod_mbox/tika-dev/201306.mbox/%3ccal8pwky96_gkjmps6zxuoe7h7-byvpxjbktbuy1goku3skz...@mail.gmail.com%3E].
  [~shaie], any input?

Patch and example output to follow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
