Sam H created TIKA-1841:
---------------------------
Summary: Different XML output structure for PPT and PPTX
Key: TIKA-1841
URL: https://issues.apache.org/jira/browse/TIKA-1841
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.11
Reporter: Sam H
This issue is loosely related to TIKA-1840.
I've noticed that the XML structure Tika produces for PowerPoint PPT files
differs from the one it produces for PPTX files.
The structure for PPTX appears to be as follows:
{code}
<div class="slide-content"></div>
<div class="slide-master-content" />
<div class="slide-notes"></div> //optional
<div class="slide-comment"></div> //optional
...
<div class="slide-content"></div>
<div class="slide-master-content" />
<div class="slide-notes"></div> //optional
<div class="slide-comment"></div> //optional
{code}
Note that there's no parent slide element to indicate the start and end of each
slide.
For PPT the structure is as follows:
{code}
<div class="slideShow">
  <div class="slide">
    <div class="slide-master-content"></div>
    <div class="slide-content"></div>
    <div class="slide-notes"></div> //added in TIKA-1840
    <div class="slide-comment"></div>
  </div>
  ...
  <div class="slide">
    <div class="slide-master-content"></div>
    <div class="slide-content"></div>
    <div class="slide-notes"></div> //added in TIKA-1840
    <div class="slide-comment"></div>
  </div>
</div>
<div class="slide-notes"></div> //list of notes at the end
{code}
In my application, I'm using XPath to get the desired information. Because the
XML structure is different, I have to write separate XPath queries depending on
whether the file is PPT (old) or PPTX (new). It would be nice for Tika to return
the same XML for both.
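To illustrate the difference, here is a minimal Python sketch, not taken from my application. The two XML strings are simplified stand-ins for the structures listed above (real Tika output is namespaced XHTML with more markup), and the function names are hypothetical. The nested PPT shape can be read with a path query per slide, while the flat PPTX shape has to be regrouped from sibling order.

```python
import xml.etree.ElementTree as ET

# Simplified stand-ins for the two shapes described in this issue.
PPT_XML = """
<body>
  <div class="slideShow">
    <div class="slide">
      <div class="slide-master-content"/>
      <div class="slide-content">A</div>
      <div class="slide-notes">note A</div>
    </div>
    <div class="slide">
      <div class="slide-master-content"/>
      <div class="slide-content">B</div>
    </div>
  </div>
</body>
"""

PPTX_XML = """
<body>
  <div class="slide-content">A</div>
  <div class="slide-master-content"/>
  <div class="slide-notes">note A</div>
  <div class="slide-content">B</div>
  <div class="slide-master-content"/>
</body>
"""


def slides_nested(root):
    """PPT shape: each slide has a wrapper div, so a path query is enough."""
    slides = []
    for slide in root.findall(".//div[@class='slide']"):
        slides.append({
            "content": slide.findtext("div[@class='slide-content']"),
            "notes": slide.findtext("div[@class='slide-notes']"),
        })
    return slides


def slides_flat(root):
    """PPTX shape: no wrapper, so slides are rebuilt from sibling order.

    Assumes each slide starts with a slide-content div.
    """
    slides = []
    for div in root.findall("div"):
        cls = div.get("class")
        if cls == "slide-content":
            slides.append({"content": div.text, "notes": None})
        elif cls == "slide-notes" and slides:
            slides[-1]["notes"] = div.text
    return slides


ppt_slides = slides_nested(ET.fromstring(PPT_XML))
pptx_slides = slides_flat(ET.fromstring(PPTX_XML))
```

Both calls yield the same per-slide data here, but only because the caller knows which function to pick for which file type.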
I would propose changing the XML structure to this:
{code}
<div class="slideShow">
  <div class="slide">
    <div class="slide-master-content"></div>
    <div class="slide-content"></div>
    <div class="slide-notes"></div> //added in TIKA-1840
    <div class="slide-comment"></div>
  </div>
  ...
  <div class="slide">
    <div class="slide-master-content"></div>
    <div class="slide-content"></div>
    <div class="slide-notes"></div> //added in TIKA-1840
    <div class="slide-comment"></div>
  </div>
</div>
{code}
So, essentially, like the current PPT output, but without the list of notes at
the end (as this is also omitted for PPTX).
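As a rough idea of what the change amounts to on the PPTX side, the sketch below wraps flat PPTX-style siblings into the proposed slideShow/slide shape. It is a client-side Python illustration with a hypothetical function name, not Tika code, and it assumes every slide starts with a slide-content div as in the PPTX listing above.

```python
import xml.etree.ElementTree as ET


def wrap_flat_slides(body):
    """Build a slideShow element from flat PPTX-style sibling divs.

    A new slide wrapper is opened at each slide-content div. Child
    elements are shared with the input tree, not copied, and keep
    their original order (slide-content before slide-master-content).
    """
    show = ET.Element("div", {"class": "slideShow"})
    current = None
    for div in body.findall("div"):
        if div.get("class") == "slide-content" or current is None:
            current = ET.SubElement(show, "div", {"class": "slide"})
        current.append(div)
    return show


flat = ET.fromstring(
    '<body>'
    '<div class="slide-content">A</div>'
    '<div class="slide-master-content"/>'
    '<div class="slide-notes">n</div>'
    '<div class="slide-content">B</div>'
    '<div class="slide-master-content"/>'
    '</body>'
)
show = wrap_flat_slides(flat)
slides = show.findall("div[@class='slide']")
```

After wrapping, a query such as `div[@class='slide']/div[@class='slide-notes']` selects notes per slide regardless of the source format. The child order inside each slide still differs from the PPT output, which does not matter for class-based queries.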
On the one hand, this unifies PPT and PPTX handling; on the other, it could
break existing external code that relies on the current XML output format.
I don't know if this is something the project wants fixed or not. If so, I'm
willing to donate my time.