[jira] [Commented] (TIKA-1841) Different XML output structure for PPT and PPTX

Sam H (JIRA) Mon, 15 Feb 2016 07:03:02 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147446#comment-15147446
 ]


Sam H commented on TIKA-1841:
-----------------------------

So I've made some progress regarding this issue. This is the output I've 
achieved so far:

{code:title=PPT|linenumbers=true|language=html/xml}
<div class="slideShow">
        <div class="slide">
                <div class="slide-master-content">
                        <p>master content text</p>
                </div>
                <div class="slide-content">
                        <p>Slide title</p>
                        <p>This is the slide footer</p>
                        <p>This is body text</p>
                </div>
                <div class="slide-notes">
                        <p>general slide note</p>
                        <p>This is the footer specifically for notes and handouts</p>
                        <p>This is the header specifically for notes and handouts</p>
                </div>
                <div class="slide-comments">
                        <p><b>Author (A) - </b>This is a comment</p>
                </div>
        </div>
</div>
{code}

In the PPT format, the slide headers/footers and the notes headers/footers are 
included in the general text output. They are returned by 
{{textRunsToText()}} in {{HSLFExtractor.java}}, and I see no way to filter them 
out. 

It IS possible to get these values separately, through 
{{slide.getHeadersFooters()}}, but that would mean duplicated content (although 
it adds semantic value). 
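
A rough sketch of a third option, not something I have implemented: if the header/footer strings have already been read separately (for example via {{slide.getHeadersFooters()}}), the paragraphs already in the output could be matched against them and tagged in place, which would avoid the duplication. The class name {{FooterTagger}}, the method {{tag}} and the lookup map are all hypothetical, and the sketch uses plain strings instead of the real POI/Tika types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika or POI.
class FooterTagger {

    /**
     * Renders each paragraph as a <p> element. When the trimmed text matches
     * a known header/footer string, the element gets the mapped class
     * attribute instead of being emitted a second time.
     * No XML escaping is done here; real code would need it.
     */
    static List<String> tag(List<String> paragraphs, Map<String, String> classByText) {
        List<String> out = new ArrayList<>();
        for (String text : paragraphs) {
            String cssClass = classByText.get(text.trim());
            if (cssClass == null) {
                out.add("<p>" + text + "</p>");
            } else {
                out.add("<p class=\"" + cssClass + "\">" + text + "</p>");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Paragraphs as they currently appear in the PPT slide-content div.
        List<String> paragraphs = List.of(
            "Slide title", "This is the slide footer", "This is body text");
        // Assumed to come from the separate header/footer lookup.
        Map<String, String> known = Map.of("This is the slide footer", "slide-footer");
        tag(paragraphs, known).forEach(System.out::println);
    }
}
```

Exact text matching is fragile (body text identical to the footer would be tagged as well), so this only shows the idea.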

In PPTX, I'm unable to get these values easily. It is possible though to use 
class annotations to add semantic value, like below:

{code:title=PPTX|linenumbers=true|language=html/xml}
<div class="slideShow">
        <div class="slide">
                <div class="slide-master-content">
                        <p>master content text</p>
                </div>
                <div class="slide-content">
                        <p class="slide-title">Slide title</p>
                        <p>This is body text</p>
                        <p class="slide-footer">This is the slide footer</p>
                </div>
                <div class="slide-notes">
                        <p>general slide note</p>
                        <p class="slide-note-footer">This is the footer specifically for notes and handouts</p>
                        <p class="slide-note-header">This is the header specifically for notes and handouts</p>
                </div>
                <div class="slide-comments">
                        <p><b>Author (A) - </b>This is a comment</p>
                </div>
        </div>
</div>
{code}
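
To illustrate what the class annotations buy a consumer, here is a minimal sketch of XPath queries against a trimmed copy of the annotated output. It uses only the JDK's {{javax.xml.xpath}} API. The class name {{SlideXPathDemo}} and the helper {{select}} are made up for this example, and the sample omits the XHTML namespace that real Tika output carries, so real queries would need namespace handling.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of Tika.
class SlideXPathDemo {

    // Trimmed copy of the annotated structure (assumption: no namespace).
    static final String SAMPLE =
        "<div class=\"slideShow\"><div class=\"slide\">"
      + "<div class=\"slide-content\">"
      + "<p class=\"slide-title\">Slide title</p>"
      + "<p>This is body text</p>"
      + "<p class=\"slide-footer\">This is the slide footer</p>"
      + "</div></div></div>";

    /** Evaluates the expression and returns the text of every matching node. */
    static List<String> select(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Only the footer paragraph.
        System.out.println(select(SAMPLE,
            "//div[@class='slide-content']/p[@class='slide-footer']"));
        // Body text only: paragraphs without any class annotation.
        System.out.println(select(SAMPLE,
            "//div[@class='slide-content']/p[not(@class)]"));
    }
}
```

Without the annotations, the second query has no way to tell the footer apart from body text.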

The proposed PPTX structure seems decent to me. The question is whether I should 
add the slide-footer / slide-note-footer redundantly (but with semantic tagging) 
to the PPT output, or not.

> Different XML output structure for PPT and PPTX
> -----------------------------------------------
>
>                 Key: TIKA-1841
>                 URL: https://issues.apache.org/jira/browse/TIKA-1841
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Sam H
>
> This issue is slightly related to TIKA-1840.
> I've noticed that the XML structure of PowerPoint (PPT) and PPTX files is 
> different. 
> The structure for PPTX seems to be as follows:
> {code}
> <div class="slide-content"></div>
> <div class="slide-master-content" />
> <div class="slide-notes"></div> //optional
> <div class="slide-comment"></div> //optional
> ...
> <div class="slide-content"></div>
> <div class="slide-master-content" />
> <div class="slide-notes"></div> //optional
> <div class="slide-comment"></div> //optional
> {code}
> Note that there's no parent slide element to indicate the start and end of 
> each slide.
> For PowerPoint (PPT) the structure is as follows:
> {code}
> <div class="slideShow">
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div> 
>   </div>
>   ...
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div>
>   </div>
> </div>
> <div class="slide-notes">
> {code}
> In my application, I'm using XPath to get the desired information. As the 
> XML structure is different, I have to use different XPath queries depending 
> on whether the file is PPT (old) or PPTX (new). It would be nice for Tika to 
> return the same XML for both.
> I would propose changing the XML structure to this:
> {code}
> <div class="slideShow">
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div> 
>   </div>
>   ...
>   <div class="slide">
>     <div class="slide-master-content"></div>
>     <div class="slide-content"></div>
>     <div class="slide-notes"></div> //added in TIKA-1840
>     <div class="slide-comment"></div>
>   </div>
> </div>
> {code}
> So, essentially, like the current PPT output, but without the list of notes 
> at the end (as this is also omitted for PPTX).
> On the one hand this generalizes PPT(X) handling, on the other it can break 
> existing (external) functionality relying on a specific XML output format.
> I don't know if this is something the project wants fixed or not. If so, I'm 
> willing to donate my time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
