2011/9/25 Svante Schubert <[email protected]>
> Am 24.09.2011 14:26, schrieb Rob Weir:
> > On Sat, Sep 24, 2011 at 4:22 AM, Ram Kane <[email protected]> wrote:
> >> Hi,
> >>
> >> I need to extract all text (header, footer, comments, endnotes, etc.) from
> >> an ODT document. I need to do it on a page-by-page basis. I'm aware that
> >> ODTs are basically structured by paragraphs and headings, but I'd like to
> >> know if there's a way to achieve what I need.
> >>
> >> Thanks a lot.
> >>
> > Good question.
> >
> > With WYSIWYG word processors, page numbers are calculated when the
> > document is loaded, based on your currently configured printer, font
> > metrics, etc. So there is nothing at the level of the ODF markup that
> > is a structural equivalent to a "page". ODF is similar to HTML in
> > this regard. It has paragraphs, tables, etc., but line breaks and
> > page breaks are calculated at runtime.
> >
> > However, starting in ODF 1.1, the format does allow an option for a
> > word processor to save "soft" page breaks in the document. This was
> > intended to help with accessibility tools, screen readers, etc. If
> > your word processor supports this (and many do) then you can try
> > looking for the <text:soft-page-break> element. This would
> > indicate where the pages broke in the word processor that last saved
> > the document. But there is no guarantee that every ODF document will
> > have soft page breaks.
> >
> > So in theory you could walk the document, looking for
> > <text:soft-page-break> and determine pages that way.
> >
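
To make the suggestion above concrete, here is a minimal sketch of walking
content.xml and starting a new page at every <text:soft-page-break>. It uses
only the JDK's DOM parser, not ODFDOM or the Simple API. The class name
SoftPageBreakSplitter and the method splitPages are made up for illustration.
It assumes the document was saved with soft page breaks, and it ignores
headers/footers (they live in styles.xml) as well as text:s, text:tab and
text:line-break.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of ODFDOM or the Simple API.
final class SoftPageBreakSplitter {
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    // Returns the text of content.xml, cut at each text:soft-page-break.
    static List<String> splitPages(String contentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to match the text: namespace
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(contentXml)));

        List<StringBuilder> pages = new ArrayList<>();
        pages.add(new StringBuilder());
        walk(doc.getDocumentElement(), pages);

        List<String> result = new ArrayList<>();
        for (StringBuilder page : pages) {
            result.add(page.toString().trim());
        }
        return result;
    }

    private static void walk(Node node, List<StringBuilder> pages) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            pages.get(pages.size() - 1).append(node.getNodeValue());
            return;
        }
        boolean inTextNs = node.getNodeType() == Node.ELEMENT_NODE
                && TEXT_NS.equals(node.getNamespaceURI());
        if (inTextNs && "soft-page-break".equals(node.getLocalName())) {
            pages.add(new StringBuilder()); // everything after this is a new page
            return;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, pages);
        }
        // Separate paragraphs and headings with a newline.
        if (inTextNs && ("p".equals(node.getLocalName()) || "h".equals(node.getLocalName()))) {
            pages.get(pages.size() - 1).append('\n');
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<office:text"
                + " xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\""
                + " xmlns:text=\"" + TEXT_NS + "\">"
                + "<text:p>Page one</text:p>"
                + "<text:p><text:soft-page-break/>Page two</text:p>"
                + "</office:text>";
        List<String> pages = splitPages(xml);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("Page " + (i + 1) + ": " + pages.get(i));
        }
    }
}
```

If the document contains no soft page breaks at all, the whole text simply
comes back as a single "page", which matches the caveat above that not every
ODF document has them.
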
> Rob already explained the problem and the solution.
> I would like to add a question: where should the functionality for
> retrieving pages live, for instance if the questioner were willing to
> provide a patch?
> Certainly at the highest API level, and therefore in the Simple API (or
> DOC API), as those will be merged.
>
> Daisy or Devin, you once implemented text extraction for the complete
> document, right?
> org.odftoolkit.odfdom.incubator.doc.text.OdfEditableTextExtractor
> Is this also accessible via the Simple API? I could not find it.
>
> In this context, when I looked for the extraction functionality, I
> stumbled over the methods getFooter()/getHeader().
>
We provide the following methods; it is not quite as simple as you assume:
/**
 * Get the Standard Page footer of this text document.
 *
 * @return the Standard Page footer of this text document.
 * @since 0.4.5
 */
public Footer getFooter();

/**
 * Get the footer of this text document.
 *
 * @param isFirstPage
 *            if <code>isFirstPage</code> is true, return the First Page
 *            footer, otherwise return Standard Page footer.
 *
 * @return the footer of this text document.
 * @since 0.5
 */
public Footer getFooter(boolean isFirstPage);

/**
 * Get the Standard Page header of this text document.
 *
 * @return the Standard Page header of this text document.
 * @since 0.4.5
 */
public Header getHeader();

/**
 * Get the header of this text document.
 *
 * @param isFirstPage
 *            if <code>isFirstPage</code> is true, return the First Page
 *            header, otherwise return Standard Page header.
 *
 * @return the header of this text document.
 * @since 0.5
 */
public Header getHeader(boolean isFirstPage);
> You return those from the document without a context. But there might be
> multiple headers/footers in a document:
> one pair for each master page style. Therefore you need a context;
> otherwise your simplification is only a good guess that is sometimes wrong.
>
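
To illustrate the point about one header/footer pair per master page, here is
a rough sketch that reads styles.xml with the JDK's DOM parser and collects
the header text for each <style:master-page>, keyed by its style:name. This
is not how the Simple API does it; the class MasterPageHeaders and the method
headersByMasterPage are hypothetical names, and variants such as
style:header-left are ignored.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of ODFDOM or the Simple API.
final class MasterPageHeaders {
    static final String STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0";

    // Maps each master page name to the plain text of its style:header ("" if none).
    static Map<String, String> headersByMasterPage(String stylesXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to match the style: namespace
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(stylesXml)));

        Map<String, String> result = new LinkedHashMap<>();
        NodeList masterPages = doc.getElementsByTagNameNS(STYLE_NS, "master-page");
        for (int i = 0; i < masterPages.getLength(); i++) {
            Element masterPage = (Element) masterPages.item(i);
            String name = masterPage.getAttributeNS(STYLE_NS, "name");
            NodeList headers = masterPage.getElementsByTagNameNS(STYLE_NS, "header");
            String text = headers.getLength() > 0
                    ? headers.item(0).getTextContent().trim()
                    : "";
            result.put(name, text);
        }
        return result;
    }
}
```

With a map like this, the master page name is the "context" asked for above:
a caller would first need to know which master page applies to the part of
the document in question, and only then look up its header.
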
> Regards,
> Svante
>
--
-Devin