Re: Is there a way to extract text on a page basis from odt ?

Devin Han Sun, 25 Sep 2011 23:05:26 -0700

2011/9/25 Svante Schubert <[email protected]>

> Am 24.09.2011 14:26, schrieb Rob Weir:
> > On Sat, Sep 24, 2011 at 4:22 AM, Ram Kane <[email protected]> wrote:
> >> Hi,
> >>
> >> I need to extract all text (header, footer, comments, endnote, etc) from an
> >> ODT document. I need to do it on a page by page basis. I'm aware that ODTs
> >> are basically structured by paragraphs and headings, but I'd like to know if
> >> there's a way to achieve what I need.
> >>
> >> Thanks a lot.
> >>
> > Good question.
> >
> > With WYSIWYG word processors, page numbers are calculated when the
> > document is loaded, based on your currently configured printer, font
> > metrics, etc.  So there is nothing at the level of the ODF markup that
> > is a structural equivalent to a "page".  ODF is similar to HTML in
> > this regard.  It has paragraphs, tables, etc., but line breaks and
> > page breaks are calculated at runtime.
> >
> > However, starting in ODF 1.1, the format does allow an option for a
> > word processor to save "soft" page breaks in the document.  This was
> > intended to help with accessibility tools, screen readers, etc.  If
> > your word processor supports this (and many do) then you can try
> > looking for the <text:soft-page-break> element.  This would
> > indicate where the pages broke in the word processor that last saved
> > the document.  But there is no guarantee that every ODF document will
> > have soft page breaks.
> >
> > So in theory you could walk the document, looking for
> > <text:soft-page-break> and determine pages that way.
> >
> Rob has already described both the problem and the solution.
> I would like to add a question: where should the functionality for
> retrieving pages live, for instance if the questioner were willing to
> provide a patch?
> Certainly at the highest API level, i.e. in the Simple API (or
> DOC API), since those will be merged.
>
> Daisy or Devin, you once implemented the text extraction for the complete
> document, right?
> org.odftoolkit.odfdom.incubator.doc.text.OdfEditableTextExtractor
> Is this also accessible via the Simple API? I could not find it.
>
> In this context, when I looked for the extraction functionality, I
> stumbled over the methods getFooter()/getHeader().
>


We supply the following methods; the API is not quite as simple as you describe:

    /**
     * Get the Standard Page footer of this text document.
     *
     * @return the Standard Page footer of this text document.
     * @since 0.4.5
     */
    public Footer getFooter();

    /**
     * Get the footer of this text document.
     *
     * @param isFirstPage
     *            if <code>isFirstPage</code> is true, return the First Page
     *            footer, otherwise return Standard Page footer.
     *
     * @return the footer of this text document.
     * @since 0.5
     */
    public Footer getFooter(boolean isFirstPage);

    /**
     * Get the Standard Page header of this text document.
     *
     * @return the Standard Page header of this text document.
     * @since 0.4.5
     */
    public Header getHeader();

    /**
     * Get the header of this text document.
     *
     * @param isFirstPage
     *            if <code>isFirstPage</code> is true, return the First Page
     *            header, otherwise return Standard Page header.
     *
     * @return the header of this text document.
     * @since 0.5
     */
    public Header getHeader(boolean isFirstPage);



> You return those from the document without any context, but there might be
> multiple headers and footers in a document: one pair for each master page
> style. You therefore need a context; otherwise your simplification is only
> a good guess that is sometimes wrong.
>
> Regards,
> Svante
>

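To make the master-page point concrete: in an ODF package, header and footer content is stored in styles.xml, with one style:master-page element per master page. The following is only a rough sketch using plain JDK XML parsing rather than the Simple API. The class and method names (MasterPageHeaders, headerTextByMasterPage) are made up for illustration, and it assumes the caller has already opened styles.xml from the package as a stream.

```java
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of ODFDOM or the Simple API.
public class MasterPageHeaders {
    static final String STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0";

    /** Maps each master page name to the plain text of its style:header ("" if none). */
    public static Map<String, String> headerTextByMasterPage(InputStream stylesXml)
            throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for the *NS lookups below
        Document dom = factory.newDocumentBuilder().parse(stylesXml);

        Map<String, String> result = new LinkedHashMap<>();
        NodeList masterPages = dom.getElementsByTagNameNS(STYLE_NS, "master-page");
        for (int i = 0; i < masterPages.getLength(); i++) {
            Element masterPage = (Element) masterPages.item(i);
            String name = masterPage.getAttributeNS(STYLE_NS, "name");
            // Only style:header is read here; style:header-left is ignored.
            NodeList headers = masterPage.getElementsByTagNameNS(STYLE_NS, "header");
            String text = headers.getLength() == 0
                    ? ""
                    : headers.item(0).getTextContent().trim();
            result.put(name, text);
        }
        return result;
    }
}
```

The same loop with "footer" instead of "header" would collect the footers. A document with several master pages yields several entries, which is why a single context-free getHeader() can only return one of them.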

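Coming back to the original question, Rob's suggestion of walking the document and splitting at <text:soft-page-break> could look roughly like the sketch below. It uses only JDK XML parsing on content.xml, not ODFDOM or the Simple API, and the names (SoftPageBreakSplitter, splitPages) are invented for this example. It assumes the document was saved with soft page breaks; without them everything ends up on one "page". Headers and footers are not covered, and elements such as text:s or text:tab are not expanded.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper, not part of ODFDOM or the Simple API.
public class SoftPageBreakSplitter {
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";
    static final String OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0";

    /** Returns the body text of content.xml, one string per page as last saved. */
    public static List<String> splitPages(InputStream contentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document dom = factory.newDocumentBuilder().parse(contentXml);

        List<StringBuilder> pages = new ArrayList<>();
        pages.add(new StringBuilder()); // the first page has no preceding break
        Node body = dom.getElementsByTagNameNS(OFFICE_NS, "text").item(0);
        if (body != null) {
            walk(body, pages);
        }
        List<String> result = new ArrayList<>();
        for (StringBuilder page : pages) {
            result.add(page.toString().trim());
        }
        return result;
    }

    // Depth-first walk in document order; a soft page break starts a new page.
    private static void walk(Node node, List<StringBuilder> pages) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                pages.get(pages.size() - 1).append(child.getNodeValue());
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                boolean inTextNs = TEXT_NS.equals(child.getNamespaceURI());
                String name = child.getLocalName();
                if (inTextNs && "soft-page-break".equals(name)) {
                    pages.add(new StringBuilder());
                } else {
                    walk(child, pages);
                    // Separate paragraphs and headings with a newline.
                    if (inTextNs && ("p".equals(name) || "h".equals(name))) {
                        pages.get(pages.size() - 1).append('\n');
                    }
                }
            }
        }
    }
}
```

To run it against a real file, content.xml can be read from the .odt package with java.util.zip.ZipFile and passed in as the stream. Since the breaks reflect the layout in whichever word processor last saved the document, the result is only as reliable as that saved state.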

-- 
-Devin
