Re: Is there a way to extract text on a page basis from odt ?

Devin Han Sun, 25 Sep 2011 23:12:02 -0700

2011/9/25 Rob Weir <[email protected]>

> On Sun, Sep 25, 2011 at 10:30 AM, Wolf Halton <[email protected]>
> wrote:
> > Depending on the actual purpose of page-based extraction, couldn't a
> > filter based on counting line returns work?
> >
>
> Word wrapping and line splitting are similar to page breaks.  Unless
> the user enters an explicit carriage return, the document doesn't know
> where one line ends and another begins.  The line breaks are
> calculated when the editor renders the page based on font metrics and
> page dimensions.
>
> Of course, if we had layout code in the ODF Toolkit, that would allow
> us to solve this problem, in theory.



Yes, layout should be an important feature that the ODF Toolkit needs to
cover, not only in text documents but also in presentation documents.


> But you still have
> complications.  For example, the fonts available to a process on the
> server might be different than those available to the document
> author's client.  Or the Toolkit code might be running on a "headless"
> server without any graphics context available.  But that shouldn't
> stop us from solving this where we can.
>
> -Rob
>
> > http://sourcefreedom.com
> > On Sep 25, 2011 10:09 AM, "Rob Weir" <[email protected]> wrote:
> >> On Sun, Sep 25, 2011 at 3:57 AM, Svante Schubert
> >> <[email protected]> wrote:
> >>> Am 24.09.2011 14:26, schrieb Rob Weir:
> >>>> On Sat, Sep 24, 2011 at 4:22 AM, Ram Kane <[email protected]> wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I need to extract all text (header, footer, comments, endnote, etc)
> >>>>> from an ODT document. I need to do it on a page by page basis. I'm
> >>>>> aware that ODTs are basically structured by paragraphs and headings,
> >>>>> but I'd like to know if there's a way to achieve what I need.
> >>>>>
> >>>>> Thanks a lot.
> >>>>>
> >>>> Good question.
> >>>>
> >>>> With WYSIWYG word processors, page numbers are calculated when the
> >>>> document is loaded, based on your currently configured printer, font
> >>>> metrics, etc.  So there is nothing at the level of the ODF markup that
> >>>> is a structural equivalent to a "page".  ODF is similar to HTML in
> >>>> this regard.  It has paragraphs, tables, etc., but line breaks and
> >>>> page breaks are calculated at runtime.
> >>>>
> >>>> However, starting in ODF 1.1, the format does allow an option for a
> >>>> word processor to save "soft" page breaks in the document.  This was
> >>>> intended to help with accessibility tools, screen readers, etc.  If
> >>>> your word processor supports this (and many do) then you can try
> >>>> looking for the <text:soft-page-break> element.  This would
> >>>> indicate where the pages broke in the word processor that last saved
> >>>> the document.  But there is no guarantee that every ODF document will
> >>>> have soft page breaks.
> >>>>
> >>>> So in theory you could walk the document, looking for
> >>>> <text:soft-page-break> and determine pages that way.
> >>>>
> >>> Rob already gave the answer on problematics and the solution.
> >>> I would like to add the question, where to place the functionality to
> >>> receive pages, for instance if the questioner would be willing to
> >>> provide a patch?
> >>> Certainly in the highest level of API, therefore in the Simple API (or
> >>> DOC API), as those will be merged.
> >>>
> >>> Daisy or Devin you once implemented the text extraction for the
> >>> complete document, right?
> >>> org.odftoolkit.odfdom.incubator.doc.text.OdfEditableTextExtractor
> >>> Is this as well accessible via the Simple API? I could not find it.
> >>>
> >>
> >> org.odftoolkit.simple.common.TextExtractor
> >>
> >> But the problem is that there is no page element. A page is
> >> only defined by what is between soft page breaks (taking into account
> >> the implicit page start at the start of the document and the implicit
> >> page end at the end of the document). But there is no parent in the
> >> DOM that contains page content.
> >>
> >> I could imagine a synthetic parent "page" object that could be
> >> returned by the navigation API, and could then give access to the
> >> "contained" content of that page. But it would need to be read-only,
> >> I think. Changing the content of the page, inserting/deleting, even
> >> changing the header/footer can affect the pagination.
> >>
> >> Something to consider: even without a page-oriented UI, we should
> >> consider invalidating and removing all existing soft page breaks when
> >> document content is modified, or at least give an easy method for a
> >> programmer to do this if they wish.
> >>
> >>
> >> -Rob
> >>
> >>> In this context, when I looked for the extraction functionality, I
> >>> stumbled over the method getFooter()/getHeader().
> >>> You return those from the document without a context. But there might
> >>> be multiple headers/footers in a document.
> >>> One pair for each master page style, therefore you need a context or
> >>> your simplification is only a good guess, but sometimes wrong.
> >>>
> >>> Regards,
> >>> Svante
> >>>
> >
>

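To make Rob's suggestion above of walking the document for
<text:soft-page-break> concrete, here is a minimal sketch. It uses only the
Python standard library rather than the ODF Toolkit, and the function names
split_pages and pages_from_odt are made up for illustration. It assumes the
document was saved with soft page breaks, reads only the body in content.xml
(headers and footers live in styles.xml and are not covered), and ignores
<text:s>, <text:tab> and <text:line-break>.

```python
import xml.etree.ElementTree as ET
import zipfile

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
SOFT_BREAK = f"{{{TEXT_NS}}}soft-page-break"
PARAGRAPH_TAGS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def split_pages(content_xml):
    """Return the body text as a list of strings, one per page.

    Pages are delimited by text:soft-page-break elements; the start and
    end of the document act as the implicit first start and last end.
    """
    root = ET.fromstring(content_xml)
    body = root.find(f"{{{OFFICE_NS}}}body/{{{OFFICE_NS}}}text")
    if body is None:
        return []
    pages = [[]]  # implicit page start at the start of the document

    def walk(node):
        if node.tag == SOFT_BREAK:
            pages.append([])  # everything after this lands on a new page
        if node.text:
            pages[-1].append(node.text)
        for child in node:
            walk(child)
            if child.tail:
                pages[-1].append(child.tail)
        if node.tag in PARAGRAPH_TAGS:
            pages[-1].append("\n")  # keep paragraphs on separate lines

    walk(body)
    return ["".join(parts).strip() for parts in pages]


def pages_from_odt(path):
    """Read content.xml from an .odt package and split it into pages."""
    with zipfile.ZipFile(path) as odt:
        return split_pages(odt.read("content.xml"))
```

Note that a break can fall in the middle of a paragraph, so one paragraph
may be split across two entries of the returned list. If the file contains
no soft page breaks at all, the result is a single "page".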

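Rob's other point above, dropping stale soft page breaks once the content
has been modified, could look roughly like this. Again this is a plain
standard-library sketch and not an ODF Toolkit API; strip_soft_page_breaks
is a hypothetical helper name. The only subtle part is that ElementTree
keeps the text following an element in that element's tail, so the tail has
to be moved before the break is removed.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
SOFT_BREAK = f"{{{TEXT_NS}}}soft-page-break"


def strip_soft_page_breaks(root):
    """Remove every text:soft-page-break below root, keeping all text.

    Returns the number of breaks removed.
    """
    removed = 0
    for parent in list(root.iter()):
        prev = None  # last surviving sibling before the current child
        for child in list(parent):
            if child.tag != SOFT_BREAK:
                prev = child
                continue
            if child.tail:
                # Re-attach the text that followed the break.
                if prev is None:
                    parent.text = (parent.text or "") + child.tail
                else:
                    prev.tail = (prev.tail or "") + child.tail
            parent.remove(child)
            removed += 1
    return removed
```

After this the document simply has no pagination hints until a word
processor saves it again.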

-- 
-Devin
