Re: Is there a way to extract text on a page basis from odt ?

Wolf Halton Sun, 25 Sep 2011 07:31:07 -0700

Depending on the actual purpose of page-based extraction, couldn't a filter
based on counting line returns work?
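
A minimal sketch of such a filter, assuming the text has already been
extracted as a sequence of lines. The function name and the
lines_per_page value are hypothetical, and the result only approximates
pages: it will not match the layout the word processor actually computed.

```python
from typing import Iterable, List


def paginate_by_lines(lines: Iterable[str], lines_per_page: int = 50) -> List[List[str]]:
    """Group lines into approximate "pages" of at most lines_per_page lines."""
    if lines_per_page <= 0:
        raise ValueError("lines_per_page must be positive")
    pages: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        current.append(line)
        if len(current) == lines_per_page:
            # The current page is full, so start a new one.
            pages.append(current)
            current = []
    if current:
        # Keep the final, partially filled page.
        pages.append(current)
    return pages
```

Whether a fixed line count is good enough depends on why the pages are
needed; for anything that must agree with the printed page numbers it
will not be.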


http://sourcefreedom.com
On Sep 25, 2011 10:09 AM, "Rob Weir" <[email protected]> wrote:
> On Sun, Sep 25, 2011 at 3:57 AM, Svante Schubert
> <[email protected]> wrote:
>> Am 24.09.2011 14:26, schrieb Rob Weir:
>>> On Sat, Sep 24, 2011 at 4:22 AM, Ram Kane <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> I need to extract all text (header, footer, comments, endnote, etc.)
>>>> from an ODT document. I need to do it on a page by page basis. I'm
>>>> aware that ODTs are basically structured by paragraphs and headings,
>>>> but I'd like to know if there's a way to achieve what I need.
>>>>
>>>> Thanks a lot.
>>>>
>>> Good question.
>>>
>>> With WYSIWYG word processors, page numbers are calculated when the
>>> document is loaded, based on your currently configured printer, font
>>> metrics, etc.  So there is nothing at the level of the ODF markup that
>>> is a structural equivalent to a "page".  ODF is similar to HTML in
>>> this regard.  It has paragraphs, tables, etc., but line breaks and
>>> page breaks are calculated at runtime.
>>>
>>> However, starting in ODF 1.1, the format does allow an option for a
>>> word processor to save "soft" page breaks in the document.  This was
>>> intended to help with accessibility tools, screen readers, etc.  If
>>> your word processor supports this (and many do) then you can try
>>> looking for the <text:soft-page-break> element.  This would
>>> indicate where the pages broke in the word processor that last saved
>>> the document.  But there is no guarantee that every ODF document will
>>> have soft page breaks.
>>>
>>> So in theory you could walk the document, looking for
>>> <text:soft-page-break> and determine pages that way.
>>>
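The walk described above could look roughly like this. It is a sketch
using only the Python standard library, not the ODF Toolkit; the
function names are hypothetical. It assumes the document was saved with
soft page breaks, reads only the body of content.xml (headers and
footers live in styles.xml), and ignores elements such as text:s,
text:tab and text:line-break.

```python
import xml.etree.ElementTree as ET
import zipfile
from typing import List

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
SOFT_BREAK = f"{{{TEXT_NS}}}soft-page-break"
BLOCKS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def split_pages(content_xml: bytes) -> List[str]:
    """Split the body text of content.xml at each text:soft-page-break."""
    root = ET.fromstring(content_xml)
    body = root.find(f"{{{OFFICE_NS}}}body/{{{OFFICE_NS}}}text")
    pages: List[List[str]] = [[]]  # implicit page start at document start

    def walk(elem: ET.Element) -> None:
        if elem.tag == SOFT_BREAK:
            pages.append([])  # everything after this belongs to the next page
        if elem.text:
            pages[-1].append(elem.text)
        for child in elem:
            walk(child)
            if child.tail:
                pages[-1].append(child.tail)
        if elem.tag in BLOCKS:
            pages[-1].append("\n")  # separate paragraphs and headings

    if body is not None:
        walk(body)
    return ["".join(parts).strip() for parts in pages]


def pages_from_odt(path: str) -> List[str]:
    """Read content.xml out of an .odt package and split it into pages."""
    with zipfile.ZipFile(path) as package:
        return split_pages(package.read("content.xml"))
```

A document without any soft page breaks comes back as a single page, so
a one-element result does not prove the document fits on one page.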
>> Rob already described both the problem and the solution.
>> I would like to add a question: where should the functionality for
>> retrieving pages go, for instance if the questioner were willing to
>> provide a patch?
>> Certainly in the highest level of API, therefore in the Simple API (or
>> DOC API), as those will be merged.
>>
>> Daisy or Devin, you once implemented the text extraction for the complete
>> document, right?
>> org.odftoolkit.odfdom.incubator.doc.text.OdfEditableTextExtractor
>> Is this also accessible via the Simple API? I could not find it.
>>
>
> org.odftoolkit.simple.common.TextExtractor
>
> But the problem is that there is no page element. A page is
> only defined by what is between soft page breaks (taking into account
> the implicit page start at the start of the document and the implicit
> page end at the end of the document). But there is no parent in the
> DOM that contains page content.
>
> I could imagine a synthetic parent "page" object that could be
> returned by the navigation API, and could then give access to the
> "contained" content of that page. But it would need to be read-only,
> I think. Changing the content of the page, inserting/deleting, even
> changing the header/footer can affect the pagination.
>
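One way to picture such a synthetic, read-only page object is sketched
below. The Page class and build_pages helper are hypothetical and are
not part of any ODF Toolkit API; the sketch assumes the paragraphs have
already been grouped per page by some earlier step.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple


@dataclass(frozen=True)
class Page:
    """Immutable view of the content found between two soft page breaks."""

    number: int
    paragraphs: Tuple[str, ...]

    @property
    def text(self) -> str:
        return "\n".join(self.paragraphs)


def build_pages(paragraphs_per_page: Iterable[Sequence[str]]) -> Tuple[Page, ...]:
    """Wrap already-grouped paragraphs in read-only Page objects, numbered from 1."""
    return tuple(
        Page(number, tuple(paragraphs))
        for number, paragraphs in enumerate(paragraphs_per_page, start=1)
    )
```

Because the dataclass is frozen and holds tuples, callers cannot edit a
page in place, which matches the concern that any edit may change the
pagination.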
> Something to consider: even without a page-oriented UI, we should
> consider invalidating and removing all existing soft page breaks when
> document content is modified, or at least give an easy method for a
> programmer to do this if they wish.
>
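A sketch of what such a helper might do, written against Python's
standard ElementTree rather than the ODF Toolkit DOM. The function name
is hypothetical. The text that follows each break is reattached to the
preceding node so that no content is lost.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
SOFT_BREAK = f"{{{TEXT_NS}}}soft-page-break"


def strip_soft_page_breaks(root: ET.Element) -> int:
    """Remove every text:soft-page-break below root; return how many were removed."""
    removed = 0
    for parent in list(root.iter()):
        previous = None  # last sibling kept before the current child
        for child in list(parent):
            if child.tag == SOFT_BREAK:
                # ElementTree stores the text after an element in its tail,
                # so move that text before dropping the element.
                tail = child.tail or ""
                if previous is None:
                    parent.text = (parent.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
                removed += 1
            else:
                previous = child
    return removed
```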
>
> -Rob
>
>> In this context, when I looked for the extraction functionality, I
>> stumbled across the methods getFooter()/getHeader().
>> You return those from the document without a context. But there might be
>> multiple headers/footers in a document:
>> one pair for each master page style. Therefore you need a context;
>> otherwise your simplification is only a good guess and sometimes wrong.
>>
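To illustrate the point about context, here is a sketch that lists the
header and footer text per master page from styles.xml. It uses only
the Python standard library; the function name is hypothetical and it
is not the getHeader()/getFooter() implementation discussed above.

```python
import xml.etree.ElementTree as ET
from typing import Dict

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"


def headers_and_footers(styles_xml: bytes) -> Dict[str, Dict[str, str]]:
    """Map each style:master-page name to the text of its header and footer."""
    root = ET.fromstring(styles_xml)
    result: Dict[str, Dict[str, str]] = {}
    for master in root.iter(f"{{{STYLE_NS}}}master-page"):
        name = master.get(f"{{{STYLE_NS}}}name", "")
        entry: Dict[str, str] = {}
        for kind in ("header", "footer"):
            elem = master.find(f"{{{STYLE_NS}}}{kind}")
            # A master page may define only one of the two, or neither.
            entry[kind] = "".join(elem.itertext()).strip() if elem is not None else ""
        result[name] = entry
    return result
```

The master page name is the context a caller would need to supply to
get the right header/footer pair.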
>> Regards,
>> Svante
>>
