Depending on the actual purpose of the page-based extraction, couldn't a filter based on counting line returns work?
http://sourcefreedom.com

On Sep 25, 2011 10:09 AM, "Rob Weir" <[email protected]> wrote:
> On Sun, Sep 25, 2011 at 3:57 AM, Svante Schubert
> <[email protected]> wrote:
>> On 24.09.2011 14:26, Rob Weir wrote:
>>> On Sat, Sep 24, 2011 at 4:22 AM, Ram Kane <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> I need to extract all text (header, footer, comments, endnote, etc) from an
>>>> ODT document. I need to do it on a page by page basis. I'm aware that ODTs
>>>> are basically structured by paragraphs and headings, but I'd like to know if
>>>> there's a way to achieve what I need.
>>>>
>>>> Thanks a lot.
>>>>
>>> Good question.
>>>
>>> With WYSIWYG word processors, page numbers are calculated when the
>>> document is loaded, based on your currently configured printer, font
>>> metrics, etc. So there is nothing at the level of the ODF markup that
>>> is a structural equivalent to a "page". ODF is similar to HTML in
>>> this regard. It has paragraphs, tables, etc., but line breaks and
>>> page breaks are calculated at runtime.
>>>
>>> However, starting in ODF 1.1, the format does allow an option for a
>>> word processor to save "soft" page breaks in the document. This was
>>> intended to help with accessibility tools, screen readers, etc. If
>>> your word processor supports this (and many do) then you can try
>>> looking for the <text:soft-page-break> element. This would
>>> indicate where the pages broke in the word processor that last saved
>>> the document. But there is no guarantee that every ODF document will
>>> have soft page breaks.
>>>
>>> So in theory you could walk the document, looking for
>>> <text:soft-page-break> and determine pages that way.
>>>
>> Rob already gave the answer on the problem and the solution.
>> I would like to add the question of where to place the functionality to
>> retrieve pages, for instance if the questioner would be willing to
>> provide a patch.
>> Certainly in the highest level of API, therefore in the Simple API (or
>> DOC API), as those will be merged.
>>
>> Daisy or Devin, you once implemented the text extraction for the complete
>> document, right?
>> org.odftoolkit.odfdom.incubator.doc.text.OdfEditableTextExtractor
>> Is this accessible via the Simple API as well? I could not find it.
>>
>
> org.odftoolkit.simple.common.TextExtractor
>
> But the problem is that there is no page element. A page is
> only defined by what is between soft page breaks (taking into account
> the implicit page start at the start of the document and the implicit
> page end at the end of the document). But there is no parent in the
> DOM that contains page content.
>
> I could imagine a synthetic parent "page" object that could be
> returned by the navigation API, and could then give access to the
> "contained" content of that page. But it would need to be read-only,
> I think. Changing the content of the page, inserting/deleting, even
> changing the header/footer can affect the pagination.
>
> Something to consider: even without a page-oriented UI, we should
> consider invalidating and removing all existing soft page breaks when
> document content is modified, or at least give an easy method for a
> programmer to do this if they wish.
>
>
> -Rob
>
>> In this context, when I looked for the extraction functionality, I
>> stumbled over the methods getFooter()/getHeader().
>> You return those from the document without a context. But there might be
>> multiple headers/footers in a document,
>> one pair for each master page style. Therefore you need a context, or
>> your simplification is only a good guess that is sometimes wrong.
>>
>> Regards,
>> Svante
>>
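
For illustration, the soft-page-break approach Rob describes in the quoted thread could be sketched roughly as follows. This is not ODF Toolkit code: it is a minimal Python sketch using only the standard library, and the names read_content_xml, split_pages and strip_soft_page_breaks are hypothetical helpers made up for this example. It assumes the document was saved with soft page breaks, looks only at paragraphs and headings in content.xml (headers and footers live in styles.xml and are not covered), and ignores elements such as text:s, text:tab and text:line-break.

```python
import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
SOFT_BREAK = f"{{{TEXT_NS}}}soft-page-break"
BLOCK_TAGS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def read_content_xml(odt_path):
    """Read content.xml out of an .odt package (a plain zip archive)."""
    with zipfile.ZipFile(odt_path) as package:
        return package.read("content.xml").decode("utf-8")


def split_pages(content_xml):
    """Return the body text as one string per page, split at soft page breaks."""
    root = ET.fromstring(content_xml)
    body = root.find(f".//{{{OFFICE_NS}}}text")
    if body is None:
        return []
    pages = [[]]  # the first page starts implicitly at the start of the document

    def walk(elem, in_block):
        if elem.tag == SOFT_BREAK:
            pages.append([])  # everything after this marker is on the next page
        else:
            inside = in_block or elem.tag in BLOCK_TAGS
            if inside and elem.text:
                pages[-1].append(elem.text)
            for child in elem:
                walk(child, inside)
            if elem.tag in BLOCK_TAGS:
                pages[-1].append("\n")  # end of a paragraph or heading
        # Text following an element belongs to the enclosing paragraph, if any.
        if in_block and elem.tail:
            pages[-1].append(elem.tail)

    for child in body:
        walk(child, False)
    return ["".join(parts).strip() for parts in pages]


def strip_soft_page_breaks(root):
    """Remove every soft page break in place, keeping the text that follows it."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag != SOFT_BREAK:
                continue
            index = list(parent).index(child)
            tail = child.tail or ""
            if index == 0:
                parent.text = (parent.text or "") + tail
            else:
                previous = parent[index - 1]
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)
```

Something like split_pages(read_content_xml("document.odt")) would then give a read-only, page-by-page view of the text. Because the breaks only reflect the layout at the time the document was last saved, strip_soft_page_breaks shows one way the stale markers could be dropped after the content has been modified.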
