2011/9/25 Rob Weir <[email protected]>

> On Sun, Sep 25, 2011 at 10:30 AM, Wolf Halton <[email protected]> wrote:
> > Depending on the actual purpose of page-based extraction, couldn't a
> > filter based on counting line returns?
> >
>
> Word wrapping and line splitting are similar to page breaks. Unless
> the user enters an explicit carriage return, the document doesn't know
> where one line ends and another begins. The line breaks are
> calculated when the editor renders the page based on font metrics and
> page dimensions.
>
> Of course, if we had layout code in the ODF Toolkit, that would allow
> us to solve this problem, in theory.

Yes, layout should be an important feature that the ODF Toolkit needs to
cover, not only in text documents but also in presentation documents.

> But you still have
> complications. For example, the fonts available to a process on the
> server might be different than those available to the document
> author's client. Or the Toolkit code might be running on a "headless"
> server without any graphics context available. But that shouldn't
> stop us from solving this where we can.
>
> -Rob
>
> > http://sourcefreedom.com
> > On Sep 25, 2011 10:09 AM, "Rob Weir" <[email protected]> wrote:
> >> On Sun, Sep 25, 2011 at 3:57 AM, Svante Schubert
> >> <[email protected]> wrote:
> >>> On 24.09.2011 14:26, Rob Weir wrote:
> >>>> On Sat, Sep 24, 2011 at 4:22 AM, Ram Kane <[email protected]> wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I need to extract all text (header, footer, comments, endnote, etc) from an
> >>>>> ODT document. I need to do it on a page by page basis. I'm aware that ODTs
> >>>>> are basically structured by paragraphs and headings, but I'd like to know if
> >>>>> there's a way to achieve what I need.
> >>>>>
> >>>>> Thanks a lot.
> >>>>>
> >>>> Good question.
> >>>>
> >>>> With WYSIWYG word processors, page numbers are calculated when the
> >>>> document is loaded, based on your currently configured printer, font
> >>>> metrics, etc. So there is nothing at the level of the ODF markup that
> >>>> is a structural equivalent to a "page". ODF is similar to HTML in
> >>>> this regard. It has paragraphs, tables, etc., but line breaks and
> >>>> page breaks are calculated at runtime.
> >>>>
> >>>> However, starting in ODF 1.1, the format does allow an option for a
> >>>> word processor to save "soft" page breaks in the document. This was
> >>>> intended to help with accessibility tools, screen readers, etc. If
> >>>> your word processor supports this (and many do) then you can try
> >>>> looking for the <text:soft-page-break> element.
> >>>> This would indicate where the pages broke in the word processor that
> >>>> last saved the document. But there is no guarantee that every ODF
> >>>> document will have soft page breaks.
> >>>>
> >>>> So in theory you could walk the document, looking for
> >>>> <text:soft-page-break> and determine pages that way.
> >>>>
> >>> Rob already described the problem and the solution.
> >>> I would like to add a question: where should the functionality to
> >>> retrieve pages be placed, for instance if the questioner were willing
> >>> to provide a patch?
> >>> Certainly in the highest level of API, therefore in the Simple API (or
> >>> DOC API), as those will be merged.
> >>>
> >>> Daisy or Devin, you once implemented the text extraction for the complete
> >>> document, right?
> >>> org.odftoolkit.odfdom.incubator.doc.text.OdfEditableTextExtractor
> >>> Is this also accessible via the Simple API? I could not find it.
> >>>
> >>
> >> org.odftoolkit.simple.common.TextExtractor
> >>
> >> But the problem is that there is no page element. A page is
> >> only defined by what is between soft page breaks (taking into account
> >> the implicit page start at the start of the document and the implicit
> >> page end at the end of the document). But there is no parent in the
> >> DOM that contains page content.
> >>
> >> I could imagine a synthetic parent "page" object that could be
> >> returned by the navigation API, and could then give access to the
> >> "contained" content of that page. But it would need to be read-only,
> >> I think. Changing the content of the page, inserting/deleting, even
> >> changing the header/footer can affect the pagination.
> >>
> >> Something to consider: even without a page-oriented UI, we should
> >> consider invalidating and removing all existing soft page breaks when
> >> document content is modified, or at least give an easy method for a
> >> programmer to do this if they wish.
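
To make the soft page break idea concrete, here is a minimal sketch in plain
Python (standard library only, not the ODF Toolkit). It works on a parsed
content.xml, treats everything between <text:soft-page-break> elements as one
page, and also shows one way to drop stale breaks after an edit. The names
split_pages, strip_soft_page_breaks and SAMPLE are made up for this example.
It assumes the document was saved with soft page breaks, ignores text:s,
text:tab and text:line-break, and does not touch headers or footers, which
are stored in styles.xml.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
SOFT_BREAK = f"{{{TEXT_NS}}}soft-page-break"
BLOCK_TAGS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def split_pages(root: ET.Element) -> list[str]:
    """Return the body text of a parsed content.xml, one string per page."""
    body = root.find(f".//{{{OFFICE_NS}}}text")
    if body is None:
        raise ValueError("no office:text element found")
    pages = [[]]  # the start of the document is an implicit page start

    def walk(node, in_block):
        # Only collect text that sits inside a paragraph or heading.
        in_block = in_block or node.tag in BLOCK_TAGS
        if in_block and node.text:
            pages[-1].append(node.text)
        for child in node:
            if child.tag == SOFT_BREAK:
                pages.append([])  # text after the break goes to the next page
            else:
                walk(child, in_block)
            if in_block and child.tail:
                pages[-1].append(child.tail)
        if node.tag in BLOCK_TAGS:
            pages[-1].append("\n")  # keep paragraphs and headings apart

    walk(body, False)
    return ["".join(parts).strip() for parts in pages]


def strip_soft_page_breaks(root: ET.Element) -> int:
    """Remove every soft page break in place and return how many were removed."""
    removed = 0
    for parent in list(root.iter()):  # snapshot, since the tree is modified
        prev = None
        for child in list(parent):
            if child.tag != SOFT_BREAK:
                prev = child
                continue
            if child.tail:  # keep the text that followed the break
                if prev is None:
                    parent.text = (parent.text or "") + child.tail
                else:
                    prev.tail = (prev.tail or "") + child.tail
            parent.remove(child)
            removed += 1
    return removed


# Hypothetical, hand-written content.xml fragment used only for illustration.
SAMPLE = (
    f'<office:document-content xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">'
    "<office:body><office:text>"
    "<text:h>Title</text:h>"
    "<text:p>First page text.</text:p>"
    "<text:soft-page-break/>"
    "<text:p>Second page starts here and <text:soft-page-break/>"
    "runs onto page three.</text:p>"
    "</office:text></office:body></office:document-content>"
)

if __name__ == "__main__":
    for number, text in enumerate(split_pages(ET.fromstring(SAMPLE)), start=1):
        print(f"--- page {number} ---")
        print(text)
```

For a real file, content.xml can be read out of the .odt package with
zipfile.ZipFile(path).read("content.xml") and passed to ET.fromstring. The
result is read-only in spirit: after any change to the content the stored
breaks no longer mean anything, which is why the second function exists.
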
> >>
> >>
> >> -Rob
> >>
> >>> In this context, when I looked for the extraction functionality, I
> >>> stumbled over the methods getFooter()/getHeader().
> >>> You return those from the document without a context. But there might be
> >>> multiple headers/footers in a document.
> >>> One pair for each master page style, therefore you need a context or
> >>> your simplification is only a good guess, but sometimes wrong.
> >>>
> >>> Regards,
> >>> Svante
> >>>
> >
>

-- 
-Devin
