On Sat, Sep 24, 2011 at 4:22 AM, Ram Kane <[email protected]> wrote:
> Hi,
>
> I need to extract all text (header, footer, comments, endnote, etc) from an
> ODT document. I need to do it on a page by page basis. I'm aware that ODTs
> are basically structured by paragraphs and headings, but i'd like to know if
> there's a way to achieve what i need.
>
> Thanks a lot.
>
Good question. With WYSIWYG word processors, page numbers are calculated when the document is loaded, based on your currently configured printer, font metrics, etc. So there is nothing at the level of the ODF markup that is a structural equivalent to a "page". ODF is similar to HTML in this regard. It has paragraphs, tables, etc., but line breaks and page breaks are calculated at runtime.

However, starting in ODF 1.1, the format does allow an option for a word processor to save "soft" page breaks in the document. This was intended to help with accessibility tools, screen readers, etc. If your word processor supports this (and many do) then you can try looking for the <text:soft-page-break> element. This would indicate where the pages broke in the word processor that last saved the document. But there is no guarantee that every ODF document will have soft page breaks.

So in theory you could walk the document, looking for <text:soft-page-break> and determine pages that way.

-Rob
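
For illustration, here is a rough sketch of that walk in Python, using only the standard library. It assumes the document was saved with soft page breaks, and it only looks at the body text in content.xml. The function names split_pages and pages_from_odt are made up for this example and are not part of any ODF toolkit.

```python
import zipfile
import xml.etree.ElementTree as ET

# ODF namespace URIs used in content.xml
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"

SOFT_BREAK = f"{{{TEXT_NS}}}soft-page-break"
BLOCK_TAGS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}  # paragraphs and headings


def split_pages(content_xml):
    """Return a list of page texts, split at <text:soft-page-break/>."""
    root = ET.fromstring(content_xml)
    body = root.find(f"{{{OFFICE_NS}}}body/{{{OFFICE_NS}}}text")
    pages = [[]]  # each page is a list of text fragments

    def walk(elem, in_block=False):
        if elem.tag == SOFT_BREAK:
            pages.append([])  # everything after this goes on a new page
            return
        # Only collect character data inside paragraphs and headings,
        # so indentation whitespace between elements is ignored.
        in_block = in_block or elem.tag in BLOCK_TAGS
        if in_block and elem.text:
            pages[-1].append(elem.text)
        for child in elem:
            walk(child, in_block)
            if in_block and child.tail:
                pages[-1].append(child.tail)
        if elem.tag in BLOCK_TAGS:
            pages[-1].append("\n")  # end of paragraph or heading

    if body is not None:
        walk(body)
    return ["".join(parts).strip() for parts in pages]


def pages_from_odt(path):
    """Read content.xml out of an .odt file (a zip archive) and split it."""
    with zipfile.ZipFile(path) as odt:
        return split_pages(odt.read("content.xml"))
```

Calling pages_from_odt("report.odt") (the file name is a placeholder) would give one string per page as laid out by the application that last saved the file. If the document contains no soft page breaks, the whole body comes back as a single page. Headers and footers are stored in styles.xml rather than content.xml, so they would need a separate pass, and elements such as <text:s> and <text:tab> are not expanded into whitespace here.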
