I think it is clear by now that the original question is not well-defined. There are too many unanswered questions concerning pagination, how pages are referenced by number, and the headers/footers that also apply to that page. If one needs to match the text flow using hard line breaks, it is even trickier. And paragraphing is not all that easy either (especially once there are lists, auto-numbering, frames of material, etc.).
It seems best to revisit the problem statement and extract a grounded case: What is the problem that needs to be solved, and what are the constraints on an acceptable solution?

Ram, can you please say more about the problem you want to solve? What would be the simplest acceptable result?

- Dennis

MORE THOUGHTS

Returning to the stratosphere, perhaps the closest case that could be trimmed down to what is needed for this might be an export-as-PDF filter, since that grounds the presentation of the document.

This reminds me of a related problem: producing formatted *text* documents from word-processing documents. That becomes more like printing to a file by a filter, as Wolf suggested. I remember wanting to force that to happen in various word-processor packages. One example is to produce a properly formatted IETF RFC text document. I remember a workstation publishing system that had good filters for this and for the reverse, importing print-layout text documents and creating word-processing documents. The name escapes me. The company died with the advent of Macintosh and Windows PCs.

It could also be done by a parametric print driver, but that is a platform-dependent solution. The problem is that a print driver might not have any way to coordinate document layout and interior page structure, just paper sizes, non-printable areas, and some limited characteristics for marking features and character codes.

I just managed to install a Generic / Text Only printer as a print-to-file device on Windows 7. It took a bit of looking around and I found it. It even ran the standard Printer Test Page successfully (with no graphics or colors, just the text with page margins and a fixed line width). Here the game is still to do the actual rendering, but to the equivalent of an ASCII page, as forced by the parameters of the print driver visible to the ODF consumer.
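To make the "ASCII page" idea concrete, here is a minimal sketch in Python of what rendering to a fixed-width text page amounts to. It is not ODF Toolkit code and it does not read an .odt file: it assumes the paragraphs have already been extracted as plain strings, and the function name `paginate` and the default values for `line_width` and `lines_per_page` are my own illustrative choices. It also ignores headers, footers, lists, frames, and everything else that makes the real problem hard.

```python
import textwrap


def paginate(paragraphs, line_width=72, lines_per_page=58):
    """Wrap paragraphs to a fixed width, then cut the lines into pages.

    Hypothetical helper: the defaults are illustrative, not taken from
    any ODF consumer or print driver.
    """
    lines = []
    for para in paragraphs:
        # textwrap.wrap returns [] for empty text; keep it as a blank line.
        wrapped = textwrap.wrap(para, width=line_width) or [""]
        lines.extend(wrapped)
        lines.append("")  # blank line between paragraphs
    if lines and lines[-1] == "":
        lines.pop()  # drop the separator after the last paragraph
    # Each page is simply the next run of lines_per_page lines.
    return [
        lines[start:start + lines_per_page]
        for start in range(0, len(lines), lines_per_page)
    ]


if __name__ == "__main__":
    text = ["word " * 100]
    for width in (20, 10):
        pages = paginate(text, line_width=width, lines_per_page=5)
        print(f"width={width}: {len(pages)} pages")
```

The same text lands on a different number of pages as soon as the line width changes, which is the point: "page N" only has a meaning once the page geometry (and, for real documents, the font metrics) has been fixed.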
-----Original Message-----
From: Rob Weir [mailto:[email protected]]
Sent: Sunday, September 25, 2011 07:39
To: [email protected]
Subject: Re: Is there a way to extract text on a page basis from odt ?

On Sun, Sep 25, 2011 at 10:30 AM, Wolf Halton <[email protected]> wrote:
> Depending on the actual purpose of page-based extraction, couldn't a filter
> be based on counting line returns?
>

Word wrapping and line splitting are similar to page breaks. Unless the user enters an explicit carriage return, the document doesn't know where one line ends and another begins. The line breaks are calculated when the editor renders the page, based on font metrics and page dimensions.

Of course, if we had layout code in the ODF Toolkit, that would allow us to solve this problem, in theory. But you still have complications. For example, the fonts available to a process on the server might be different from those available to the document author's client. Or the Toolkit code might be running on a "headless" server without any graphics context available.

But that shouldn't stop us from solving this where we can.

-Rob

[ ... ]
