RE: Is there a way to extract text on a page basis from odt ?

Dennis E. Hamilton Sun, 25 Sep 2011 09:37:19 -0700

I think it is clear by now that the original question is not 
well-defined.  There are too many unanswered questions concerning
pagination, how pages are referenced by number, and the 
headers/footers that also apply to that page. If one needs to match
the text flow using hard line breaks, it is even trickier.  And
paragraphing is not all that easy either (especially once there
are lists, auto-numbering, frames of material, etc.).

It seems best to revisit the problem statement and extract a
grounded case: What is the problem that needs to be solved, and
what are the constraints on an acceptable solution?

Ram, can you please say more about the problem you want to solve?
What would be the simplest-acceptable result?

 - Dennis

MORE THOUGHTS

Returning to the stratosphere, perhaps the closest case that could 
be trimmed down to what is needed for this might be an export-as-PDF
filter, since that grounds the presentation of the document.

This reminds me of a related problem: producing formatted 
*text* documents from word-processing documents.  That becomes more
like printing to a file by a filter, as Wolf suggested.  I remember 
wanting to force that to happen in various word-processor packages. 
 One example is to produce a properly-formatted IETF RFC text 
document.

I remember a workstation publishing system that had good filters 
for this and the reverse, importing print-layout text documents 
and creating word-processing documents.  The name escapes me.
The company died with the advent of Macintosh and Windows PCs.

It could also be done by a parametric print driver, but that is
a platform-dependent solution. The problem is that a print driver
might not have any way to coordinate document layout and interior
page structure, just paper sizes, non-printable areas, and some
limited characteristics for marking features and character codes.

I just managed to install a Generic / Text Only printer as a 
print-to-file device on Windows 7.  It took a bit of looking around
before I found it.  It even ran the standard Printer Test Page 
successfully (with no graphics or colors, just the text with page
margins and a fixed line width).  But here the game is still to do
the actual rendering, only to the equivalent of an ASCII page, as
forced by the parameters of the print driver visible to the ODF
consumer.
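
A minimal sketch of what rendering to "the equivalent of an ASCII page"
could look like, assuming plain paragraphs only (no lists, frames,
headers, or footers).  The names paginate and text_of_page are
hypothetical, and the defaults of 72 columns and 58 lines per page are
illustrative stand-ins for whatever the print driver would dictate:

```python
import textwrap

def paginate(paragraphs, columns=72, lines_per_page=58):
    """Render paragraphs as fixed-width lines, then cut them into pages.

    Assumption: a monospaced "text only" device, so a page is fully
    described by a column count and a line count.
    """
    lines = []
    for para in paragraphs:
        lines.extend(textwrap.wrap(para, width=columns))
        lines.append("")  # blank line between paragraphs
    if lines:
        lines.pop()  # drop the trailing blank line
    # Slice the flat list of lines into consecutive pages.
    return [lines[i:i + lines_per_page]
            for i in range(0, len(lines), lines_per_page)]

def text_of_page(pages, number):
    """Return the text of the page with the given 1-based number."""
    return "\n".join(pages[number - 1])

sample = ["alpha beta gamma delta"] * 3
pages = paginate(sample, columns=11, lines_per_page=4)
print(text_of_page(pages, 1))
```

Only once the page parameters are fixed does "the text on page N" have a
definite answer; changing either parameter moves the page boundaries.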

-----Original Message-----
From: Rob Weir [mailto:[email protected]] 
Sent: Sunday, September 25, 2011 07:39
To: [email protected]
Subject: Re: Is there a way to extract text on a page basis from odt ?

On Sun, Sep 25, 2011 at 10:30 AM, Wolf Halton <[email protected]> wrote:
> Depending on the actual purpose of page-based extraction, couldn't a filter
> be based on counting line returns?
>

Word wrapping and line splitting are similar to page breaks.  Unless
the user enters an explicit carriage return, the document doesn't know
where one line ends and another begins.  The line breaks are
calculated when the editor renders the page based on font metrics and
page dimensions.
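
To illustrate, here is a small sketch under a strong simplifying
assumption: a monospaced font, so every character is one unit wide and
the "page dimension" is just a column count.  A real editor would
measure each glyph with the font's metrics.  The helper wrap_paragraph
is hypothetical and is not part of the ODF Toolkit:

```python
import textwrap

def wrap_paragraph(text, columns):
    """Break one paragraph into lines no wider than `columns` characters.

    Assumes a monospaced font; proportional fonts would need per-glyph
    widths from the font's metrics instead of a character count.
    """
    return textwrap.wrap(text, width=columns)

paragraph = ("The line breaks are calculated when the editor renders "
             "the page based on font metrics and page dimensions.")

# The stored text is identical; only the rendering width differs.
narrow = wrap_paragraph(paragraph, 30)
wide = wrap_paragraph(paragraph, 60)
print(len(narrow), "lines at 30 columns;", len(wide), "lines at 60 columns")
```

The same stored paragraph yields a different number of lines at each
width, which is why counting line returns in the file does not recover
the rendered lines.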

Of course, if we had layout code in the ODF Toolkit, that would allow
us to solve this problem, in theory.  But you still have
complications.  For example, the fonts available to a process on the
server might be different than those available to the document
author's client.  Or the Toolkit code might be running on a "headless"
server without any graphics context available.  But that shouldn't
stop us from solving this where we can.

-Rob

[ ... ]
