Re: Alt-Design status: XML handling

2002-11-29 Thread Peter B. West
Bertrand Delacretaz wrote:

Great work Peter!
It makes a lot of sense to use higher-level than SAX events, and thanks for 
explaining this so clearly.

If you allow me a suggestion regarding the structure of the code: maybe using 
some table-driven stuff instead of the many if statements in 
FoSimplePageMaster would be more readable?

Something like:

class EventHandler
{
  EventHandler(String regionName,boolean discardSpace,boolean required)
  ...
}

/** table of event handlers that must be applied, in order */
EventHandler [] handlers = {
  new EventHandler(FObjectNames.REGION_BODY,true,true),
  new EventHandler(FObjectNames.REGION_BEFORE,true,false)
};

...then, in FoSimplePageMaster(...) loop over handlers and let them process 
the events.

I don't know if this applies in general but it might be clearer to read and 
less risky to modify.
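For what it's worth, the table idea can be fleshed out into something runnable. This is only a sketch: ChildSpec, the integer constants, and accepts() are hypothetical stand-ins for FOP's FObjectNames constants and event machinery, with a neutral name chosen instead of EventHandler.

```java
// Sketch of the table-driven idea above; ChildSpec and the integer
// constants are hypothetical stand-ins for FOP's own types.
public class ChildSpecDemo {
    static final int REGION_BODY = 1;
    static final int REGION_BEFORE = 2;

    /** One row of the table: which child, and whether it is required. */
    static class ChildSpec {
        final int foType;
        final boolean discardSpace;
        final boolean required;
        ChildSpec(int foType, boolean discardSpace, boolean required) {
            this.foType = foType;
            this.discardSpace = discardSpace;
            this.required = required;
        }
    }

    /** Table of children that must be consumed, in order. */
    static final ChildSpec[] CHILDREN = {
        new ChildSpec(REGION_BODY, true, true),
        new ChildSpec(REGION_BEFORE, true, false)
    };

    /** Walk the table over the incoming children instead of nested ifs. */
    static boolean accepts(int[] events) {
        int i = 0;
        for (ChildSpec spec : CHILDREN) {
            if (i < events.length && events[i] == spec.foType) {
                i++;                  // child present: consume it
            } else if (spec.required) {
                return false;         // required child missing
            }                         // optional child absent: skip row
        }
        return i == events.length;    // no unexpected trailing children
    }

    public static void main(String[] args) {
        System.out.println(accepts(new int[] { REGION_BODY, REGION_BEFORE }));
        System.out.println(accepts(new int[] { REGION_BEFORE }));
    }
}
```

The loop in accepts() is the "loop over handlers" Bertrand describes: adding a new child means adding one row to the table rather than another branch of if statements.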

Bertrand,

Sorry this one slipped through the cracks.  Some such approach may be a 
good idea, but I would be loath to call it EventHandler.  The whole 
point of pull parsing is to move away from event handling.  I would 
think of these more as methods with parameters like optional, single 
or multiple, any.

Peter
--
Peter B. West  [EMAIL PROTECTED]  http://www.powerup.com.au/~pbwest/
Lord, to whom shall we go?


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, email: [EMAIL PROTECTED]



Re: Alt-Design status: XML handling

2002-11-27 Thread Oleg Tkachenko
Rhett Aultman wrote:

But, a pull model can be grafted onto a push model by implementing what 
amounts to a specialized buffer of the pushed data that accepts pull 
queries...no?

Yes; another alternative is an additional thread with the same duties. See 
Aleksander Slominski's paper: 
http://www.extreme.indiana.edu/xgws/papers/xml_push_pull/node3.html

--
Oleg Tkachenko
eXperanto team
Multiconn Technologies, Israel


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, email: [EMAIL PROTECTED]



RE: Alt-Design status: XML handling

2002-11-26 Thread Rhett Aultman
Responses below.

-Original Message-
From: Peter B. West [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 26, 2002 2:25 AM
To: [EMAIL PROTECTED]
Subject: Re: Alt-Design status: XML handling


This is not a problem for at least the maintenance version of the code. 
  All of the processing is triggered by incoming SAX events, and occurs 
within the SAX callbacks.  These are synchronous events, so the parsing 
stalls until the callback returns.  Page-sequence rendering, e.g., 
occurs within the endElement() callback of an fo:page-sequence element.


True...I did not take synchronous event handling into consideration, although I'm 
not entirely sure that synchronous event handling is entirely prudent 
performance-wise either...though that's for different reasons.


And, I believe, it might be wrong, though I must read the full source text.  The 
push model can be seen as a special case of a pull model in the sense of "pull 
everything ASAP, now and until the data is exhausted."  But, a pull model can be 
grafted onto a push model by implementing what amounts to a specialized buffer of the 
pushed data that accepts pull queries...no?
 

Which is what I have done.


Seems like a logical way to implement pull over push.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, email: [EMAIL PROTECTED]




RE: Alt-Design status: XML handling

2002-11-26 Thread Arved Sandstrom
 -Original Message-
 From: Peter B. West [mailto:[EMAIL PROTECTED]]
 Sent: November 26, 2002 3:25 AM
 To: [EMAIL PROTECTED]
 Subject: Re: Alt-Design status: XML handling

 Rhett,

 To comment on only two aspects of your posting.

 Rhett Aultman wrote:
 
  -Original Message-
  From: Oleg Tkachenko [mailto:[EMAIL PROTECTED]]
 
  Generally, event-driven processing is a pretty good thing.  The
 critical issue with it, though, is the ratio of event production
 to event processing.  If that number is anything greater than 1,
 then more events are being produced in a stretch of time than can
 be effectively processed in that stretch of time.  Events start
 to queue up, taking up memory.  If it happens enough, the heap
 starts to get a little too full, the gc runs a little too much,
 and that causes processing time to suffer even further.  Under
 most circumstances, event-based processing is like using a garden
 hose to water a bed of flowers.  It works just fine.  Under more
 intense cases, though, it can be more like using a garden hose to
 fill a small container of water, then leaving the hose laying
 around (spilling water all over the lawn) while the container
 gets carried off somewhere.

Actually, it really matters where the events are coming from. An HTTP server
has no control over how many requests it gets, so your description above is
apt. But for FOP (disregarding FOPServlet) everything is one process - the
XML parser, the formatter, the renderer - so it's ultimately procedural;
there may be an internal boundary where an event/callback system is used,
but it's all one thread so nothing queues up at all.

The only reason to adopt your approach (and I am not saying I don't like it)
is because it's easier to understand.

Regards,
Arved


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, email: [EMAIL PROTECTED]




RE: Alt-Design status: XML handling

2002-11-26 Thread Rhett Aultman
Responses below.

-Original Message- 
From: Arved Sandstrom [mailto:[EMAIL PROTECTED]] 
Sent: Tue 11/26/2002 6:42 PM 
To: [EMAIL PROTECTED] 
Cc: 
Subject: RE: Alt-Design status: XML handling

Actually, it really matters where the events are coming from. An HTTP server
has no control over how many requests it gets, so your description above is
apt. But for FOP (disregarding FOPServlet) everything is one process - the
XML parser, the formatter, the renderer - so it's ultimately procedural;
there may be an internal boundary where an event/callback system is used,
but it's all one thread so nothing queues up at all.


Yes...as I said, I caught myself off-guard because I tend to use an event 
model only when I need to multicast an event or when I need to be able to send events 
between two threads.  With the single thread you're describing, the performance hits I 
describe aren't an issue.  There can be other issues there, but I really don't want to 
bother with them because I know they're not relevant.

Peter's case for not wanting event-driven is much more sound, and I have to 
say I agree with it.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, email: [EMAIL PROTECTED]


Re: Alt-Design status: XML handling

2002-11-25 Thread Peter B. West
Oleg Tkachenko wrote:

Peter B. West wrote:


Why is it easier for developers to use?  Is it because the API is less 
complex or more easily understood?  Not really.  As you point out, the 
SAX API is not all that complex.  The problem is that the processing 
model of SUX is completely inverted.

Well, I believe it's more a philosophical question, or a question of 
programming style: push vs pull, imperative languages vs declarative 
languages, etc., the ancient holy war. One likes to define rules, aka 
SAX handlers; another likes to weave a web of if statements, only to 
be able to control processing order ;) Both pull and push have pros and 
cons, and it's a pity Java still doesn't have a full-fledged pull 
parsing API (btw, James Clark is working on StAX[1], so it's a matter of 
time).

I don't believe it is only a matter of style.  I think the detrimental 
effects of push for general programming are glaringly obvious.  That, I 
think, rather than catering for simple-minded developers, is what 
motivated MS's abandonment of SAX.  I speak as a long-time anti-MS bigot.


 You may have come to like writing XSLT that way.


It's the only way to write non-hello-world stylesheets in XSLT, actually. 
Don't forget, XSLT is a declarative language, so probably analogies with 
Java are just irrelevant; they are different beasts. The question is 
what is good for the fo tree building stuff? Probably you're right, pull is 
more suitable, but the bad thing is that the real input is a SAX stream, hence 
we must translate push to pull (funnily enough, MS considers this task 
unfeasible in the XmlReader documentation).

I haven't read the documentation, but it may be that they are referring 
to the infeasibility of moving code built around SAX to an XmlReader 
environment.

Hence the next question is the 
cost of your interim buffer; what do you think its peak and 
average size could be?

At the moment it is more expensive than it need be; there is no event 
pool.  I am writing one now.  It's fairly trivial, as you can imagine. 
The buffer is implemented as a circular buffer, currently of 128 
elements, but it has been set at 32, and 64 should be more than enough. 
The circular buffer places an upper limit on the size, and 
synchronizes (in a broad sense) the activities of the producer (parser) 
and the consumer (tree builder).

parser:
 until buffer full, write events to buffer
 notify
 wait

tree builder:
 wait
 until buffer empty, read events from buffer
 notify

In the SAX model, the throttle on parser throughput is the downstream 
processing that is immediately triggered by the start and end events 
generated by the parser.

In the buffered model, the throttle is the circular buffer and the waits 
that occur on it.
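The producer/consumer protocol sketched above can be made concrete. To be clear, this is an illustration and not the Alt-Design code: it substitutes java.util.concurrent's ArrayBlockingQueue, whose blocking put()/take() play the role of the explicit wait/notify pairs, for the hand-rolled circular buffer, and plain String events (with a hypothetical "END" marker) for real parser events.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustration of the buffered parser/tree-builder coupling described
// above; the queue and String events are stand-ins, not FOP's types.
public class PullBufferDemo {
    static final String END = "END"; // hypothetical end-of-document marker

    /** Push events from a "parser" thread, pull and count them here. */
    static int pump(String[] events) {
        // Bounded buffer: put() blocks when full and take() when empty,
        // which replaces the explicit notify/wait of the pseudocode.
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(64);
        Thread parser = new Thread(() -> {
            try {
                for (String e : events) buffer.put(e);
                buffer.put(END);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        });
        parser.start();
        int count = 0;
        try {
            // The tree builder *pulls* events at its own pace.
            for (String e = buffer.take(); !e.equals(END); e = buffer.take()) {
                count++;
            }
            parser.join();
        } catch (InterruptedException ie) {
            throw new IllegalStateException(ie);
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(pump(new String[] {
            "startElement:fo:root", "startElement:fo:page-sequence",
            "endElement:fo:page-sequence", "endElement:fo:root" }));
    }
}
```

The bounded capacity is exactly the throttle described: once the buffer fills, the parser thread stalls until the tree builder has consumed something.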

Of course, as I have mentioned recently.  And as I also said, the cost 
of parsing relative to the intensive downstream element processing of 
FOP is small.

If so, isn't it too early to optimize xml handling altogether? What 
would we benefit from moving from push to pull? Well, sort of automatic 
validation is a benefit indeed, but I'm not sure it's enough.

This is not an optimisation, but a fundamental design decision.  It's 
all or nothing.  See the comments about the feasibility of moving from 
one model to the other.

The whole question is context-dependent.  If you are engaged in the 
peephole processing of SUX you may be obliged to use external 
validation.  With top-down processing you have more choice, because 
your context is travelling with you.

btw, what about unexpected content model objects? Will this fail?
<fo:simple-page-master master-name="default">
  <fo:region-body/>
  <fo:block/>
</fo:simple-page-master>


Unexpected content models will throw an exception.  How that is handled 
is another question.  At the moment, while I am in a debugging phase, 
most exceptions just propagate up, but all the usual flexibility of the 
exception system is available for refinement.

Don't get me wrong here.  I'm not saying that external validation is 
wrong, merely that with a pull model, the need is reduced.  There may 
still be a strong case for it, but not as strong as with SUX.

You are right, and that btw allows us to make external validation optional 
and still have a reasonable level of validation for free.

[1] http://www.jcp.org/en/jsr/detail?id=173


It encourages me greatly that there is so much activity going on in this 
area.  Especially interesting is the Xerces XNI 
XMLPullParserConfiguration Interface.

Peter
--
Peter B. West  [EMAIL PROTECTED]  http://www.powerup.com.au/~pbwest/
Lord, to whom shall we go?


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, email: [EMAIL PROTECTED]



Re: Alt-Design status: XML handling

2002-11-25 Thread Oleg Tkachenko
Peter B. West wrote:


I don't believe is is only a matter of style.  I think the detrimental 
effects of push for general programming are glaringly obvious.
It's just event-driven processing; how could it be detrimental?


I haven't read the documentation, but it may be that they are referring 
to the infeasibility of moving code built around SAX to an XmlReader 
environment.
It's in the "Comparing XmlReader to SAX Reader" page[1]: "The push model can 
be built on top of the pull model. The reverse is not true."  Too 
categorical a statement, I think.

This is not an optimisation, but a fundamental design decision.  It's 
all or nothing.  See the comments about the feasibility of moving from 
one model to the other.
If so, we need more opinions from others.

[1] 
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpguide/html/cpconcomparingxmlreadertosaxreader.asp 

--
Oleg Tkachenko
eXperanto team
Multiconn Technologies, Israel


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, email: [EMAIL PROTECTED]



RE: Alt-Design status: XML handling

2002-11-25 Thread Rhett Aultman
Completely generalized and probably worthless response below. ;)

-Original Message-
From: Oleg Tkachenko [mailto:[EMAIL PROTECTED]]
Sent: Monday, November 25, 2002 4:01 PM
To: [EMAIL PROTECTED]
Subject: Re: Alt-Design status: XML handling


Peter B. West wrote:

 I don't believe is is only a matter of style.  I think the detrimental 
 effects of push for general programming are glaringly obvious.
It's just event-driven processing; how could it be detrimental?


I cannot speak for FOP, but I can speak in generalities about this.  The difference 
between event-based and pull-style is roughly the difference between using a garden 
hose and using a garden hose with one of those spray-gun nozzles on it.  In the former 
case, the water keeps coming out of the hose, pretty much whether you want it to or 
not.  In the latter case, the water comes out only when you want it, but it requires 
effort on your part.  The question is when to use each idea.

Generally, event-driven processing is a pretty good thing.  The critical issue with 
it, though, is the ratio of event production to event processing.  If that number is 
anything greater than 1, then more events are being produced in a stretch of time than 
can be effectively processed in that stretch of time.  Events start to queue up, 
taking up memory.  If it happens enough, the heap starts to get a little too full, the 
gc runs a little too much, and that causes processing time to suffer even further.  
Under most circumstances, event-based processing is like using a garden hose to water 
a bed of flowers.  It works just fine.  Under more intense cases, though, it can be 
more like using a garden hose to fill a small container of water, then leaving the 
hose laying around (spilling water all over the lawn) while the container gets carried 
off somewhere.

Comparatively, if a program decides to pull in more data to process, then there's an 
opportunity to control the amount that comes in at any given point.  This means that 
there's less (or no) need to worry about the rate at which data comes in, since it's 
turned on and off rather easily.  The amount of memory wasted is minimized (yes, I 
consider a wait queue to be a waste of memory, since it cannot be used for anything 
more productive), but the downside is that, of course, to keep the data streaming in 
for long periods of time tends to require continuous effort to tell the pulling system 
to pull in another chunk, much like how it takes effort to keep the valve open on a 
hose's spray gun.

There has been a time or two in my (admittedly, somewhat short) career as a developer 
where I've had cause to stop thinking in terms of an event system and instead work 
with a pull concept, and it was for the reason I gave- when an event source was 
allowed to generate events at its own pace, and the event handler took too long to 
process, the events piled up and performance suffered.  I'd expect a very similar 
situation in FOP.  SAX processing tends to fire a lot of events, and 
if FOP does a reasonable amount of processing work relative to the work needed to fire 
another event, then those events are piling up in memory and wasting space.  I can 
definitely see an argument for a pull-based system.  Also, I think that a push-model 
probably isn't going to scale as effectively to larger documents, where a pull system 
should have more constant performance regardless of document size.

Of course, take that with a mine of salt.


It's in the "Comparing XmlReader to SAX Reader" page[1]: "The push model can 
be built on top of the pull model. The reverse is not true."  Too 
categorical a statement, I think.


And, I believe, it might be wrong, though I must read the full source text.  The push 
model can be seen as a special case of a pull model in the sense of "pull everything 
ASAP, now and until the data is exhausted."  But, a pull model can be grafted onto a 
push model by implementing what amounts to a specialized buffer of the pushed data 
that accepts pull queries...no?


If so, we need more opinions from others.


My major interests lie in things happening above this layer, so I don't really have 
too much concern, but I definitely can see a good case for a pull-model.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, email: [EMAIL PROTECTED]




Re: Alt-Design status: XML handling

2002-11-25 Thread Peter B. West
Oleg Tkachenko wrote:

Peter B. West wrote:


I don't believe is is only a matter of style.  I think the detrimental 
effects of push for general programming are glaringly obvious.

It's just event-driven processing; how could it be detrimental?


I may have referred to Dijkstra (R.I.P.) here before.  I think it was he 
who illustrated the importance of appropriate representations by 
reference to Roman numerals.  In the Middle Ages, before the coming of 
Arabic numerals and zero, long division was considered do-able, though 
very difficult, and was taught to the foolhardy at universities.  As I 
recall the story, the topic was computer languages, and the moral was: 
if you use a tool appropriate to the problem you are trying to solve, 
life will be much easier.

As for the selection of a language, so for the selection of a processing 
model.  Event-driven processing is appropriate to event-driven systems. 
 A traffic control system is an event-driven system, as is an operating 
system; processing an xsl:fo document is not.  The variability of xsl:fo 
processing is constrained within carefully defined hierarchical limits.

This shows in the simple-page-master debate.  Why has this generally 
been implemented in violation of the spec, while I picked that violation 
up the first time I ran against a variant file?  The children are 
determined by the parent, not the other way around.  So within an 
instance of simple-page-master, I expect the first child to be a 
region-body.  Following that, I expect a region-before, but I am not 
upset if it's not there.  Etc.  These relationships are quite naturally 
expressed in a manner that echoes the hierarchical ordering of the document.

How is this done with SAX?  Nodes are created without context - they 
just happen.  The node must grope around to find its parent, and the 
virtual tree is constructed from the children up.  The parent basically 
only gets control when its own endElement() event occurs.
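Here is a hedged sketch of what the parent-driven alternative looks like in code. Every name in it (PullParseDemo, expect(), the string event names) is a hypothetical stand-in, not Alt-Design's actual API; the point is only the shape: the parent pulls its expected children in spec order, with region-body required and the other regions optional.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of parent-driven pull parsing: the simple-page-master pulls
// the children it expects, in the order the spec defines them.
public class PullParseDemo {
    private final List<String> events; // pending start-element events
    private int pos = 0;

    PullParseDemo(List<String> events) { this.events = events; }

    /** Consume the next event if it matches; fail only if required. */
    boolean expect(String name, boolean required) {
        if (pos < events.size() && events.get(pos).equals(name)) {
            pos++;
            return true;
        }
        if (required) {
            throw new IllegalStateException("expected " + name);
        }
        return false; // optional child absent: carry on
    }

    /** region-body is required; region-before/region-after optional. */
    static String parseSimplePageMaster(List<String> input) {
        PullParseDemo p = new PullParseDemo(input);
        StringBuilder seen = new StringBuilder();
        if (p.expect("fo:region-body", true))    seen.append("body ");
        if (p.expect("fo:region-before", false)) seen.append("before ");
        if (p.expect("fo:region-after", false))  seen.append("after ");
        return seen.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(parseSimplePageMaster(
                Arrays.asList("fo:region-body", "fo:region-after")));
    }
}
```

Compare this with the SAX inversion: here the parent's expectations drive the reads, so context never has to be groped for, and a missing required child fails immediately at the point where it was expected.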

I haven't read the documentation, but it may be that they are 
referring to the infeasibility of moving code built around SAX to an 
XmlReader environment.

It's in the "Comparing XmlReader to SAX Reader" page[1]: "The push model can 
be built on top of the pull model. The reverse is not true."  Too 
categorical a statement, I think.

Having read the reference, I agree.


This is not an optimisation, but a fundamental design decision.  It's 
all or nothing.  See the comments about the feasibility of moving from 
one model to the other.

If so, we need more opinions from others.


True enough for the HEAD line.  But FOP_0-20-0_Alt-Design will continue 
on the same track.  I have been working on it alone for nearly two years 
now, and for a year before it was even allowed into the code base.  Part 
of what I was doing was pure experiment, which I was prepared to 
abandon, but much is there because I believe in it, including the pull code.

I don't have to persuade a boss, in advance, that my approach is right. 
 I just have to persuade myself.  Then I can let the code do the 
talking.  It's called Open Source development.

[1] 
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpguide/html/cpconcomparingxmlreadertosaxreader.asp 

Given the interest in pull APIs for XML, another advantage of my code is 
that, when a low-level pull processor becomes available, it can be 
incorporated into my design with a minimum of fuss for greater efficiency.

Peter
--
Peter B. West  [EMAIL PROTECTED]  http://www.powerup.com.au/~pbwest/
Lord, to whom shall we go?


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, email: [EMAIL PROTECTED]



Re: Alt-Design status: XML handling

2002-11-25 Thread Peter B. West
Rhett,

To comment on only two aspects of your posting.

Rhett Aultman wrote:


-Original Message-
From: Oleg Tkachenko [mailto:[EMAIL PROTECTED]]


Generally, event-driven processing is a pretty good thing.  The critical issue with it, though, is the ratio of event production to event processing.  If that number is anything greater than 1, then more events are being produced in a stretch of time than can be effectively processed in that stretch of time.  Events start to queue up, taking up memory.  If it happens enough, the heap starts to get a little too full, the gc runs a little too much, and that causes processing time to suffer even further.  Under most circumstances, event-based processing is like using a garden hose to water a bed of flowers.  It works just fine.  Under more intense cases, though, it can be more like using a garden hose to fill a small container of water, then leaving the hose laying around (spilling water all over the lawn) while the container gets carried off somewhere.


This is not a problem for at least the maintenance version of the code. 
 All of the processing is triggered by incoming SAX events, and occurs 
within the SAX callbacks.  These are synchronous events, so the parsing 
stalls until the callback returns.  Page-sequence rendering, e.g., 
occurs within the endElement() callback of an fo:page-sequence element.


It's in the "Comparing XmlReader to SAX Reader" page[1]: "The push model can 
be built on top of the pull model. The reverse is not true."  Too 
categorical a statement, I think.


And, I believe, it might be wrong, though I must read the full source text.  The push model can be seen as a special case of a pull model in the sense of Pull everything ASAP, now and until the data is exhausted.  But, a pull model can be grafted onto a push model by implementing what amounts to a specialized buffer of the pushed data that accepts pull queries...no?


Which is what I have done.

Peter
--
Peter B. West  [EMAIL PROTECTED]  http://www.powerup.com.au/~pbwest/
Lord, to whom shall we go?


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, email: [EMAIL PROTECTED]




Re: Alt-Design status: XML handling

2002-11-24 Thread Oleg Tkachenko
Peter B. West wrote:


Why is it easier for developers to use?  Is it because the API is less 
complex or more easily understood?  Not really.  As you point out, the 
SAX API is not all that complex.  The problem is that the processing 
model of SUX is completely inverted.
Well, I believe it's more a philosophical question, or a question of 
programming style: push vs pull, imperative languages vs declarative languages, 
etc., the ancient holy war. One likes to define rules, aka SAX handlers; 
another likes to weave a web of if statements, only to be able to control 
processing order ;) Both pull and push have pros and cons, and it's a pity 
Java still doesn't have a full-fledged pull parsing API (btw, James Clark is 
working on StAX[1], so it's a matter of time).

 You may have come to like writing 
XSLT that way.
It's the only way to write non-hello-world stylesheets in XSLT, actually. Don't 
forget, XSLT is a declarative language, so probably analogies with Java are 
just irrelevant; they are different beasts. The question is what is good for 
the fo tree building stuff? Probably you're right, pull is more suitable, but the 
bad thing is that the real input is a SAX stream, hence we must translate push to 
pull (funnily enough, MS considers this task unfeasible in the XmlReader 
documentation). Hence the next question is the cost of your interim buffer; what 
do you think its peak and average size could be?

Of course, as I have mentioned recently.  And as I also said, the cost 
of parsing relative to the intensive downstream element processing of 
FOP is small.
If so, isn't it too early to optimize xml handling altogether? What would we 
benefit from moving from push to pull? Well, sort of automatic validation is a 
benefit indeed, but I'm not sure it's enough.

The whole question is 
context-dependent.  If you are engaged in the peephole processing of SUX 
you may be obliged to use external validation.  With top-down processing 
you have more choice, because your context is travelling with you.
btw, what about unexpected content model objects? Will this fail?
<fo:simple-page-master master-name="default">
	<fo:region-body/>
	<fo:block/>
</fo:simple-page-master>


Don't get me wrong here.  I'm not saying that external validation is 
wrong, merely that with a pull model, the need is reduced.  There may 
still be a strong case for it, but not as strong as with SUX.
You are right, and that btw allows us to make external validation optional and 
still have a reasonable level of validation for free.

[1] http://www.jcp.org/en/jsr/detail?id=173
--
Oleg Tkachenko
eXperanto team
Multiconn Technologies, Israel


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, email: [EMAIL PROTECTED]



Re: Alt-Design status: XML handling

2002-11-23 Thread Peter B. West
Manuel...

Manuel Mall wrote:

Peter,

thanks for the update and explanation on your Alt-Design.

To be honest: I like it. Reminds me very much of my first exposure to
programming language processing (compilers) nearly 30 years ago: top-down
recursive-descent parsing for Pascal. I still think it's the best parsing
model around (beats YACC-type stuff by a long way) in terms of ease of
development / understanding / use.


Recursive descent is like magic, isn't it?  I agree that it's a very 
tidy approach, which I have used a few times.  What motivated me here, 
though, was just the desire to have the flow of processing follow the 
natural hierarchy of the data.  Such an approach starts with a 
guaranteed basis of algorithmic clarity; the alternative, it seems to 
me, starts with a guaranteed basis of obscurity.  That, certainly, is 
what I found when I tried to follow the logic trail through the code.

The other idea was the old unix principle of the pipeline.  Isolate the 
components and have them communicate via (possibly bi-directional) 
pipelines of data/commands/events.  This doesn't map very cleanly onto 
the processes that operate on the FO tree and the layout/Area trees, but 
it was just what I needed to invert the flow of control during FO tree 
building.

Do you have any similar simple / effective ideas for the layout part which,
following the discussions on this list, the new FOP design under CVS HEAD
seems to struggle most with?


There are good reasons why the layout is not susceptible to the same 
simple solution.  I do have a number of ideas to contribute, and when 
the web site is restored I will be referring to some of the notes I have 
made and posted there.

Peter
--
Peter B. West  [EMAIL PROTECTED]  http://www.powerup.com.au/~pbwest/
Lord, to whom shall we go?


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, email: [EMAIL PROTECTED]



RE: Alt-Design status: XML handling

2002-11-22 Thread Victor Mote
Victor Mote wrote:

 Oleg Tkachenko wrote:

  I think we should separate fo tree itself from the process of its
  building. fo
  tree structure is required and I agree with Keiron - it's not a
  DOM, it's just
  tree representation and I cherish the idea to make it an
  effectively small
  structure like saxon's internal tree. But any interim buffers should be
  avoided as much as possible (well, Peter's buffer seems not
  to be a burden).

 This is probably a philosophical difference. It seems to me that the area
 tree is built on the foundation of the fo tree, and that if we only get a
 brief glimpse of the fo tree as it goes by, not only does our foundation
 disappear, but we end up putting all of that weight into the
 superstructure,
 which tends to make the whole thing collapse.

Oleg:

After thinking about this a bit more, I think I confused this issue. I think
what you were saying is that the existing FOP FO tree /is/ the lightweight
data structure that you like. I see your point, and yes I agree, there is no
need to replace it with something heavier. My train of thought was in a
different direction -- i.e. how to get that structure written to disk when
necessary so that it doesn't all have to be in memory. I (think I) also had
a wrong conception of how long the FO tree data persisted. My apologies for
the confusion.

Victor Mote


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, email: [EMAIL PROTECTED]




Re: Alt-Design status: XML handling

2002-11-22 Thread Peter B. West
Victor Mote wrote:

Victor Mote wrote:



Oleg Tkachenko wrote:



I think we should separate fo tree itself from the process of its
building. fo
tree structure is required and I agree with Keiron - it's not a
DOM, it's just
tree representation and I cherish the idea to make it an
effectively small
structure like saxon's internal tree. But any interim buffers should be
avoided as much as possible (well, Peter's buffer seems not
to be a burden).


This is probably a philosophical difference. It seems to me that the area
tree is built on the foundation of the fo tree, and that if we only get a
brief glimpse of the fo tree as it goes by, not only does our foundation
disappear, but we end up putting all of that weight into the
superstructure,
which tends to make the whole thing collapse.



Oleg:

After thinking about this a bit more, I think I confused this issue. I think
what you were saying is that the existing FOP FO tree /is/ the lightweight
data structure that you like. I see your point, and yes I agree, there is no
need to replace it with something heavier. My train of thought was in a
different direction -- i.e. how to get that structure written to disk when
necessary so that it doesn't all have to be in memory. I (think I) also had
a wrong conception of how long the FO tree data persisted. My apologies for
the confusion.


Victor,

I will comment at greater length, later, on the issues you have raised, 
but I want to make some comments on the tree structures here.

Most people coming to FOP get confused by the fact that SAX is used for 
parsing.  They think in terms of a SAX/DOM dichotomy, and assume that, 
because we are using SAX, we have nothing like a DOM.  In fact, the FO 
tree is our DOM, or the first stage of our DOM.  In the beginning... the 
FO tree was always there while the area tree was being built, but Mark 
Lillywhite did some hacking to restrict the tree to the currently active 
page sequence.

As you point out, the FO tree provides the semantics of the layout.  The 
Area tree is an internal representation of the series of marks on the 
page.  If re-flowing is called for, the information from the FO tree is, 
once again, required.  In my opinion, that means that the FO tree has to 
be cached.  To be more precise, the FO tree has to be able to be cached. 
I envisage the layout engine feeding instructions back to the FO tree 
concerning subtrees; basically, delete subtree or cache subtree.  The 
layout engine knows whether the layout of a particular page or page 
sequence is firm or rubbery, and can instruct the FO tree accordingly.

Such decisions would be made very carefully in the layout engine.  Back 
in the mists of time, Arved noted that the page numbering problem could 
be minimised by allowing enough room for the page number worst case. 
That was a sensible restriction, but it implies a good guess about just 
what that worst case is going to be.  To get that completely right, you 
need to lay it all out.  In any case, if you have the ever-popular "Page 
x of y" in your static-content, you need to redo every page anyway. 
What the initial guess, if it's correct, circumvents, is the need to 
reflow every page, with all of its nightmarish implications.

This is a case for which the min/opt/max expressions of FOP were made.

Take a punt about last page number width.
Lay out the pages, using optimum.
Get to the end, with all page numbers resolved.
Go back and reflow lines/paragraphs as necessary, using the full min/max 
range to avoid page under/overflow.

(N.B. This won't entirely remove the need for backup and reflow in other 
circumstances.)
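[Editorial aside: a toy illustration of the min/opt/max fallback in the steps above. This is not FOP code; a "line" here is just a width budget, and each chunk carries an opt and a min width.]

```java
public class MinOptMaxSketch {

    /**
     * First pass: try the optimum widths.  If they overflow the budget,
     * fall back toward the minimum widths, i.e. let the min/max range
     * absorb the difference instead of reflowing onto another page.
     */
    static int layout(int[] opt, int[] min, int budget) {
        int total = 0;
        for (int w : opt) total += w;
        if (total <= budget) {
            return total;               // optimum fits, nothing to redo
        }
        total = 0;
        for (int w : min) total += w;   // shrink toward min
        return total;
    }

    public static void main(String[] args) {
        int[] opt = {10, 10, 10};
        int[] min = {8, 8, 8};
        System.out.println(layout(opt, min, 35));  // prints 30 (opt fits)
        System.out.println(layout(opt, min, 25));  // prints 24 (fell back to min)
    }
}
```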

I should point out here that I perceive the need for a third tree - a 
layout tree.  It parallels the layout managers, which themselves form a 
tree.  This is still a vague idea for me, but the layout tree would be 
the work-in-progress on the area tree.  It's necessary because much of 
the layout happens bottom-up, and at the bottom, layout is occurring 
which cannot go into the current page.  Firstly, you don't want to 
throw away the layout work that you have already done.  Secondly, after the 
page boundary slashes across the layout you have been engaged in, you 
want to be able to pick up all of the threads again at the beginning of 
the new page.  The layout tree formalises this procedure.  Read Jeffrey 
Kingston's Lout design document for some insight on this.

When I talk about the layout engine, I have in mind the process that 
builds the layout tree, and moves chunks as they are completed into the 
area tree.

Peter
--
Peter B. West  [EMAIL PROTECTED]  http://www.powerup.com.au/~pbwest/
Lord, to whom shall we go?


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, email: [EMAIL PROTECTED]



Re: Alt-Design status: XML handling

2002-11-21 Thread Nicola Ken Barozzi

Peter B. West wrote:
[...]


STATUS:

The XML pull buffering has been working for some considerable time.  I 
have simply been extending the get/expect methods on top of the simpler 
methods as I have found a requirement for them in building the FO tree.

In cases where the DTD is well known and well structured, XML pull is 
much easier to use than SAX.

For example, one can write an XSL stylesheet with templates or with many 
for-each constructs. SAX is similar to templates; XML pull is similar to for-each.
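[Editorial aside: to make the for-each analogy concrete, here is what the pull style looks like in Java, using the later StAX API (javax.xml.stream) as a stand-in for the buffering layer discussed in this thread. The caller drives the loop, so document order reads top to bottom in the code, instead of being scattered across callbacks.]

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class PullForEach {

    /** Collect start-element names in document order, pull-style. */
    static List<String> startElements(String xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> names = new ArrayList<String>();
        while (r.hasNext()) {                       // the caller asks...
            if (r.next() == XMLStreamConstants.START_ELEMENT) {
                names.add(r.getLocalName());        // ...and consumes
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<simple-page-master>"
                   + "<region-body/><region-before/>"
                   + "</simple-page-master>";
        System.out.println(startElements(xml));
        // prints [simple-page-master, region-body, region-before]
    }
}
```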

Having worked so much with SAX stuff, I can say that in many cases SAX 
is effectively a PITA, as DOM is for some, and as XML pull is too for some.

If the code proves to be easier to understand and write, it will be 
easier to fix and maintain, so this option should IMHO be given strong 
consideration.

My 2c

--
Nicola Ken Barozzi   [EMAIL PROTECTED]
- verba volant, scripta manent -
   (discussions get forgotten, just code remains)
-





RE: Alt-Design status: XML handling

2002-11-21 Thread Victor Mote
Peter B. West wrote:

 quote
...
 Echoing sentiments recently expressed in this publication, Clark said
 that SAX, though efficient, was very hard to use, and that DOM had
 obvious limitations due to the requirement that the document being
 processed be in memory. He suggested that what was needed was a standard
 pull API, one that efficiently allowed random access to XML documents.

First, thanks for the update on your work -- I understand what you are doing
a little better. Second, the statement above about random access almost
jumped out at me, because I had exactly the same thought earlier today while
contemplating a thread on the XSL-FO list which discussed processing of long
documents and memory constraints related to them. The closest thing to a
perfect document processing system that I have come across is FrameMaker,
which seems to be able to handle pretty large documents with a pretty small
footprint. I don't know for sure, but it seems to me that the area tree
(if you will) is written to disk, and pages can be efficiently jumped to in
an arbitrary manner. The WYSIWYG editor is essentially a viewport on the
portion of the document in memory, which is itself a subset of the disk
document. As you edit the document, I presume that events are sent to
something akin to a layout manager, which has to do something with them.
Now, in our case, we need to not only have random access to the area tree,
but also to the fo tree.

What follows is my feeble attempt to reconcile some of these issues.

The issue with SAX as I see it, is that because it is one-way, and our
processing is not (I think the standard calls it non-linear), we
presumably have to essentially build our own DOM-ish (random access) things
in order to get the job done. I wonder if we don't end up reinventing the
wheel in frustration with that approach. From a cleanliness of design
standpoint at least, it seems much more straightforward to instead use a
DOM-based approach and write chunks of the two DOMs to disk where necessary.
I haven't thought through whether java.io.RandomAccessFile or a regular
database or some other alternative would be the way to go. The LMs can be
totally protected from all of this by abstracting both the FO and Area
Documents -- in other words, they work with abstract nodes on trees and
don't care what was required to make them available.

Oddly enough, once I have the stability of the DOMs to work from (perhaps
this is more felt than real), an event-based approach seems much more
natural -- like imitating a word processor. In fact, if done properly,
another project could conceivably use FOP as the layout engine for a WYSIWYG
editor. Actually I have been trying to quantify & grasp two processing
models that come to mind: 1) the word-processing model, an event-based
model, and 2) an 18th-century typesetter manually laying out pages, which is
much more of a look-ahead, measure-it-to-see-how-it-fits-before-placing-it
model.

These two models roughly correspond to the two processing models I mentioned
the other day ("I am text, place me somewhere" vs. "I am a page with room,
place something on me"). The second model requires the 2-pass approach. The
first fits either a push or a pull approach (since we can manufacture events
if we need to), the second is definitely pull. When I wrote about those two
models, I was frankly leaning heavily toward the 2nd approach, but I think I
am changing my mind. To explain why, I need to have you forget for a moment
about our SAX-based input (I'll come back to that). Forget also about
performance for a moment, and picture the typesetter setting type one
character at a time, with no thought of what the next character or image
is -- in other words, setting type just like a user sitting at Microsoft
Word does. If the typesetter comes to a concept that messes his previous
work up, he has to yank a line of type out, or perhaps an entire page out,
and replace them. However, (and this is the key point), he eventually will
get the job done. In other words, when abstracted this way, the only benefit
to a look-ahead /should be/ performance. Consider our auto table layout
problem. If on the 350th page of the table, I find an item that requires me
to change the width of the columns, which in turns changes the layout of all
350 pages, yes, I am going to burn up a few cycles to accomplish that, but I
/should/ be able to get it done.

So far all I have done is loosely reconciled these two processing models.
The next thing I want to do is to try to compare these two models with FOP's
layout process. If I like the event-based model, then maybe I ought to like
FOP's approach. Let me go first to my 18th-century typesetter. Each time he
has to tear out a line or page of type, he can go back to his manuscript
(his FO document, if you will) to rebuild them. Similarly in a word
processor, I presume that Microsoft Word must have some concept that the 2
lines at the top of page 84 are in the same paragraph as the 3 

RE: Alt-Design status: XML handling

2002-11-21 Thread Keiron Liddle
On Thu, 2002-11-21 at 12:43, Victor Mote wrote:
 To conclude, if I were designing this system from scratch, based on what I
 know right now, I would:
 1. Use DOM for both the fo tree & the area tree.

I don't know whether I would call it a DOM but the area tree is an
independent data structure that contains all the information necessary
to render the document.


 2. Write them to disk when necessary, hiding all of this from the layout
 managers.

This has already been done for the area tree. I use the
CachedRenderPagesModel all the time. If it cannot render the page
immediately then it saves it to disk. The layout managers only know
about adding areas to a page and then adding the page to the area tree.
For rendering it can dispose of the page once rendered, for awt viewer
it could save all pages to disk and load when necessary.

As described here (written a long time ago and needs updating):
http://xml.apache.org/fop/design/areas.html
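[Editorial aside: the save-to-disk behaviour Keiron mentions can be caricatured in a few lines. CachedRenderPagesModel itself writes temp files; in this sketch a byte array stands in for the file, and an ordinary String stands in for a page.]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class PageCacheSketch {

    /** Serialize a page out of the heap when it cannot be rendered yet. */
    static byte[] evict(Serializable page) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(page);
        oos.close();
        return bos.toByteArray();   // the live page object may now be dropped
    }

    /** Load the page back once its forward references are resolved. */
    static Object reload(byte[] stored) throws Exception {
        ObjectInputStream ois =
                new ObjectInputStream(new ByteArrayInputStream(stored));
        return ois.readObject();
    }

    public static void main(String[] args) throws Exception {
        byte[] stored = evict("page 42, waiting on a forward reference");
        System.out.println(reload(stored));
        // prints: page 42, waiting on a forward reference
    }
}
```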

I don't see why you would need all the fo tree available; each page
sequence is independent for the flow and often each page can be finished
before the next page.

 3. Use an event-based layout mechanism so that the fo tree doesn't even have
 to be there to get layout work done.

Depends exactly what you mean, but I think that is the general idea. Care
to implement it?

 I am sure I can be talked out of this by someone smarter, but I wanted to
 lay out the whole line of reasoning. My apologies to Peter and anyone else
 who may have been working on these points before. I am just now getting
 around to them.
 
 After further consideration, my use of event-based above may be too
 strong. Probably what I mean is more along the lines of API-based. In a
 WYSIWYG environment, the event would probably trigger an API action, but
 that action could be invoked another way as well. I am too tired to rewrite
 it -- I hope you know what I mean.
 
 This final thought is really a question which was briefly addressed during
 our recent weekend clarification about the role of the maintenance branch,
 and which I wish to apply specifically to the above thoughts. Does or could
 the new design give us the ability to (with say, a configuration option)
 choose between Layout Philosophy A and B? By this I mean 2 (or more) layout
 packages coexisting in the same code base, and sharing common resources that
 can be selected (configurable perhaps). If so, then we can play with these
 ideas at our leisure, compare them in various ways, transition between them
 if necessary, and maybe even keep both to be used in various circumstances.
 I think someone (Jeremias perhaps) had indicated that something along these
 lines would be possible, but that may have been at a lower level than what I
 am discussing here.

This should be quite simple to do.
There would be a basic interface set for the layout managers when being
created by the fo tree. We could possibly have a common one for inline
objects.
The actual layout implementation could then be changed.

Again, this will need to be implemented.
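[Editorial aside: a minimal sketch of the interface seam Keiron describes, with all names hypothetical. The fo tree asks a factory for layout managers and never learns which philosophy's implementation it got, so two layout packages can coexist and be selected by configuration.]

```java
public class PluggableLayout {

    /** The seam: fo-tree code talks only to these interfaces. */
    interface LayoutManager {
        String describe();
    }

    interface LayoutManagerFactory {
        LayoutManager makeBlockManager();
    }

    /** Hypothetical config switch between two layout philosophies. */
    static LayoutManagerFactory choose(String philosophy) {
        if ("B".equals(philosophy)) {
            return new LayoutManagerFactory() {
                public LayoutManager makeBlockManager() {
                    return new LayoutManager() {
                        public String describe() {
                            return "philosophy B block manager";
                        }
                    };
                }
            };
        }
        return new LayoutManagerFactory() {
            public LayoutManager makeBlockManager() {
                return new LayoutManager() {
                    public String describe() {
                        return "philosophy A block manager";
                    }
                };
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(choose("B").makeBlockManager().describe());
        // prints: philosophy B block manager
    }
}
```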


 I don't mean to rock the boat. I guess I am kind of like a three-year-old
 who asks why and why not all of the time to the annoyance of all around
 him -- I am still trying to learn the basics. Thanks for your patience.

I keep getting the feeling that everyone is saying the current design is
wrong and here is a better idea, which turns out to be the same as the
current design.

When will people start doing it?








Re: Alt-Design status: XML handling

2002-11-21 Thread Oleg Tkachenko
Victor Mote wrote:


The issue with SAX as I see it, is that because it is one-way, and our
processing is not (I think the standard calls it non-linear), we
presumably have to essentially build our own DOM-ish (random access) things
in order to get the job done.

I think we should separate the fo tree itself from the process of its building. 
The fo tree structure is required, and I agree with Keiron - it's not a DOM, it's just 
a tree representation, and I cherish the idea to make it an effectively small 
structure like Saxon's internal tree. But any interim buffers should be 
avoided as much as possible (well, Peter's buffer seems not to be a burden).

To conclude, if I were designing this system from scratch, based on what I
know right now, I would:
1. Use DOM for both the fo tree  the area tree.

Bad idea, I believe. DOM is a heavyweight, versatile representation of an xml 
document (recall entities, pi's etc. as nodes), while we need an effective and 
lightweight structure to hold fo/area tree information. DOM has a lot of 
synchronization stuff, while our trees are actually almost read-only.
Ahh stop, probably you didn't mean w3c DOM?

--
Oleg Tkachenko
eXperanto team
Multiconn Technologies, Israel





Re: Alt-Design status: XML handling

2002-11-21 Thread Peter B. West
Oleg,

...

Oleg Tkachenko wrote:

Peter B. West wrote:


taking a very isolated path.  My motivation can be summed up in the 
slogan SAX SUX.  I couldn't understand why anyone would persist with 
it for any complex tasks, e.g. FOP.

Actually I cannot say I fully agree with this, because I don't see 
anything complex in the SAX processing model. And being an xslt fan I'm 
obviously a push-model fan.

...

significant difference makes XmlReader much easier to use for most 
Microsoft developers that are used to working with firehose 
(forward-only/read-only) cursors in ADO.


Well, lets consider pull model pros and contras:
+ easy to use by developer
+ benefits by kind of structural validation
+ more?


Why is it easier for developers to use?  Is it because the API is less 
complex or more easily understood?  Not really.  As you point out, the 
SAX API is not all that complex.  The problem is that the processing 
model of SUX is completely inverted.  You may have come to like writing 
XSLT that way.  You may be working with very general grammars, and have 
no other choice.  That doesn't make the inverted, inside-out model any 
more natural for the expression of processes and algorithms.  Easier 
for developers to use means an easier vocabulary for the expression and 
solution of programming models and problems; it means easier to 
document, easier to read and understand, easier to maintain and extend 
(in the sense of adding functionality).


- it glues processing to a particular xml structure, which is not so bad 
for vocabularies with a well-defined and stable content model. The 
question is whether xsl-fo is such a kind of vocabulary. I think it 
isn't. As a matter of fact xsl-fo is even inexpressible in a dtd or schema, 
besides the possibility of extensions.

I think that a W3C Recommendation qualifies as a well-defined and stable 
vocabulary.  Hmm.  Well, you know what I mean.  It changes only 
infrequently, the changes are well-defined, and are going to involve 
changes, possibly major, to many parts of the code base anyway.  It 
certainly cannot be described as a dynamic vocabulary.


- is there performance penalty? I used to think that easy to use stuff 
always costs something.

Of course, as I have mentioned recently.  And as I also said, the cost 
of parsing relative to the intensive downstream element processing of 
FOP is small.  Obviously, you would look at optimising that as much as 
possible.

- more?


Note also that the structure of the code does its own validation.  It 
generates the simple-page-master subtree according to the content model

(region-body,region-before?,region-after?,region-start?,region-end?)

That's good, but unfortunately it's not full-fledged validation. Too much 
home-grown validation is bad, I believe. If we need validation it must be done 
by a specialized validation module, and validation should not be scattered 
throughout the whole code.

Much of the validation of FOP has to be self-validation anyway, because 
so many of the constraints are context-dependent.  The whole question is 
context-dependent.  If you are engaged in the peephole processing of SUX 
you may be obliged to use external validation.  With top-down processing 
you have more choice, because your context is travelling with you.

Don't get me wrong here.  I'm not saying that external validation is 
wrong, merely that with a pull model, the need is reduced.  There may 
still be a strong case for it, but not as strong as with SUX.
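[Editorial aside: the self-validation Peter describes falls straight out of the code shape. A toy version of the get/expect idiom over the simple-page-master content model; expect() and optional() here are invented stand-ins for the alt-design buffer methods, and a queue of element names stands in for the event buffer.]

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

public class ContentModelSketch {

    private final Deque<String> events;

    ContentModelSketch(String... elementNames) {
        this.events = new ArrayDeque<String>(Arrays.asList(elementNames));
    }

    /** Required child: consume it or fail -- the failure *is* the validation. */
    void expect(String name) {
        if (!name.equals(events.peek())) {
            throw new IllegalStateException(
                    "expected " + name + " but got " + events.peek());
        }
        events.poll();
    }

    /** Optional child ('?' in the content model): consume only on a match. */
    void optional(String name) {
        if (name.equals(events.peek())) {
            events.poll();
        }
    }

    /** (region-body,region-before?,region-after?,region-start?,region-end?) */
    void parseSimplePageMaster() {
        expect("region-body");
        optional("region-before");
        optional("region-after");
        optional("region-start");
        optional("region-end");
    }

    public static void main(String[] args) {
        new ContentModelSketch("region-body", "region-after")
                .parseSimplePageMaster();   // conforms: no exception
        System.out.println("ok");
    }
}
```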

And final question - what's wrong with SAX besides of possible complexity?


Isn't that enough?

Peter
--
Peter B. West  [EMAIL PROTECTED]  http://www.powerup.com.au/~pbwest/
Lord, to whom shall we go?






RE: Alt-Design status: XML handling

2002-11-21 Thread Victor Mote
Keiron Liddle wrote:

 On Thu, 2002-11-21 at 12:43, Victor Mote wrote:
  To conclude, if I were designing this system from scratch,
 based on what I
  know right now, I would:
  1. Use DOM for both the fo tree & the area tree.

 I don't know whether I would call it a DOM but the area tree is an
 independent data structure that contains all the information necessary
 to render the document.


  2. Write them to disk when necessary, hiding all of this from the layout
  managers.

 This has already been done for the area tree. I use the
 CachedRenderPagesModel all the time. If it cannot render the page
 immediately then it saves it to disk. The layout managers only know
 about adding areas to a page and then adding the page to the area tree.
 For rendering it can dispose of the page once rendered, for awt viewer
 it could save all pages to disk and load when necessary.

 As described here (written a long time ago and needs updating):
 http://xml.apache.org/fop/design/areas.html

OK, I just went back & reread it. There is still something I don't
understand  I'll get to that in a minute. First, let me say that perhaps
the better way for me to learn this would be to follow it in a debugger. I'm
not too lazy to do that, and if /these issues/ are working pretty well right
now, then that is probably what I should be doing -- just say the word.

Here is what (after reading the doc & considering your comments) my thick
head doesn't yet grasp -- when we say the page is "cached", I understood
that to mean that it is immutably laid out, but that it can't be rendered
yet because some previous page cannot be finally laid out yet. What I am
trying to address is the situation, like the auto table layout, where
something I see while trying to lay out page 5000 changes pages 150-4999 as
well. I have to now push some rows or lines from 150 to 151, which triggers
pushing some lines from 151 to 152, etc. So the first question is whether
the cached pages are changeable or unchangeable. If changeable, then we
should be able to deal with arbitrarily long documents and (assuming some
reasonable amount of basic memory) not worry about running out of memory --
constrained only by disk space.

The second question that I am trying to grasp is, if the cached pages are
changeable, then how do we know enough to change those 4850 pages properly
without having our fo tree at hand unless we duplicate the information from
the fo tree in the area tree.

 I don't see why you would need all the fo tree available, each page
 sequence is independent for the flow and often each page can be finished
 before the next page.

You are correct for the current use-case. I have jumped a bit past that into
trying to make room for other use-cases that might require the fo to be
changed and serialized (the WYSIWYG). Setting that issue aside for the
moment, let me rephrase the question, because this is really the huge big
issue that makes me uneasy with SAX. Don't we need random access to the fo
tree for the current page sequence?
  * If so, then, using SAX, don't we have to duplicate that same
information in the area tree to be able to handle rebuilding
4850 pages?
  * If not, then, in a big-picture way, how do we go about
rebuilding 4850 pages?

  3. Use an event-based layout mechanism so that the fo tree
 doesn't even have
  to be there to get layout work done.

 Depends exactly what you mean but I think that is the general idea, care
 to implement it.

OK, I see where I was not clear. In my mind, if there is no fo tree to tie
the pieces of the area tree to, you basically have to build one. This is why
I brought up Word & FrameMaker -- their object models keep the organization
of the document (analogous to our fo tree) intact. My theory is that we
eventually hurt ourselves by trying to avoid this. The difference is that
they have to serialize that organization, and we don't, at least for our
current use case. However, perhaps because I am still confused about our
general strategy for dealing with the ripple-effect of downstream changes,
their model seems to be a good one. I am envisioning an area tree that
contains no text at all, but whose objects merely point to nodes & offsets &
sizes in the fo tree. So, for example, Line object l contains an array of
LineSegment objects, one of whose contents comes from an FOText node,
starting at offset 16, size 22. Not only is my text there, but so is most of
my font and formatting information. What I have is two different views of
the same data -- one that is more structural and the other the specific
layout. I have no problem (in our current use-case) with throwing away
page-sequences from the fo tree and area tree to free up memory as we go.
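[Editorial aside: Victor's pointer-based area tree, taken literally. FOText and LineSegment as written here are hypothetical shapes, not the real FOP classes: the characters live once on the fo-tree side, and the area side records only (node, offset, length).]

```java
public class PointerAreaSketch {

    /** fo-tree side: owns the actual characters. */
    static class FOText {
        final String chars;
        FOText(String chars) { this.chars = chars; }
    }

    /** area-tree side: no text of its own, just a pointer into the fo tree. */
    static class LineSegment {
        final FOText source;
        final int offset;
        final int length;

        LineSegment(FOText source, int offset, int length) {
            this.source = source;
            this.offset = offset;
            this.length = length;
        }

        /** Two views of the same data: resolve the structural view to text. */
        String resolve() {
            return source.chars.substring(offset, offset + length);
        }
    }

    public static void main(String[] args) {
        FOText node = new FOText("the quick brown fox jumps over the lazy dog");
        LineSegment seg = new LineSegment(node, 16, 9);
        System.out.println(seg.resolve());   // prints: fox jumps
    }
}
```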

 Does or could
  the new design give us the ability to (with say, a configuration option)
  choose between Layout Philosophy A and B? By this I mean 2 (or
 more) layout
  packages coexisting in the same code base, and sharing common
 resources that
  can be selected 

RE: Alt-Design status: XML handling

2002-11-21 Thread Victor Mote
Oleg Tkachenko wrote:

 I think we should separate fo tree itself from the process of its
 building. fo
 tree structure is required and I agree with Keiron - it's not a
 DOM, it's just
 tree representation and I cherish the idea to make it an
 effectively small
 structure like saxon's internal tree. But any interim buffers should be
 avoided as much as it's possible (well, Peter's buffer seems not
 to be a burden).

This is probably a philosophical difference. It seems to me that the area
tree is built on the foundation of the fo tree, and that if we only get a
brief glimpse of the fo tree as it goes by, not only does our foundation
disappear, but we end up putting all of that weight into the superstructure,
which tends to make the whole thing collapse.

  To conclude, if I were designing this system from scratch,
 based on what I
  know right now, I would:
  1. Use DOM for both the fo tree  the area tree.
 Bad idea, I believe. DOM is heaviweight versatile representation of xml
 document (recall entities, pi's etc nodes), while we need effective and
 lightweight structure to hold fo/area tree information. DOM has a lot of
 synchronization stuff, while our trees are almost read-only actually.
 Ahh stop, probably you didn't mean w3c DOM?

You and Keiron are right -- this is a classic example of using an
implementation where an interface would be much better. When I say DOM, what
I should say is some randomly-accessible view of the entire tree.
Certainly, if there is a lighter-weight alternative than DOM that works for
the task at hand, that is better.

Victor Mote






RE: Alt-Design status: XML handling

2002-11-20 Thread Manuel Mall
Peter,

thanks for the update and explanation on your Alt-Design.

To be honest: I like it. Reminds me very much of my first exposure to
programming language processing (compilers) nearly 30 years ago: top-down
recursive-descent parsing for Pascal. I still think it's the best parsing
model around (beats YACC-type stuff by a long way) in terms of ease of
development / understanding / use.

Do you have any similar simple / effective ideas for the layout part which,
following the discussions on this list, the new FOP design under CVS HEAD
seems to struggle most with?

Manuel





Re: Alt-Design status: XML handling

2002-11-20 Thread Bertrand Delacretaz
Great work Peter!
It makes a lot of sense to use higher-level than SAX events, and thanks for 
explaining this so clearly.

If you allow me a suggestion regarding the structure of the code: maybe using 
some table-driven stuff instead of the many if statements in 
FoSimplePageMaster would be more readable?

Something like:

class EventHandler
{
  EventHandler(String regionName,boolean discardSpace,boolean required)
  ...
}

/** table of event handlers that must be applied, in order */
EventHandler[] handlers = {
  new EventHandler(FObjectNames.REGION_BODY,true,true),
  new EventHandler(FObjectNames.REGION_BEFORE,true,false)
};

...then, in FoSimplePageMaster(...) loop over handlers and let them process 
the events.

I don't know if this applies in general but it might be clearer to read and 
less risky to modify.

-Bertrand
