On Jan 16, 2008, at 08:38, Jeremias Maerki wrote:
On 16.01.2008 01:20:36 Andreas L Delmelle wrote:
At the moment, we always wait for an endPageSequence() call on the
AreaTreeHandler, which works fine for small to medium-sized
page-sequences, but is definitely not scalable to larger ones consisting
of a lot of FOs. I think we should take a look at implementing
(), for instance, or startFlow(). At those points, we are already
guaranteed to have at least the part of the FOTree that is necessary
to perform some basic preliminary layout (the *ahem* Pagination and
What's "preliminary layout"? I don't get it.
Sorry, my bad. I think I meant "layout preparation".
As in: the fo:layout-master-set is completely available, so a certain
(minimal) amount of empty pages could already be prepared here,
without knowing anything about the fo:flow or its descendants.
startPageSequence() would also be an option. The idea is to
initialize the PageSequenceLM as early as possible, where right now,
it does not even exist until the endPageSequence() event occurs. The
only benefit of waiting this long is that we have a guarantee that
no layout-work will be performed unless we are 100% certain that the
produced FO is valid. The downsides for the larger and more complex
documents, however, seem to outweigh this one benefit, especially if
you take into account that a later page-sequence may still cause the
document to fail...
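
To make the idea a bit more concrete, here is a minimal sketch of what "initialize the layout manager at startPageSequence()" could look like. It is only an illustration under assumptions: SketchPageSequenceLM and EarlyInitHandler are hypothetical stand-ins, not FOP's actual PageSequenceLayoutManager or AreaTreeHandler, and preparePages()/finishLayout() are invented method names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a page-sequence layout manager (not FOP's class). */
class SketchPageSequenceLM {
    /** Records the order in which the layout steps were triggered. */
    final List<String> log = new ArrayList<>();

    /** Work that only needs the layout-master-set, e.g. preparing empty pages. */
    void preparePages() {
        log.add("prepare");
    }

    /** Work that needs the complete page-sequence. */
    void finishLayout() {
        log.add("finish");
    }
}

/** Hypothetical event handler: the LM exists from the start event onwards. */
class EarlyInitHandler {
    private SketchPageSequenceLM current;

    /** Create the LM as early as possible instead of at the end event. */
    void startPageSequence() {
        current = new SketchPageSequenceLM();
        current.preparePages();
    }

    /** Finish the layout with the LM that was created at the start event. */
    SketchPageSequenceLM endPageSequence() {
        SketchPageSequenceLM done = current;
        done.finishLayout();
        current = null; // drop the reference so the LM can be collected
        return done;
    }
}
```

The trade-off mentioned above still applies: in this arrangement some layout work has already been done by the time an invalid FO is detected later in the same page-sequence.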
For the record: this is a general problem that I have already
encountered in a lot of XML applications. Enormously large XML files
lead to trouble, since the applications are DOM-based behind the
scenes. The only reason FOP is able to handle fo-files that cannot
even be opened in a lot of XML-editors, simply due to memory-
limitations, is precisely that it avoids creating a DOM for the
entire FO. We already use a nice combination of both approaches, but
it still offers room for expansion.
startFlow/endFlow doesn't help at all. That only excludes the static
content from the page-sequence. One flow could still be huge.
Indeed, I agree that this wouldn't help, if it's only restricted to
OTOH, startFlow/endFlow are called for static-contents too (see
Other handlers that could turn out to be interesting to implement
(but I'm guessing this to be interesting only for the flow, not for
* endExternalGraphic() / endInstreamForeignObject()
* endInlineContainer() [... ;-) ...] / endBlockContainer()
* endInline() / endBlock()
In fact, there are a whole lot more. The idea is obviously not to
have the PageSequenceLM run over the entire page-sequence multiple
times, but to have the next childLM continue where the last one left
off. If the area addition is started in yet another thread, I think,
it would even become possible to release/GC parts of the FOTree (have
the LMs dereference their FO) long before we even reach the first
endPageSequence() during parsing.
The key here would be to have mechanisms to limit memory consumption. If
the FO is built up faster than the layout engine can consume it you
still haven't gained anything.
Very good point!
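
One common way to keep a fast producer from outrunning a slow consumer is a bounded queue between the two. The sketch below assumes the FO side and the layout side run in different threads and exchange finished "chunks"; BoundedHandoff and its String payload are hypothetical and only stand in for whatever unit of the FOTree would actually be handed over.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical hand-off point between FO building and layout. */
class BoundedHandoff {
    private final BlockingQueue<String> pending;

    BoundedHandoff(int capacity) {
        // The fixed capacity is what caps the amount of unprocessed FO data.
        pending = new ArrayBlockingQueue<>(capacity);
    }

    /** Parser side: blocks while the queue is full, so parsing slows down. */
    void publish(String foChunk) throws InterruptedException {
        pending.put(foChunk);
    }

    /** Parser side, non-blocking: returns false if the queue is full. */
    boolean tryPublish(String foChunk) {
        return pending.offer(foChunk);
    }

    /** Layout side, non-blocking: returns null if nothing is waiting. */
    String tryTake() {
        return pending.poll();
    }
}
```

With a capacity of 2, a third tryPublish() is refused until the layout side has taken a chunk, which is exactly the throttling effect we would want from the parser.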
Smells like a lot of thread
synchronization and complexity if you do it the multi-threading way.
Even single-threaded, the complexity would grow again because there
would be more interaction between the different parts of FOP.
The complexity can be kept at the strict minimum by limiting the
number of threads to an amount you can count on one hand (5 max.),
which would incidentally also place a limit on memory-consumption.
At least, it would prevent OOMErrors due to 20 page-sequences being
processed at the same time, but we could still run out of memory due
to 2 or 3 page-sequences of 100 pages each.
Besides that, a larger number of threads would be worse for
performance *and* would make the whole thing too difficult to debug
and maintain. (remember the initial PropertyCache I committed, with
the way-too-many CleanerThreads... a headache to debug, and a
performance bottleneck: this should obviously be avoided :/)
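
As a rough illustration of the "5 threads max." idea, a fixed-size pool from java.util.concurrent would do the capping for us. This is a sketch under assumptions: CappedLayoutPool is a hypothetical name, and the sleep() is just a placeholder for the layout of one page-sequence.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: runs page-sequence tasks on at most MAX_THREADS threads. */
class CappedLayoutPool {
    static final int MAX_THREADS = 5;

    /** Runs one dummy task per page-sequence and returns the peak concurrency seen. */
    static int runAndMeasurePeak(int pageSequences) {
        ExecutorService pool = Executors.newFixedThreadPool(MAX_THREADS);
        AtomicInteger active = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        for (int i = 0; i < pageSequences; i++) {
            pool.execute(() -> {
                int now = active.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(5); // placeholder for the actual layout work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    active.decrementAndGet();
                }
            });
        }
        pool.shutdown(); // accept no new tasks, let the queued ones finish
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }
}
```

Submitting 20 page-sequences this way never has more than 5 in progress at once. As said above, that only bounds the number of page-sequences being processed, not the memory each one needs.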
Furthermore, you need to know exactly when you can release an FO tree
or layout object, i.e. when you're absolutely sure that you won't need
it anymore. Currently, the first inline FO in a page-sequence is
still in memory even if the layout engine is already on page 234.
Yep, I know. At one time, I tried (very simply) to clear FOText.ca in
TextLayoutManager, since TextLM duplicates the array upon
If I remember correctly, when the areas are added, the original
FOText.ca is still referenced, so I ended up with a
Suddenly I'm thinking we'd also need to take care that we don't
enforce this, since there are definitely use-cases (a 'live' FO
editor), where it becomes necessary/desirable to maintain the entire
FOTree at all times (or at least a link between the original FO and
the generated Area). For those cases, instead of releasing the
objects, we could consider serialization/deserialization. To disk, or
even, as I seem to remember being suggested in a Bugzilla report
(1063), by using a dedicated database engine (think: optional
dependency on Apache Derby, and some relatively straightforward JDBC
The latter could turn out to be an important feature, since dedicated
application servers on which FOP runs usually appreciate any spare
byte of disk space and as little unnecessary stress on disk I/O as
possible so it can be reserved for heap-swapping.
I'm not entirely sure yet, but I have a vague feeling that Simon's
Interleaved_Page_Line_Breaking branch will be quite beneficial in
getting this right (or may even be the key to making the whole thing
feasible in the first place).
His work is the precondition for that to become possible in the
I agree, especially after just reading Simon's response.
I appreciate you starting this discussion but I think it's slightly
Thought so too, but then I started dreaming again... :-)
Anyway, thanks for the feedback!