On Jan 16, 2008, at 08:38, Jeremias Maerki wrote:

Hi Jeremias

On 16.01.2008 01:20:36 Andreas L Delmelle wrote:
<snip />
At the moment, we always wait for an endPageSequence() call on the
AreaTreeHandler, which works fine for small to medium-sized
page-sequences, but is definitely not scalable to larger ones consisting
of a lot of FOs. I think we should take a look at implementing
endFlow(), for instance, or startFlow(). At those points, we are already
guaranteed to have at least the part of the FOTree that is necessary
to perform some basic preliminary layout (the *ahem* Pagination and
Layout FOs).

What's "preliminary layout"? I don't get it.

Sorry, my bad. I think I meant "layout preparation".
As in: the fo:layout-master-set is completely available, so a certain (minimal) number of empty pages could already be prepared here, without knowing anything about the fo:flow or its descendants. startPageSequence() would also be an option. The idea is to initialize the PageSequenceLM as early as possible; right now, it does not even exist until the endPageSequence() event occurs. The only benefit of waiting this long is the guarantee that no layout work will be performed unless we are 100% certain that the produced FO is valid. For larger and more complex documents, however, the downsides seem to outweigh this one benefit, especially if you take into account that a later page-sequence may still cause the document to fail...
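
To make the idea a bit more concrete, here is a minimal sketch of "create the layout manager on the start event, finish it on the end event". Everything in it is hypothetical: PageSeqLM and Handler are stand-ins I made up for illustration and do not mirror FOP's actual classes or signatures. The only assumption is that page preparation depends on the layout-master-set alone, while laying out the flow still has to wait for content.

```java
import java.util.ArrayList;
import java.util.List;

public class EarlyInitSketch {

    /** Hypothetical stand-in for the PageSequenceLM; not FOP's actual class. */
    public static final class PageSeqLM {
        final List<String> steps = new ArrayList<>();

        // Depends only on the layout-master-set, so it can run before any flow content exists.
        void prepareEmptyPages() {
            steps.add("prepare-pages");
        }

        // Depends on the flow content, so it still has to wait for it.
        void layoutFlow() {
            steps.add("layout-flow");
        }
    }

    /** Hypothetical event handler; the method names only echo the events discussed here. */
    public static final class Handler {
        private PageSeqLM current;

        public void startPageSequence() {
            // The layout manager exists from the start event on, not only at the end event.
            current = new PageSeqLM();
            current.prepareEmptyPages();
        }

        public boolean isLayoutManagerReady() {
            return current != null;
        }

        public List<String> endPageSequence() {
            current.layoutFlow();
            List<String> done = current.steps;
            current = null; // drop the reference so the layout manager can be collected
            return done;
        }
    }

    public static void main(String[] args) {
        Handler handler = new Handler();
        handler.startPageSequence();
        System.out.println(handler.isLayoutManagerReady());
        System.out.println(handler.endPageSequence());
    }
}
```

The sketch deliberately ignores validation: any work done at the start event may be wasted if the page-sequence later turns out to be invalid, which is exactly the trade-off described above.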

For the record: this is a general problem that I have already encountered in a lot of XML applications. Enormously large XML files lead to trouble, since the applications are DOM-based behind the scenes. The only reason FOP is able to handle FO files that cannot even be opened in a lot of XML editors, simply due to memory limitations, is precisely that it avoids creating a DOM for the entire FO. We already use a nice combination of both approaches, but it still offers room for expansion.

startFlow/endFlow doesn't help at all. That only excludes the static
content from the page-sequence. One flow could still be huge.

Indeed, I agree that this wouldn't help, if it's only restricted to those two. OTOH, startFlow/endFlow are called for static-contents too (see fo.pagination.StaticContent#startOfNode()/endOfNode()).

Other handlers that could turn out to be interesting to implement
(but I'm guessing this to be interesting only for the flow, not for
the static-content):
* endExternalGraphic() / endInstreamForeignObject()
* endInlineContainer() [... ;-) ...] / endBlockContainer()
* endInline() / endBlock()

In fact, there are a whole lot more. The idea is obviously not to
have the PageSequenceLM run over the entire page-sequence multiple
times, but to have the next childLM continue where the last one left
off. If the area addition is started in yet another thread, I think,
it would even become possible to release/GC parts of the FOTree (have
the LMs dereference their FO) long before we even reach the first
endPageSequence() during parsing.

The key here would be to have mechanisms to limit memory consumption. If
the FO is built up faster than the layout engine can consume it you
still haven't gained anything.

Very good point!
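
One generic way to get such a limit is a bounded hand-off queue between the thread that builds the FO tree and the thread that lays it out: when the queue is full, the producer simply blocks. The sketch below only illustrates that technique with java.util.concurrent; FoChunk, the END marker and the capacity are invented for the example and have nothing to do with FOP's real classes.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedHandoff {

    /** Hypothetical stand-in for a finished piece of the FO tree; not a FOP class. */
    static final class FoChunk {
        final int id;

        FoChunk(int id) {
            this.id = id;
        }
    }

    // Marker object that tells the consumer no more chunks will follow.
    private static final FoChunk END = new FoChunk(-1);

    /** Pushes chunkCount chunks through a queue of the given capacity; returns how many were consumed. */
    public static int run(int chunkCount, int capacity) {
        BlockingQueue<FoChunk> queue = new ArrayBlockingQueue<>(capacity);
        int[] consumed = {0};

        // Consumer: stands in for the layout engine.
        Thread layout = new Thread(() -> {
            try {
                while (true) {
                    FoChunk chunk = queue.take();
                    if (chunk == END) {
                        break;
                    }
                    consumed[0]++; // placeholder for the actual layout work
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        layout.start();

        try {
            // Producer: stands in for the parser. put() blocks while the queue is full,
            // so at most `capacity` chunks are ever waiting in memory.
            for (int i = 0; i < chunkCount; i++) {
                queue.put(new FoChunk(i));
            }
            queue.put(END);
            layout.join(); // join() also makes the consumer's count visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while handing off chunks", e);
        }
        return consumed[0];
    }

    public static void main(String[] args) {
        System.out.println(run(1000, 8));
    }
}
```

Note that this only bounds what sits between the two stages. It does nothing for the objects the layout engine itself keeps alive.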

Smells like a lot of thread
synchronization and complexity if you do it the multi-threading way.
Even single-threaded, the complexity would grow again because there will
be more interaction between the different parts of FOP.

The complexity can be kept at the strict minimum by limiting the number of threads to an amount you can count on one hand (5 max.), which would incidentally also place a limit on memory-consumption. At least, it would prevent OOMErrors due to 20 page-sequences being processed at the same time, but we could still run out of memory due to 2 or 3 page-sequences of 100 pages each. Besides that, a larger number of threads would be worse for performance *and* would make the whole thing too difficult to debug and maintain. (remember the initial PropertyCache I committed, with the way-too-many CleanerThreads... a headache to debug, and a performance bottleneck: this should obviously be avoided :/)
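
As a rough illustration of the "5 max." idea, a fixed-size pool from java.util.concurrent caps how many jobs run at once. The sketch assumes one job per page-sequence; the class name, the dummy job body and the peak counter are made up for the example and are not FOP code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class CappedLayoutPool {

    // The "count on one hand" limit from the discussion.
    public static final int MAX_THREADS = 5;

    /** Runs `jobs` dummy layout jobs and returns the highest number seen running at once. */
    public static int peakConcurrency(int jobs) {
        ExecutorService pool = Executors.newFixedThreadPool(MAX_THREADS);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < jobs; i++) {
                futures.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(5); // placeholder for laying out one page-sequence
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                    }
                }));
            }
            // Wait for every job to finish before reading the peak.
            for (Future<?> future : futures) {
                future.get();
            }
        } catch (Exception e) {
            throw new IllegalStateException("layout job failed", e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println(peakConcurrency(20) <= MAX_THREADS);
    }
}
```

The pool's internal work queue is unbounded, so this caps concurrency but not the amount of queued work, which matches the caveat above: a few very large page-sequences can still exhaust the heap.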

Furthermore, you need to know exactly when you can release an FO tree
or layout object, i.e. when you're absolutely sure that you won't need
it anymore. Currently, the first inline FO in a page-sequence is kept in
memory even if the layout engine is already on page 234.

Yep, I know. At one time, I tried (very simply) to clear FOText.ca in TextLayoutManager, since TextLM duplicates the array upon initialization. If I remember correctly, when the areas are added, the original FOText.ca is still referenced, so I ended up with a NullPointerException...

Suddenly I'm thinking we'd also need to take care that we don't enforce this, since there are definitely use-cases (a 'live' FO editor), where it becomes necessary/desirable to maintain the entire FOTree at all times (or at least a link between the original FO and the generated Area). For those cases, instead of releasing the objects, we could consider serialization/deserialization. To disk, or even, as I seem to remember being suggested in a Bugzilla report (1063), by using a dedicated database engine (think: optional dependency on Apache Derby, and some relatively straightforward JDBC code). The latter could turn out to be an important feature, since dedicated application servers on which FOP runs usually appreciate any spare byte of disk space and as little unnecessary stress on disk I/O as possible so it can be reserved for heap-swapping.
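
For the to-disk variant, plain Java serialization would probably be the simplest starting point. The following is only a sketch under that assumption: FoSnapshot is a hypothetical, cut-down stand-in (a name plus a character array), not one of FOP's FO classes, and whether those could be made Serializable without pain is an open question.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FoSpill {

    /** Hypothetical, simplified snapshot of an FO node; not a FOP class. */
    public static final class FoSnapshot implements Serializable {
        private static final long serialVersionUID = 1L;
        public final String name;
        public final char[] text;

        public FoSnapshot(String name, char[] text) {
            this.name = name;
            this.text = text;
        }
    }

    /** Writes the snapshot to a temporary file so the in-memory copy can be released. */
    public static Path spill(FoSnapshot snapshot) {
        try {
            Path file = Files.createTempFile("fo-", ".ser");
            try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
                out.writeObject(snapshot);
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the snapshot back and removes the temporary file. */
    public static FoSnapshot restore(Path file) {
        try {
            FoSnapshot snapshot;
            try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
                snapshot = (FoSnapshot) in.readObject();
            }
            Files.delete(file); // delete only after the stream has been closed
            return snapshot;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("snapshot class not found", e);
        }
    }

    public static void main(String[] args) {
        Path file = spill(new FoSnapshot("fo:block", "some text".toCharArray()));
        FoSnapshot back = restore(file);
        System.out.println(back.name + ": " + new String(back.text));
    }
}
```

A database-backed variant would replace the two stream calls with JDBC inserts and selects of the same bytes; the calling code would not need to know the difference.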

I'm not entirely sure yet, but I have a vague feeling that Simon's
Interleaved_Page_Line_Breaking branch will be quite beneficial in
getting this right (or may even be the key to making the whole thing
feasible in the first place).

His work is the precondition for that to become possible in the first

I agree, especially after just reading Simon's response.

I appreciate you starting this discussion but I think it's slightly too

Thought so too, but then I started dreaming again... :-)

Anyway, thanks for the feedback!


