Re: FOP and large documents: out of memory

Vincent Hennebert Thu, 14 Jan 2010 04:05:52 -0800

Hi Stephan,

I’m not sure I would invest any energy into improving the
CachedRenderPagesModel (-conserve option). It doesn’t look like the
right approach to me, and like you noticed it doesn’t even work out of
the box currently.

Why store the Area Tree on disk? Why not directly render it into the
final output format? If the latter supports out-of-order pages, then
that’s great; otherwise we may as well store the final pages and order
them later on when the document is complete, instead of storing them in
a half-finished area tree format.

As for pages that hold unresolved references, and so obviously can’t be
rendered yet: there usually aren’t so many of them that the area tree
solution would be vastly superior to a final-format one in terms of
memory consumption. Those pages could be kept in memory until all the
references they hold are resolved.
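
A minimal sketch of that idea, in Python rather than FOP's Java to keep
it short. The class name DeferredRenderer, the emit callback and the
add_page signature are all hypothetical and are not part of FOP; the
point is only that a page is handed to the final renderer as soon as
every id it references is known, and held in memory otherwise.

```python
class DeferredRenderer:
    """Hold back only the pages whose references are not yet resolved.

    Hypothetical illustration; none of these names exist in FOP.
    """

    def __init__(self, emit):
        self.emit = emit      # callback that renders one finished page
        self.known = {}       # id -> page number where the id was defined
        self.pending = []     # (page number, ids the page references)

    def add_page(self, number, defined_ids, referenced_ids):
        # Record the ids that this page defines.
        for item_id in defined_ids:
            self.known[item_id] = number
        self.pending.append((number, set(referenced_ids)))

        # Emit every held page whose references are now all resolved.
        still_waiting = []
        for page_number, refs in self.pending:
            if all(ref in self.known for ref in refs):
                self.emit(page_number)
            else:
                still_waiting.append((page_number, refs))
        self.pending = still_waiting


emitted = []
renderer = DeferredRenderer(emitted.append)
renderer.add_page(1, ["toc"], ["ch1", "ch2"])  # held: ch1, ch2 unknown
renderer.add_page(2, ["ch1"], [])              # rendered immediately
renderer.add_page(3, ["ch2"], ["ch1"])         # releases page 1, then 3
```

Pages come out in the order 2, 1, 3 here, which is why the final format
either has to accept out-of-order pages or the finished pages have to
be reordered at the end.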

Also, the handling of forward references is currently less than optimal.
The resolution is made in the area tree instead of looping back to the
layout engine. ATM, a page-reference is rendered using a placeholder
string (‘MMM’), and that placeholder is later replaced with the actual
value (e.g., ‘5’). This is fine for constructs like tables of contents,
but may produce ugly results if the page-number-citation is inside
a paragraph, ruining the even spacing. What’s the point of implementing
a high-quality line-breaking algorithm if its output is spoiled by
a poor handling of page citations?

I think the two-pass approach is the best long-term solution, although
obviously less trivial. One challenge is to detect a possible infinite
loop. For example: the referenced item is at the beginning of page IX,
the reference is updated to IX, which takes less room than MMM, so the
document is re-laid out and the referenced item moves to page VIII;
the reference must be updated again, the document is laid out again and
the referenced item ends up on page IX again. And again, and again...
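
One way such a loop could be detected is to remember every set of page
labels that has already been tried and stop as soon as one recurs. The
sketch below is Python, not FOP code; resolve_references, toy_layout
and the max_passes limit are hypothetical names, and toy_layout merely
mimics the IX/VIII example above by letting the label width decide the
page.

```python
def resolve_references(layout, initial, max_passes=10):
    """Re-run layout until the page labels stabilise or start repeating.

    layout:  hypothetical callable, {ref_id: label} -> {ref_id: label}
    initial: placeholder labels for the first pass, e.g. {"item": "MMM"}
    """
    seen = []                 # label assignments already tried
    current = dict(initial)
    for _ in range(max_passes):
        result = layout(current)
        if result == current:
            return result, "converged"
        if result in seen:
            # Same assignment as an earlier pass: laying out again
            # would only repeat the cycle.
            return result, "cycle"
        seen.append(current)
        current = result
    return current, "gave up"


def toy_layout(labels):
    # Illustrative only: a label of three or more characters pushes the
    # item onto page IX, a shorter one lets it move back to page VIII.
    return {"item": "IX" if len(labels["item"]) >= 3 else "VIII"}


outcome = resolve_references(toy_layout, {"item": "MMM"})
```

Once a cycle is detected, some policy is still needed to pick a result,
for instance keeping the wider label so that the text never overflows.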

One possible workaround for your use case is to generate your document
once, with a dummy TOC and just “Page X” in the footers, into the
intermediate format; parse it to get the total number of pages and the
page numbers for each element of the TOC; then re-generate it with
hardcoded values for the page references.
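
The parsing step could look roughly like the Python sketch below. Note
that the XML here is a simplified, made-up stand-in: the element and
attribute names (document, page, number, block, id) are assumptions and
are not the real intermediate format, so they would have to be adapted
to what FOP actually writes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the first-pass output.
SAMPLE = """<document>
  <page number="1"><block id="toc"/></page>
  <page number="2"><block id="chapter1"/></page>
  <page number="3"><block id="chapter2"/><block id="chapter2.1"/></page>
</document>"""


def collect_page_numbers(xml_text):
    """Return (total page count, {element id: first page it appears on})."""
    root = ET.fromstring(xml_text)
    pages = root.findall("page")
    first_page_of = {}
    for page in pages:
        page_number = int(page.get("number"))
        for element in page.iter():
            element_id = element.get("id")
            # Keep only the first page on which each id occurs.
            if element_id is not None and element_id not in first_page_of:
                first_page_of[element_id] = page_number
    return len(pages), first_page_of


total_pages, page_of = collect_page_numbers(SAMPLE)
```

The resulting total and id-to-page mapping would then be written into
the source as literal text for the second run, in place of the page
citations.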

HTH,
Vincent

Stephan Thesing wrote:
> Hello,
> 
> as is well-known, FOP can run out of heap memory, when large documents
> are processed (http://xmlgraphics.apache.org/fop/0.95/running.html#memory).
> 
> I have the situation that the documents I have to process mandate a footer on 
> each page that contains a "page X of Y" element and a TOC at the
> beginning of the document, i.e. FOP cannot lay out the pages until all
> referenced page-citations are known, which is after the last page of the 
> document.
> 
> When page content is quite complicated (e.g. 2000 pages mostly full with 
> tables), the heap space does not suffice to hold all pages until all 
> references can be resolved, thus FOP aborts with out-of-memory.
> 
> Since increasing the heap space does not always work (3 GB heap space was 
> required in one example), I need a better solution for this.
> 
> 1. "-conserve" option
> One alternative would be the "-conserve" option, which serializes the pages 
> to disk and reloads them as needed.
> Although slow, this definitely would be a solution, if it worked, which it 
> doesn't:
>  Our documents include graphics (SVG, PNG), and the serialization with 
> "-conserve" throws an exception, because some class in Batik is not 
> serializable (e.g. "SVGOMAnimatedString" IIRC), thus the page is missing, 
> causing FOP to abort later.
> Thus, Batik would have to be fixed for this.
> 
> 2. Two passes
> Since the pages are kept because of unresolved references, one could do the
> same as e.g. LaTeX always did: process the document twice.
> In a first run, pages are discarded after layout, only the references for 
> page-citations are kept and at the end reused for the second pass
> (when all pages for the citations are finally known).
> For the second run, these id-refs are initially loaded and no pages have
> to be kept.
> This would require more changes in FOP (and should definitely be made 
> optional obviously).
> 
> 
> 
> I would appreciate any comments or other suggestions !
> 
> 
> Best regards
>   Stephan
