Hi Craig,

Before you start working on that, look at

https://issues.apache.org/bugzilla/show_bug.cgi?id=47314

Ben Wuest did some work there on starting to render before a page sequence is finished.

Regards,

Georg Datterl

------ Contact ------

Georg Datterl

Geneon media solutions gmbh
Gutenstetter Straße 8a
90449 Nürnberg

HRB Nürnberg: 17193
Managing Director: Yong-Harry Steiert

Tel.: 0911/36 78 88 - 26
Fax: 0911/36 78 88 - 20

www.geneon.de

Other members of the Willmy MediaGroup:

IRS Integrated Realization Services GmbH:    www.irs-nbg.de
Willmy PrintMedia GmbH:                            www.willmy.de
Willmy Consult & Content GmbH:                 www.willmycc.de


-----Original Message-----
From: Craig Ringer [mailto:[email protected]]
Sent: Friday, 10 September 2010 10:14
To: [email protected]
Cc: Georg Datterl
Subject: Re: AW: Memory Leak issue -- FOP

On 09/10/2010 03:44 PM, Georg Datterl wrote:
> Hi Hamed,
>
> I did some pretty large publications with lots of images. 1500 pages
> took 2 GB of memory, after I put some effort into memory optimization.
> The only FOP-related issue I found was image caching, and that can be
> disabled. I'm quite sure I would have found a memory leak in FOP,
> especially one related to ordinary LayoutManagers. So either make your
> page-sequences shorter or give FOP more memory.

I can't help but wonder if FOP needs to keep the whole page sequence in
memory, at least for PDF output. Admittedly I haven't verified that it
*is* keeping everything in RAM, but that's certainly a whole lot of RAM
for a moderate-sized document.

I've been meaning to look at how FOP is doing its PDF generation for a
while, but I've been head-down trying to finish a web-based UI for work
first. I do plan to look at it though, as I've done a fair bit of work on
PDF generation libraries and I'm curious about how FOP is doing it (and
how much wheel-reinvention might be going on).

Anyway, PDF is *designed* for streaming output, so huge PDFs can be
produced using only very small amounts of memory with a bit of thought
into how the output works. I've had no issues generating
multi-hundred-megabyte PDF documents with very small amounts of RAM
using PoDoFo, a C++ PDF library that supports direct-to-disk PDF generation.

There are all sorts of tricks you can do. The most important is of
course that you can make backward or forward indirect references to
almost any object, with no constraints on object order in the document. You can
write whatever you generate out very aggressively. You can even split
your content stream(s) for each page into multiple segments so you can
write the content stream out when it gets too big. Or write the content
stream to a tempfile, then merge it into the PDF after the other
resources for the page have been written.
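
To make the forward-reference idea concrete, here is a minimal sketch in
Python. It is not how FOP or PoDoFo do it; StreamingPdfWriter, reserve(),
write_object() and build_demo() are names made up for this illustration.
The writer emits each object as soon as it is handed over and keeps only
one byte offset per object for the cross-reference table at the end.

```python
import io


class StreamingPdfWriter:
    """Hypothetical writer: objects go to `out` at once, only offsets stay in RAM."""

    def __init__(self, out):
        self.out = out
        self.pos = 0
        self.offsets = {}  # object number -> byte offset of its "N 0 obj" line
        self.next_id = 1
        self._write(b"%PDF-1.4\n")

    def _write(self, data):
        self.out.write(data)
        self.pos += len(data)

    def reserve(self):
        """Hand out an object number now; the object itself may be written much later."""
        obj_id = self.next_id
        self.next_id += 1
        return obj_id

    def write_object(self, obj_id, body):
        """Write one indirect object and remember where it landed."""
        self.offsets[obj_id] = self.pos
        self._write(b"%d 0 obj\n" % obj_id + body + b"\nendobj\n")

    def write_stream(self, obj_id, data):
        header = b"<< /Length %d >>\nstream\n" % len(data)
        self.write_object(obj_id, header + data + b"\nendstream")

    def finish(self, root_id):
        """Write the cross-reference table and trailer."""
        missing = [i for i in range(1, self.next_id) if i not in self.offsets]
        if missing:
            raise ValueError("reserved but never written: %r" % missing)
        xref_pos = self.pos
        size = self.next_id
        self._write(b"xref\n0 %d\n" % size)
        self._write(b"0000000000 65535 f \n")
        for i in range(1, size):
            # Each entry is exactly 20 bytes, as the xref format requires.
            self._write(b"%010d 00000 n \n" % self.offsets[i])
        self._write(
            b"trailer\n<< /Size %d /Root %d 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (size, root_id, xref_pos)
        )


def build_demo():
    buf = io.BytesIO()
    w = StreamingPdfWriter(buf)
    catalog_id, pages_id, page_id, content_id = (w.reserve() for _ in range(4))
    # The page dictionary goes out first and points at objects not yet written.
    w.write_object(
        page_id,
        b"<< /Type /Page /Parent %d 0 R /MediaBox [0 0 612 792]"
        b" /Resources << >> /Contents %d 0 R >>" % (pages_id, content_id),
    )
    w.write_stream(content_id, b"0 0 m 100 100 l S")
    w.write_object(pages_id, b"<< /Type /Pages /Kids [%d 0 R] /Count 1 >>" % page_id)
    w.write_object(catalog_id, b"<< /Type /Catalog /Pages %d 0 R >>" % pages_id)
    w.finish(catalog_id)
    return buf.getvalue(), w.offsets
```

In build_demo() the page is written before its content stream, its parent
and the catalog, and the file still resolves because the xref table maps
object numbers to offsets regardless of order. With a real file in place
of the BytesIO, resident memory grows only with the number of objects,
not with their size.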

There should be no need for image caching, because once you've written
the image object to the PDF once, you can just reference it again in
later pages. Not only does that save RAM but it makes your PDF smaller
and faster. It works even if your image is used in different sizes,
scales, etc. in different parts of the document, because you can crop and
scale using content-stream instructions.
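
As a rough sketch of what that looks like (Python again; ImageRegistry,
place_image() and place_cropped() are hypothetical names, not anything in
FOP): the image object is emitted the first time it is seen, and every
later use is just a few content-stream operators. An image XObject fills
the unit square, so the `cm` matrix sets its size and position, and a
clipping rectangle (`re W n`) crops it.

```python
class ImageRegistry:
    """Remembers which images are already in the output file."""

    def __init__(self):
        self.names = {}  # source path -> resource name such as "Im1"
        self.writes = 0  # how many image objects were emitted

    def name_for(self, path):
        if path not in self.names:
            self.names[path] = "Im%d" % (len(self.names) + 1)
            # A real writer would emit the image XObject here, exactly once.
            self.writes += 1
        return self.names[path]


def place_image(name, x, y, width, height):
    """Operators that draw image `name` in a width x height box at (x, y)."""
    return "q %g 0 0 %g %g %g cm /%s Do Q\n" % (width, height, x, y, name)


def place_cropped(name, x, y, width, height, clip):
    """Same as place_image, but clipped to clip = (cx, cy, cw, ch) in page units."""
    cx, cy, cw, ch = clip
    return "q %g %g %g %g re W n %g 0 0 %g %g %g cm /%s Do Q\n" % (
        cx, cy, cw, ch, width, height, x, y, name,
    )
```

Calling name_for("logo.png") on page 1 and again on page 900 returns the
same resource name and bumps `writes` only once, so nothing about the
image needs to stay cached between the two pages apart from its name.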

You don't even have to keep the page dictionaries in RAM. You can write
them out when the page is done (or before). Because forward-indirect
references are permitted, if you have content on the page that's yet to
be generated you can reserve some object IDs for those content streams
and output indirect references to the as-yet nonexistent content streams
in the page dictionary.

About the only time I can think of when you have to keep something in
memory (or at least, in a tempfile) is when you have content in a page
(like total page counts) that cannot be generated until later in the
document - and may re-flow the rest of the page's content. If the
late-generated content won't force a reflow it can just be put in a
separate content stream with a forward-reference.
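
A simplified model of the "Page N of M" case, assuming the footer sits at
a fixed position and so cannot cause a reflow. render_document() is a
made-up function, and the dictionaries are abbreviated (no /Parent,
/MediaBox or fonts), so this shows only the write order, not a valid PDF.
Each page dictionary lists two content streams: the body, written
immediately, and a footer whose object number is reserved and filled in
once the page count is known.

```python
import itertools


def render_document(page_bodies):
    """Return (object number, body) pairs in the order they would hit the disk."""
    ids = itertools.count(1)
    written = []
    pending_footers = []  # (reserved object number, page number): tiny per page

    for number, body in enumerate(page_bodies, start=1):
        page_id, body_id, footer_id = next(ids), next(ids), next(ids)
        # /Contents may be an array; the reader treats the streams as one.
        written.append(
            (page_id, "<< /Type /Page /Contents [%d 0 R %d 0 R] >>" % (body_id, footer_id))
        )
        written.append((body_id, body))  # page content can leave memory here
        pending_footers.append((footer_id, number))

    total = len(page_bodies)
    for footer_id, number in pending_footers:
        text = "BT /F1 9 Tf 500 20 Td (Page %d of %d) Tj ET" % (number, total)
        written.append((footer_id, text))
    return written
```

For two pages the objects come out as 1, 2, 4, 5, 3, 6: both pages and
their bodies first, then the two footers. What stays in memory until the
end is one small tuple per page rather than the page content.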

Admittedly, I'm speaking only about the actual PDF generation. It may
well be that generating the area tree (AT) or intermediate format (IF)
is inherently demanding of resident RAM, or that the IF/AT don't contain
enough information to generate pages progressively.

The point, though, is that PDF output shouldn't use much RAM if the PDF
output code is using PDF features to make it efficient. Sometimes it's a
trade-off between how efficient the produced PDF is and how efficient
its creation is, but you can always post-process (optimize) a PDF once
it's generated if you want to do things like linearize it for fast web
loading.

--
Craig Ringer

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


