Howdy fopsters

Yesterday I spent about 10 hours with FOP, vi, and a memory profiler (JProbe 3 profiler - evaluation version, amazing product, but rather expensive :-{ ). I have some observations of the FOP code that I thought would be useful to share.

Firstly, I understand that FOP is being redesigned. I followed the thread there a bit and one of the main concerns, as I recall, was that FOP is hard for newbies (like me) to understand. I would say that this is not true. I actually found FOP reasonably easy to get around. Considering that yesterday was the first time I've managed to really get into the guts of it, and that I am probably familiar with only half a percent of the code base, I would say that generally FOP is pretty well written and reasonably straightforward. Perhaps there are areas of it that need refactoring and documenting (and formatting to 8 character tabs ;) but from what I saw of it ... I figured that any redesign would be equally impenetrable, as all large code bases are. Anyway, that's just my .02c from the 'newbie FOP programmer' perspective.

My goal for the profiling was to enable FOP to process large documents in a standard environment (64 MB heap on JDK 1.3.1 for Linux). I have a 4,500-page document (containing a very simple structure) and standard FOP dies at 600 pages. (By the end of the day it was up to over 3,000 pages.) I discovered the "-buf" argument and tried that, but it only extended the run to 900 pages and took about 3x as long. My eventual goal is fast processing of an unlimited number of pages, but I'll personally be happy at 10,000 slightly more complex pages.

Basically what I discovered with the profiler was that a large number of objects are being hung onto for a long time (obviously ;). In particular, the Block object holds a reference to a BlockArea object in a member field (blockArea), but the blockArea member is referenced only twice in other methods for simple field calls. Making the BlockArea a local variable for the layout(Area) method significantly reduces memory consumption. (I'll post patches, if desired, once the thread is complete.) In otherwise unmodified code, this change increased the number of pages I could process from 600 to over 1,000.
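The field-versus-local difference can be illustrated with a minimal sketch. This is not the FOP source: FieldBlock, LocalBlock and AreaStandIn are hypothetical stand-ins for Block and BlockArea, and the sketch only assumes that the formatting object outlives its layout call.

```java
public class BlockAreaScopeSketch {

    /** Hypothetical stand-in for a laid-out area; the real BlockArea is far richer. */
    public static class AreaStandIn {
        public final byte[] payload = new byte[1024];
    }

    /** Variant 1: the area is stored in a member field of the formatting object. */
    public static class FieldBlock {
        private AreaStandIn blockArea;

        public int layout() {
            // The area stays reachable through this object for as long as
            // the formatting object itself is reachable.
            blockArea = new AreaStandIn();
            return blockArea.payload.length;
        }

        public boolean holdsArea() {
            return blockArea != null;
        }
    }

    /** Variant 2: the area is a local variable of layout(). */
    public static class LocalBlock {
        public int layout() {
            // Once layout() returns, this object no longer references the
            // area, so it can be collected as soon as nothing else holds it.
            AreaStandIn blockArea = new AreaStandIn();
            return blockArea.payload.length;
        }
    }

    public static void main(String[] args) {
        FieldBlock fieldBlock = new FieldBlock();
        fieldBlock.layout();
        System.out.println("field variant still holds area: " + fieldBlock.holdsArea());

        LocalBlock localBlock = new LocalBlock();
        System.out.println("local variant laid out bytes: " + localBlock.layout());
    }
}
```

With thousands of block objects alive in the FO tree, the first variant keeps one area per block alive; the second keeps none once layout has finished.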

Another object that's hanging around a lot is the Page object. This one was more tricky and it took me ages to work out how to deal with it, but finally I discovered that the processing in FOP can apparently be pipelined. So I hacked FOP to pipeline the format->render cycle, so that as each page is formatted it is sent to the (PDF) renderer immediately. This hack allowed me to increase the page count to over 3,000 pages, but the problem is that the hack made the IDReferences always empty. More on that shortly.
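The idea of pipelining format->render can be sketched as follows. This is an illustration only, not the actual hack: PageStandIn, PageSink and CountingRenderer are hypothetical names, and the code uses current Java rather than JDK 1.3.1 for brevity.

```java
import java.util.ArrayList;
import java.util.List;

public class PipelineSketch {

    /** Hypothetical stand-in for a formatted page. */
    public static class PageStandIn {
        public final int number;

        public PageStandIn(int number) {
            this.number = number;
        }
    }

    /** Hypothetical callback that receives each page as soon as it is formatted. */
    public interface PageSink {
        void pageComplete(PageStandIn page);
    }

    /** Batch style: every page stays in memory until formatting has finished. */
    public static List<PageStandIn> formatAll(int pageCount) {
        List<PageStandIn> pages = new ArrayList<>();
        for (int i = 1; i <= pageCount; i++) {
            pages.add(new PageStandIn(i));
        }
        return pages;
    }

    /** Pipelined style: each page goes to the sink at once and is not retained here. */
    public static void formatStreaming(int pageCount, PageSink sink) {
        for (int i = 1; i <= pageCount; i++) {
            sink.pageComplete(new PageStandIn(i));
        }
    }

    /** Toy renderer that counts pages instead of writing PDF output. */
    public static class CountingRenderer implements PageSink {
        public int rendered;
        public int lastPage;

        @Override
        public void pageComplete(PageStandIn page) {
            rendered++;
            lastPage = page.number;
        }
    }

    public static void main(String[] args) {
        CountingRenderer renderer = new CountingRenderer();
        formatStreaming(3000, renderer);
        System.out.println("rendered " + renderer.rendered + " pages");
    }
}
```

In the batch style the peak number of live pages grows with the document; in the pipelined style it is bounded by whatever the sink chooses to keep.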

It seems to me from inspection and my limited knowledge of what I'm doing, that it should be possible to completely pipeline FOP at the level of the fo:page-sequence, without major changes to FOP. My experience so far with pipelining the format-render steps was very positive (it worked first time!) and I am interested in looking at formatting immediately after receiving the </fo:page-sequence>. I also noted in my travels through the code that PDFRenderer seems to hang onto a lot of stuff that it could probably write out immediately, if it had a stream to write to. So, all this raises a few questions:

* How to deal with the IDReferences? (let the renderer deal with it? do two passes?). I don't fully understand the full purpose and implementation of IDReferences at the moment but at a guess it's used to resolve forward references in the FO file...? I'm sure that it's possible to at least optimize this.

* Is there something in XSL-FO that means I can't process on receipt of </fo:page-sequence>?

* Am I just being stupid?
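On the first question, one general way to handle forward references in a single pass is to park each unresolved reference and fill it in when the target turns up. The sketch below assumes that IDReferences maps an id to the page where it lands, which is only a guess; ForwardRefSketch and its methods are hypothetical and are not FOP's API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ForwardRefSketch {

    /** Hypothetical callback invoked once the page number of an id is known. */
    public interface Listener {
        void resolved(String id, int pageNumber);
    }

    // Ids whose page number is already known.
    private final Map<String, Integer> known = new HashMap<>();
    // References seen before their target id was defined.
    private final Map<String, List<Listener>> pending = new HashMap<>();

    /** Registers interest in an id; answers at once if the id is already known. */
    public void reference(String id, Listener listener) {
        Integer page = known.get(id);
        if (page != null) {
            listener.resolved(id, page);
            return;
        }
        pending.computeIfAbsent(id, key -> new ArrayList<>()).add(listener);
    }

    /** Records where an id landed and notifies every reference waiting on it. */
    public void define(String id, int pageNumber) {
        known.put(id, pageNumber);
        List<Listener> waiting = pending.remove(id);
        if (waiting != null) {
            for (Listener listener : waiting) {
                listener.resolved(id, pageNumber);
            }
        }
    }

    /** Number of distinct ids that are referenced but not yet defined. */
    public int unresolvedCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        ForwardRefSketch refs = new ForwardRefSketch();
        refs.reference("chapter2", (id, page) -> System.out.println(id + " is on page " + page));
        refs.define("chapter2", 42);
    }
}
```

Only the small waiting entries stay in memory, not the pages themselves. Whether a renderer can patch a page after it has been handed over is a separate question that this sketch does not answer.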

The benefits to this approach as I see it are:

* Probably not too many changes to FOP internals

* Will use significantly less memory even for small jobs

* Will increase the number of applications for FOP

I'm interested in continuing this line of experimental work if anyone's interested. As always, kudos to the developers; I spent a very pleasant Saturday with your code.

Mark Lillywhite
