On 08 Jun 2011, at 17:15, Michael Rubin wrote:

Hi Mike

> Hello there. Thought I'd post an update. Admittedly I feel like I've found a 
> bit of a catch-22 situation. I successfully completed my code to generate the 
> balanced page tree on the fly and it works fine with a single page sequence. 
> However, this morning I discovered that this code does not appear to work for 
> multiple page sequences in a flow. (2x 101 page sequences, I got pages 1-9, 
> 102, 10-101 then 103-end in that order...) I guess this is where pages can 
> come in in a different order anyway then, and why the current indexing / 
> nulls system is there.

Ouch! I had not considered that to be the purpose. Without looking closer, I 
would guess something like: page 10 contains a forward reference to page 102, 
and all the pages in between are only flushed after that reference has been 
resolved.

> (And shows that I am still learning the ropes as I go along...)

Yep, and also shows that I am not intimately familiar with *all* of the 
codebase myself. ;-)

> So I re-examined trying to generate the page tree after the pages have been 
> added into one big flat list. I can do this by, in 
> PDFDocument.outputTrailer(), calling a method to balance the page tree before 
> all the remaining objects are written out. This way pages can be attached to 
> nodes, and the tree hierarchy built up to the root node. This is on paper a 
> more elegant, efficient and easier solution to doing it on the fly. But I ran 
> into the same problem again - the page objects are already written out.

OK, there may be a gap in my understanding of it so far here, but...
Do you really _need_ the PDFPage object for some reason, or does its PDF 
reference suffice to build the page tree?
From what I know of PDF, that page tree would only contain the references to 
the actual page objects, no? As long as the PDFPages object is not written to 
the stream, you should be able to shuffle and play with the references all you 
want. All you need to keep track of is the natural order (= the page's index), 
as the object numbers will not necessarily reflect that.
Unless I am mistaken about this, I do not see a compelling reason *not* to 
write the PDFPage object to the stream as soon as it's finished. We keep a 
mapping of reference-to-index alive in the 'main' (temporary?) PDFPages object.
Note that notifyKidRegistered() only stores the reference; the natural index is 
translated into the position of the reference in the list. If you want to 
re-shape that into a structured tree/map, then by all means...

Perhaps there is still a catch --sounds too simple somehow... :-/
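To make the idea above concrete: since the tree only holds references, you can group a flat, naturally-ordered list into intermediate nodes of bounded fan-out after the fact. A minimal sketch (the names Node/balance/flatten are mine for illustration, not FOP's actual API, and MAX_KIDS = 10 is just a typical fan-out choice):

```java
import java.util.ArrayList;
import java.util.List;

public class PageTreeSketch {

    static final int MAX_KIDS = 10; // assumed fan-out per /Pages node

    /** A tree node whose kids are either child nodes or leaf references. */
    static class Node {
        List<Object> kids = new ArrayList<>();
    }

    /**
     * Group the current level into parents of at most MAX_KIDS each,
     * repeating until a single root remains. Only references are moved
     * around; the page objects themselves may long since have been
     * written to the output stream.
     */
    static Node balance(List<Object> level) {
        while (level.size() > MAX_KIDS) {
            List<Object> parents = new ArrayList<>();
            for (int i = 0; i < level.size(); i += MAX_KIDS) {
                Node parent = new Node();
                parent.kids.addAll(
                    level.subList(i, Math.min(i + MAX_KIDS, level.size())));
                parents.add(parent);
            }
            level = parents;
        }
        Node root = new Node();
        root.kids.addAll(level);
        return root;
    }

    /** Collect the leaves back in natural order (for verification). */
    static void flatten(Object n, List<Object> out) {
        if (n instanceof Node) {
            for (Object k : ((Node) n).kids) {
                flatten(k, out);
            }
        } else {
            out.add(n);
        }
    }
}
```

Because the grouping is over consecutive sublists, an in-order traversal of the result gives back the pages in exactly their natural order, regardless of object numbers.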

> <snip />
> My current questions are:
> -Why are the page objects flushed straight away? (Memory constraints?)

Very likely to save memory indeed. More with the intention of just flushing "as 
soon as possible", to support full streaming processing if the document 
structure allows it. Theoretically, in a document consisting of single-page 
fo:page-sequences, without any cross-references, you should see relatively low 
memory usage even if the document is 10000+ pages, precisely because the pages 
are all written to the output immediately, long before the root page tree, 
which only retains their object references. 
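As a rough illustration of why that keeps memory flat: each finished page is serialized straight to the output, and all that survives in memory is its indirect object reference (one small string per page). The class and method names below are purely illustrative, not the actual PDF library code:

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

public class StreamingFlushSketch {

    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private final List<String> pageRefs = new ArrayList<>();
    private int nextObjNum = 1;

    private void writeStr(String s) {
        byte[] b = s.getBytes();
        out.write(b, 0, b.length);
    }

    /**
     * Serialize the page object the moment it is finished; retain only
     * its indirect reference (e.g. "5 0 R") for the page tree that is
     * written at the very end of the document.
     */
    public String flushPage(String pageDict) {
        int objNum = nextObjNum++;
        writeStr(objNum + " 0 obj\n" + pageDict + "\nendobj\n");
        String ref = objNum + " 0 R";
        pageRefs.add(ref); // per-page memory cost: one small string
        return ref;
    }

    public List<String> getPageRefs() { return pageRefs; }

    public int bytesWritten() { return out.size(); }
}
```

Even for a 10000+ page document, the retained state is just the list of references, which is what makes the streaming behaviour possible when no cross-references force pages to be held back.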

> -Is it safe and wise to delay flushing the page objects until the end?

Safe? No issue here.
Wise? That would obviously depend on the context. 
In documents with 1000s of pages, I can imagine we do not want to keep all of 
those pages in memory any longer than strictly necessary... I wouldn't mind too 
much if it were an option that users could switch on/off. However, if the 
process is hard coded as the *only* way FOP will render PDFs, such that it 
would affect *all* users, I am not so sure it is wise to do this.

<snip />
