Thanks a lot for your reply Andreas. Yes, if all I had to do was move
references around then my work would already be complete and submitted for
review. However, the catch is that the Page objects also have Parent
references, which also need to be updated when they are moved from one page
tree node to another. But since the pages have already been written out, this
cannot be done. So the pages effectively become immovable (or else the parent
references will no longer match the kids references, as they will be out of
date - which is why acroread could not open the pages).
Delaying writing the page objects would mean the parent references can be
updated correctly, and the problem would be solved. But that has a potential
memory-usage cost.
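As a toy illustration (all names here are hypothetical, not FOP's actual classes), the invariant at stake is that every page's /Parent must point at the tree node whose /Kids array lists it. Once a page object has been flushed, its /Parent is frozen, so re-balancing the tree around it breaks the invariant:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: every page's /Parent must point at the node
// whose /Kids lists it. Moving a page under another node without rewriting
// its /Parent (impossible once the page object has been written out) breaks
// this, which is what trips up acroread.
public class ParentKidsInvariant {

    static class Page { String parentRef; }

    static class TreeNode {
        final String ref;
        final List<Page> kids = new ArrayList<>();
        TreeNode(String ref) { this.ref = ref; }
    }

    // Check that every kid's /Parent matches this node's reference.
    static boolean consistent(TreeNode node) {
        for (Page p : node.kids) {
            if (!node.ref.equals(p.parentRef)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        TreeNode a = new TreeNode("5 0 R");
        TreeNode b = new TreeNode("6 0 R");
        Page page = new Page();
        page.parentRef = a.ref; // set when the page was written out
        a.kids.add(page);
        System.out.println(consistent(a) && consistent(b)); // true

        // Re-balance: move the page under node b, but its /Parent is frozen.
        a.kids.remove(page);
        b.kids.add(page);
        System.out.println(consistent(b)); // false - stale /Parent
    }
}
```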
Today I will continue with my attempt to link every page to a node of its own
(stored in a flat list), then re-order the nodes according to the page index of
the page inside each one, and build the balanced page tree up from those nodes.
That's the plan anyway... (Time permitting, I'll also be interested in looking
more closely at what happened when the two page sequences ended up with
mixed-up pages...)
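For what it's worth, the plan above might be sketched roughly like this (a minimal, self-contained sketch; the names are hypothetical, not FOP's actual API): group the index-sorted page nodes into parents of at most some fan-out, and repeat until a single root remains.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (names are not FOP's): build a balanced /Pages tree
// bottom-up from a flat list of per-page nodes, already sorted by page index.
public class PageTreeSketch {

    static final int MAX_KIDS = 10; // assumed fan-out per tree node

    // Minimal stand-in for a page-tree node holding kid references.
    static class Node {
        final List<Node> kids = new ArrayList<>();
        int leafCount; // number of pages under this node (/Count in PDF)
    }

    static Node leaf() {
        Node n = new Node();
        n.leafCount = 1;
        return n;
    }

    // Repeatedly group runs of MAX_KIDS nodes under a new parent until a
    // single root remains.
    static Node buildBalancedTree(List<Node> level) {
        while (level.size() > 1) {
            List<Node> parents = new ArrayList<>();
            for (int i = 0; i < level.size(); i += MAX_KIDS) {
                Node parent = new Node();
                for (Node kid : level.subList(i, Math.min(i + MAX_KIDS, level.size()))) {
                    parent.kids.add(kid);
                    parent.leafCount += kid.leafCount;
                }
                parents.add(parent);
            }
            level = parents;
        }
        return level.get(0);
    }

    public static void main(String[] args) {
        List<Node> pages = new ArrayList<>();
        for (int i = 0; i < 202; i++) pages.add(leaf()); // 2x 101 pages
        Node root = buildBalancedTree(pages);
        System.out.println(root.leafCount);   // 202
        System.out.println(root.kids.size()); // 3 (21 mid-level nodes -> 3 top nodes)
    }
}
```

The crucial (and unsolved) part is that each leaf's /Parent reference is only known after this grouping, which is exactly why the already-flushed page objects are a problem.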
Thanks!
-Mike
On 08/06/11 20:14, Andreas L. Delmelle wrote:
On 08 Jun 2011, at 17:15, Michael Rubin wrote:
Hi Mike
> Hello there. Thought I'd post an update. Admittedly I feel like I've found a
> bit of a catch-22 situation. I successfully completed my code to generate the
> balanced page tree on the fly, and it works fine with a single page sequence.
> However, this morning I discovered that this code does not appear to work for
> multiple page sequences in a flow. (With 2x 101-page sequences, I got pages
> 1-9, 102, 10-101, then 103-end, in that order...) I guess this is where pages
> can arrive in a different order, and why the current indexing / nulls system
> is there.
Ouch! I had not considered that to be the purpose. Without looking closer, I
would say something like: page 10 contains a forward reference to page 102, and
all pages in between are only flushed after that reference has been resolved
(?)
> (And shows that I am still learning the ropes as I go along...)
Yep, and also shows that I am not intimately familiar with *all* of the
codebase myself. ;-)
> So I re-examined trying to generate the page tree after the pages have been
> added into one big flat list. I can do this by calling a method in
> PDFDocument.outputTrailer() to balance the page tree before all the remaining
> objects are written out. This way pages can be attached to nodes, and the
> tree hierarchy built up to the root node. On paper this is a more elegant,
> more efficient and easier solution than doing it on the fly. But I ran into
> the same problem again - the page objects are already written out.
OK, here may be a gap in my understanding of it so far, but...
Do you really _need_ the PDFPage object for some reason, or does its PDF
reference suffice to build the page tree?
From what I know of PDF, that page tree would only contain the references to
the actual page objects, no? As long as the PDFPages object is not written to
the stream, you should be able to shuffle and play with the references all you
want. All you need to keep track of, is to retain the natural order (= the
page's index), as the object numbers will not necessarily reflect that.
Unless I am mistaken about this, I do not see a compelling reason *not* to
write the PDFPage object to the stream as soon as it's finished. We keep a
mapping of reference-to-index alive in the 'main' (temporary?) PDFPages object.
Note that notifyKidRegistered() only stores the reference; the natural index is
translated into the position of the reference in the list. If you want to
re-shape that into a structured tree/map, then by all means...
Perhaps there is still a catch --sounds too simple somehow... :-/
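A minimal sketch of that suggestion (class and method names are hypothetical, not FOP's actual API): the registry keeps only each page's indirect reference at its natural index, so the page dictionary itself can be flushed the moment it is finished, even when pages are finished out of natural order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not FOP's actual classes): the page tree only needs
// each page's indirect reference ("n 0 R"), so the page dictionaries
// themselves can be flushed to the output as soon as they are finished.
public class ReferenceOnlyTree {

    static class Registry {
        final List<String> kidRefs = new ArrayList<>(); // position = natural page index
        int nextObjNum = 1;

        // Flush the page immediately; keep only its reference at the right index.
        String registerPage(int pageIndex, StringBuilder out) {
            String ref = (nextObjNum++) + " 0 R";
            out.append("...page object ").append(ref).append(" flushed...\n");
            while (kidRefs.size() <= pageIndex) kidRefs.add(null);
            kidRefs.set(pageIndex, ref);
            return ref;
        }

        // Emitted at the end: only references, in natural page order.
        String kidsArray() {
            return "/Kids [" + String.join(" ", kidRefs) + "]";
        }
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        Registry r = new Registry();
        // Pages may be finished out of natural order (e.g. page 1 before page 0):
        r.registerPage(1, out);
        r.registerPage(0, out);
        r.registerPage(2, out);
        System.out.println(r.kidsArray()); // /Kids [2 0 R 1 0 R 3 0 R]
    }
}
```

The catch Mike describes would still apply, though: each flushed page dictionary already carries a /Parent entry that this scheme cannot rewrite afterwards.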
<snip />
> My current questions are:
> -Why are the page objects flushed straight away? (Memory constraints?)
Very likely to save memory indeed. More with the intention of just flushing as
soon as possible, to support fully streaming processing if the document
structure allows it. Theoretically, in a document consisting of single-page
fo:page-sequences, without any cross-references, you should see relatively low
memory usage even for documents with a very large number of pages, precisely
because the pages are all written to the output immediately, long before the
root page tree, which only retains their object references.
> -Is it safe and wise to delay flushing the page objects until the end?
Safe? No issue here.
Wise? That would obviously depend on the context.
In documents with 1000s of pages, I can imagine we do not want to keep all of
those pages in memory any longer than strictly necessary... I wouldn't mind too
much if it were an option that users could switch on/off. However, if the
process is hard-coded as the *only* way FOP will render PDFs, such that it
would affect *all* users, I am not so sure it is wise to do this.
<snip />