Re: Retrieving Objects question

Michael Rubin Thu, 09 Jun 2011 00:50:38 -0700

Thanks a lot for your reply, Andreas. Yes, if all I had to do was move 
references around, then my work would already be complete and submitted for 
review. However, the catch is that the Page objects also have Parent 
references, which need to be updated when the pages get moved from one page 
tree node to another. But since the pages have already been written out, this 
cannot be done. So the pages effectively become immovable (otherwise the Parent 
references go out of date and no longer match the Kids references, which is why 
acroread could not open the pages).
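
To make the mismatch concrete, here is a minimal sketch of the consistency rule involved: every page listed in a node's Kids must name that same node as its Parent. The class and method names are hypothetical and nothing here is FOP code; object references are modelled as plain strings such as "3 0 R".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not part of FOP. */
final class ParentKidsCheckSketch {

    /**
     * Returns the pages whose already-written Parent reference differs from
     * the node that currently lists them in its Kids array.
     */
    static List<String> stalePages(Map<String, String> writtenParent,
                                   Map<String, List<String>> kidsByNode) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, List<String>> node : kidsByNode.entrySet()) {
            for (String page : node.getValue()) {
                // The Parent value was fixed when the page was written out.
                if (!node.getKey().equals(writtenParent.get(page))) {
                    stale.add(page);
                }
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        // Both pages were written while they still belonged to node "3 0 R".
        Map<String, String> writtenParent = new LinkedHashMap<>();
        writtenParent.put("10 0 R", "3 0 R");
        writtenParent.put("11 0 R", "3 0 R");

        // Rebalancing later moved page "11 0 R" under node "7 0 R".
        Map<String, List<String>> kidsByNode = new LinkedHashMap<>();
        kidsByNode.put("3 0 R", Arrays.asList("10 0 R"));
        kidsByNode.put("7 0 R", Arrays.asList("11 0 R"));

        System.out.println(stalePages(writtenParent, kidsByNode));
    }
}
```

Any page reported by such a check would need its Parent entry rewritten, which is exactly what is no longer possible once the page has been flushed.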


Delaying writing the page objects would mean the parent references can be 
updated correctly, and the problem would be solved. But that comes at a 
potential cost in memory usage.

Today I will continue with my attempt to link every page to a node of its own 
(stored in a flat list), then re-order the nodes according to the page index of 
the page inside, and then build the balanced page tree up from those nodes. 
That's the plan anyway... (Time permitting, I'll also be interested in looking 
more closely at what happened when the 2 page sequences ended up with mixed up 
pages...)
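
Roughly, the plan above could look like the sketch below. It is only an illustration under assumptions: the Node type, the build() method and the maxKids fan-out limit are all hypothetical names and not FOP classes, and the parent links are set in memory before anything is written.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of a bottom-up balanced page tree; not FOP code. */
final class BalancedPageTreeSketch {

    /** Either a leaf wrapping one page index, or an inner node with kids. */
    static final class Node {
        final int pageIndex;                 // -1 marks an inner node
        final List<Node> kids = new ArrayList<>();
        Node parent;                         // assigned while the tree is built

        Node(int pageIndex) {
            this.pageIndex = pageIndex;
        }

        /** Number of pages below this node (what a PDF /Count would hold). */
        int count() {
            if (pageIndex >= 0) {
                return 1;
            }
            int total = 0;
            for (Node kid : kids) {
                total += kid.count();
            }
            return total;
        }
    }

    /** Sorts the leaves by page index, then groups them level by level. */
    static Node build(List<Node> leaves, int maxKids) {
        if (leaves.isEmpty() || maxKids < 2) {
            throw new IllegalArgumentException("need at least one page and maxKids >= 2");
        }
        List<Node> level = new ArrayList<>(leaves);
        level.sort(Comparator.comparingInt(n -> n.pageIndex));
        do {
            List<Node> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += maxKids) {
                Node parent = new Node(-1);
                int end = Math.min(i + maxKids, level.size());
                for (Node kid : level.subList(i, end)) {
                    kid.parent = parent;     // parent and kids stay in sync
                    parent.kids.add(kid);
                }
                next.add(parent);
            }
            level = next;
        } while (level.size() > 1);
        return level.get(0);
    }

    /** Depth-first walk that lists page indices in document order. */
    static List<Integer> pagesInOrder(Node node) {
        List<Integer> out = new ArrayList<>();
        if (node.pageIndex >= 0) {
            out.add(node.pageIndex);
        } else {
            for (Node kid : node.kids) {
                out.addAll(pagesInOrder(kid));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Node> leaves = new ArrayList<>();
        for (int index : new int[] {0, 2, 1, 4, 3}) {   // pages arriving out of order
            leaves.add(new Node(index));
        }
        Node root = build(leaves, 2);
        System.out.println(pagesInOrder(root) + " count=" + root.count());
    }
}
```

Because the leaves are sorted by page index before grouping, the order in which the pages arrived does not matter; the open question remains whether the page objects can be held back long enough for their parent links to be set this way.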

Thanks!

-Mike


On 08/06/11 20:14, Andreas L. Delmelle wrote:

On 08 Jun 2011, at 17:15, Michael Rubin wrote:

Hi Mike

Hello there. Thought I'd post an update. Admittedly I feel like I've found a 
bit of a catch-22 situation. I successfully completed my code to generate the 
balanced page tree on the fly and it works fine with a single page sequence. 
However, this morning I discovered that this code does not appear to work for 
multiple page sequences in a flow. (With 2x 101-page sequences, I got pages 1-9, 
102, 10-101 then 103-end in that order...) I guess this is a case where pages 
can arrive out of order, and why the current indexing / nulls system is there.

Ouch! I had not considered that to be the purpose. Without looking closer, I 
would say something like: page 10 contains a forward reference to page 102, and 
all pages in between are only flushed after the reference can be resolved 
(?)

(And shows that I am still learning the ropes as I go along...)

Yep, and also shows that I am not intimately familiar with *all* of the 
codebase myself. ;-)

So I re-examined trying to generate the page tree after the pages have been 
added into one big flat list. I can do this by, in PDFDocument.outputTrailer(), 
calling a method to balance the page tree before all the remaining objects are 
written out. This way pages can be attached to nodes, and the tree hierarchy 
built up to the root node. This is on paper a more elegant, efficient and 
easier solution than doing it on the fly. But I ran into the same problem 
again - the page objects are already written out.

OK, here may be a gap in my understanding of it so far, but...
Do you really _need_ the PDFPage object for some reason, or does its PDF 
reference suffice to build the page tree?
From what I know of PDF, that page tree would only contain the references to 
the actual page objects, no? As long as the PDFPages object is not written to 
the stream, you should be able to shuffle and play with the references all you 
want. All you need to keep track of, is to retain the natural order (= the 
page's index), as the object numbers will not necessarily reflect that.
Unless I am mistaken about this, I do not see a compelling reason *not* to 
write the PDFPage object to the stream as soon as it's finished. We keep a 
mapping of reference-to-index alive in the 'main' (temporary?) PDFPages object.
Note that notifyKidRegistered() only stores the reference; the natural index is 
translated into the position of the reference in the list. If you want to 
re-shape that into a structured tree/map, then by all means...
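
As a rough illustration of the reference-by-index idea described above, the sketch below keeps only each page's object reference, stored at the position given by its page index and padded with nulls until the earlier pages arrive. It follows the behaviour as described in this message and is not a copy of the actual notifyKidRegistered() body; the class and method names are hypothetical, and references are modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of an index-ordered list of page references; not FOP code. */
final class KidReferenceListSketch {

    // Position in the list = natural page index; value = object reference.
    private final List<String> kids = new ArrayList<>();

    /** Stores a page reference at its page index, padding gaps with null. */
    void register(int pageIndex, String reference) {
        if (pageIndex < 0) {
            throw new IllegalArgumentException("page index must not be negative");
        }
        while (kids.size() <= pageIndex) {
            kids.add(null);                  // placeholder for a page not yet seen
        }
        if (kids.get(pageIndex) != null) {
            throw new IllegalStateException("page " + pageIndex + " already registered");
        }
        kids.set(pageIndex, reference);
    }

    /** Renders the references in page order, e.g. for a Kids array. */
    String toKidsArray() {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < kids.size(); i++) {
            if (kids.get(i) == null) {
                throw new IllegalStateException("page " + i + " was never registered");
            }
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(kids.get(i));
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        KidReferenceListSketch list = new KidReferenceListSketch();
        list.register(1, "12 0 R");          // second page finishes first
        list.register(0, "15 0 R");
        System.out.println(list.toKidsArray());
    }
}
```

The object numbers in the example are deliberately not in page order, since the list position rather than the object number carries the natural order.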

Perhaps there is still a catch --sounds too simple somehow... :-/

<snip />
My current questions are:

-Why are the page objects flushed straight away? (Memory constraints?)

Very likely to save memory indeed. More with the intention of just flushing "as soon 
as possible", to support full streaming processing if the document structure allows 
it. Theoretically, in a document consisting of single-page fo:page-sequences, without any 
cross-references, you should see relatively low memory usage even if the document is 
10000+ pages, precisely because the pages are all written to the output immediately, long 
before the root page tree, which only retains their object references.

-Is it safe and wise to delay flushing the page objects until the end?

Safe? No issue here.
Wise? That would obviously depend on the context.
In documents with 1000s of pages, I can imagine we do not want to keep all of 
those pages in memory any longer than strictly necessary... I wouldn't mind too 
much if it were an option that users could switch on/off. However, if the 
process is hard coded as the *only* way FOP will render PDFs, such that it 
would affect *all* users, I am not so sure it is wise to do this.

<snip />


Regards

Andreas
---






Michael Rubin
Developer

T: +44 20 8238 7400
F: +44 20 8238 7401
