Thanks a lot for your reply last week, Andreas. Sorry for the delay. Been away and offline... FYI to follow up on the work I was doing:

In the end I saw that references are indeed kept by the PDFDocument. So I decided it wouldn't do any harm (or take up any significant extra memory) to keep references to the objects themselves when constructing the balanced page tree. I have since modified PDFPages (with a small change in PDFPage), and the first working draft, completed late yesterday, keeps a list of sub-nodes (PDFPages, managed internally via a recursive algorithm; external methods work as before to avoid regressions) or leaves (PDFPage), as well as the original kids list (each kid may be a PDFPage or a sub PDFPages object) with PDF references to all children. This eliminates the overhead of looking up each object (potentially many times).

I have successfully run it with test .fo files of up to 10001 pages (each just showing 'Page x/y', where x is the current page and y is the total page count; it takes a while with that many pages, but I'm not surprised), verifying that a balanced tree gets produced (and not a flat tree of one page tree object containing 10001 pages!).

When each sub-node is created, the PDFFactory.makePages() method stores it in the trailer objects list. That way the objects are all written out at the end, after I have added all the pages to the right places, just before the cross-reference table and trailer themselves are written. So now there are never more than 10 pages or 10 PDFPages (sub-nodes) per PDFPages object (I never mix sub-nodes and leaves on the same node). The result is a similar structure to the page tree of the PDF 1.4 Reference document, automatically generated on the fly.
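To illustrate the lookup overhead mentioned above, here is a minimal sketch of the difference between scanning for an object by its reference string and keeping the object next to that string. PdfObj and Kid are hypothetical stand-ins invented for this sketch, not FOP classes; only the referencePDF() name comes from the earlier exchange quoted below.

```java
import java.util.Collection;

class ReferenceLookupSketch {
    /** Hypothetical stand-in for a PDF object that can print its own reference. */
    interface PdfObj {
        String referencePDF(); // e.g. "23 0 R"
    }

    /** Linear scan: the cost paid on every lookup when only the string is kept. */
    static PdfObj findByReference(Collection<? extends PdfObj> objects, String ref) {
        for (PdfObj o : objects) {
            if (ref.equals(o.referencePDF())) {
                return o;
            }
        }
        return null; // no object with that reference
    }

    /** Kid entry that keeps the object alongside its reference, so no scan is needed. */
    static final class Kid {
        final String ref;
        final PdfObj object;

        Kid(PdfObj object) {
            this.object = object;
            this.ref = object.referencePDF();
        }
    }
}
```

The extra memory is one object pointer per kid, since the objects themselves are already held elsewhere.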

So, for example, a 101-page document will have a root PDFPages node with two sub-nodes underneath. The first will contain a count of 100 and have 10 sub-nodes, each containing 10 pages. The second will simply contain 1 page. New pages will get added to the second sub-node (moving pages down to new sub-nodes to avoid more than 10 pages per node) until its count reaches 100 too, and then another node is created. Once 10 nodes exist under the root (at 1000 pages), they will get moved down below a new root-level sub-node with a count of 1000, a new root-level sub-node is created, and so on.
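The growth rule described above can be sketched roughly as follows. This is not the modified PDFPages code: PageTreeNode, addPage(), pushKidsDown() and the capacity field are names made up for the sketch, and pages are represented by plain reference strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a page tree node; not FOP's PDFPages class. */
class PageTreeNode {
    static final int MAX_KIDS = 10;

    // A node holds either leaves or sub-nodes, never both.
    final List<String> pages = new ArrayList<>();          // leaf references, e.g. "23 0 R"
    final List<PageTreeNode> subNodes = new ArrayList<>(); // child page tree nodes
    int count = 0;           // total number of pages below this node
    int capacity = MAX_KIDS; // pages this node can hold at its current depth

    void addPage(String pageRef) {
        if (count == capacity) {
            pushKidsDown(); // node is full at this depth: add a level
        }
        if (subNodes.isEmpty()) {
            pages.add(pageRef); // still a leaf-holding node
        } else {
            PageTreeNode last = subNodes.get(subNodes.size() - 1);
            if (last.count == capacity / MAX_KIDS) {
                // Last sub-node has reached its share: start a new sibling.
                last = new PageTreeNode();
                subNodes.add(last);
            }
            last.addPage(pageRef);
        }
        count++;
    }

    /** Moves all current kids below one new sub-node and grows the capacity tenfold. */
    private void pushKidsDown() {
        PageTreeNode sub = new PageTreeNode();
        sub.pages.addAll(pages);
        sub.subNodes.addAll(subNodes);
        sub.count = count;
        sub.capacity = capacity;
        pages.clear();
        subNodes.clear();
        subNodes.add(sub);
        capacity *= MAX_KIDS;
    }

    /** Largest number of direct kids found on any node in this subtree. */
    int maxFanOut() {
        int max = Math.max(pages.size(), subNodes.size());
        for (PageTreeNode n : subNodes) {
            max = Math.max(max, n.maxFanOut());
        }
        return max;
    }

    static PageTreeNode build(int pageCount) {
        PageTreeNode root = new PageTreeNode();
        for (int i = 1; i <= pageCount; i++) {
            root.addPage(i + " 0 R"); // made-up references
        }
        return root;
    }

    public static void main(String[] args) {
        PageTreeNode root = build(101);
        System.out.println("root count: " + root.count
                + ", sub-node counts: " + root.subNodes.get(0).count
                + " and " + root.subNodes.get(1).count);
    }
}
```

In this sketch a sub-node is allowed to deepen itself (pushing its own pages down) until it reaches one tenth of its parent's capacity, which is what keeps sibling nodes at the same maximum size.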

Next task is to write a JUnit test since one appears not to exist... I guess remaining thoughts currently are:

- Is keeping references to a page tree object's sub-nodes or leaves the best way, or can I improve it further (bearing in mind memory usage and performance)?
- Is the trailer objects list the right place to write the new sub-node PDFPages objects? (If I write an object to the objects list instead, via addObject() rather than addTrailerObject(), it gets written out too soon, before I have added all the pages.) Given that the trailer objects are written out before the xref and trailer, it seems OK, and the output parses and displays fine in PDFBox/PDFDebugger and the evince PDF reader on Ubuntu.
- When registering the pages themselves, the notifyKidsRegistered() method extracts the page index number and puts the reference at that index in the kids list, filling empty spaces ahead of it with nulls. So when counting kids and writing out the PDF code text, I had to ignore nulls and 'gaps' in the kids list, since not all the kids are in the same list any more (they are spread across multiple page tree nodes). I was wondering why this method was written like this, rather than simply appending new pages to the end of the list every time.
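For the last point, the behaviour as I understand it can be sketched like this. KidsListSketch and its methods are invented for illustration and are not the actual notifyKidsRegistered() code. The sketch does show one effect of index-based placement: kids end up in page order even if registrations arrive out of order. Whether that is the reason the method was written that way is only a guess.

```java
import java.util.ArrayList;
import java.util.List;

class KidsListSketch {
    /** Puts ref at pageIndex, padding any gap in front of it with nulls. */
    static void register(List<String> kids, int pageIndex, String ref) {
        while (kids.size() <= pageIndex) {
            kids.add(null);
        }
        kids.set(pageIndex, ref);
    }

    /** Counts only the slots that actually hold a reference. */
    static int countKids(List<String> kids) {
        int n = 0;
        for (String k : kids) {
            if (k != null) {
                n++;
            }
        }
        return n;
    }

    /** Writes the kids as a PDF array, skipping null gaps. */
    static String kidsArray(List<String> kids) {
        StringBuilder sb = new StringBuilder("[");
        for (String k : kids) {
            if (k == null) {
                continue; // slot belongs to a page held by another node
            }
            if (sb.length() > 1) {
                sb.append(' ');
            }
            sb.append(k);
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        List<String> kids = new ArrayList<>();
        register(kids, 2, "9 0 R"); // registered first, but third in page order
        register(kids, 0, "5 0 R");
        System.out.println(countKids(kids) + " kids: " + kidsArray(kids));
    }
}
```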

Once testing is complete I'll submit the code internally for the in-team committers to review as I did with the 128 bit encryption work last month...



On 25/05/11 21:57, Andreas L. Delmelle wrote:
> On 25 May 2011, at 09:45, Michael Rubin wrote:
>
> Hi Mike
>
>> Hello there. In the PDFPages class the kids are stored as reference
>> strings (e.g. "23 0 R"). Each of these objects are PDFPage objects. Do
>> you know if there is a method somewhere that I can retrieve the PDF java
>> object based on the reference string?
> Not really, AFAIK. What you do have is various Collections of different
> subtypes of PDFObject, available by means of accessors on PDFDocument.
> I guess the closest you would get without too much effort is to obtain the one
> you're interested in, then iterate over its elements and check
> PDFObject.referencePDF() against the lookup string. You do have to know the
> type(s) of object you need in advance, though...
>
>> (I am aiming to add support for some of those kids being other PDFPages
>> nodes to create a more balanced page tree.)
> Interesting. Looking forward to seeing more.



Michael Rubin

T: +44 20 8238 7400
F: +44 20 8238 7401


The contents of this e-mail are intended for the named addressee only. It contains information that may be confidential. Unless you are the named addressee or an authorized designee, you may not copy or use it, or disclose it to anyone else. If you received it in error please notify us immediately and then destroy it.
