Re: Retrieving Objects question

Michael Rubin Fri, 03 Jun 2011 01:55:26 -0700

Thanks a lot for your reply last week Andreas. Sorry for the delay. Been away and offline... FYI to follow up on the work I was doing:

In the end I saw that references are indeed kept by the PDFDocument. So I decided it wouldn't do any harm (or take up any significant extra memory) to keep references to the objects themselves when I am constructing the balanced page tree. I have since modified PDFPages (and a small change in PDFPage) and the first working draft completed late yesterday keeps a list of sub-nodes (PDFPages, managed internally via a recursive algorithm - external methods work as before to avoid regressions) or leaves (PDFPage) as well as the original kids (may be a PDFPage or a sub PDFPages object) with PDF references to all children. This eliminates an overhead of looking up each object (potentially many times).

I have successfully run it with test .fo files up to 10001 pages (each just showing 'Page x/y' where x is current page and y is total page count, takes a while with that many pages but not surprised) verifying that a balanced tree gets produced (and not a flat tree of one page tree object containing 10001 pages!).

When each subnode is created the PDFFactory.makePages() method stores it in the trailer. That way the objects are all written out at the end after I have added all the pages to the right places, just before the cross reference table and trailer themselves are written. So now there are never more than 10 pages or 10 PDFPages (sub-nodes) per PDFPages object (I never mix sub-nodes and leaves on the same node). A similar structure to the page tree of the PDF 1.4 Reference document. Automatically generated on the fly.

So for example a 101 page document will have a root PDFPages node with two sub-nodes underneath. The first will contain a count of 100, and have 10 sub-nodes, each containing 10 pages. The second will simply contain 1 page. More new pages will get added to the second sub-node (moving pages down to new sub-nodes to avoid more than 10 pages per node) until its count reaches 100 too, then another node is created. Once 10 nodes under the root exist (at 1000 pages) they will get moved down below a new root level sub-node with a count of 1000, and a new root level sub-node is created, and so on.
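To make the growth rule above concrete, here is a minimal sketch of it in isolation. It is not the actual PDFPages change: the class PageTreeSketch, its Node type, the limit field and the use of plain strings for page references are all made up for illustration. Each node holds either pages or sub-nodes, never both, and never more than 10 of either.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a page tree with at most 10 kids per node; not FOP code. */
final class PageTreeSketch {
    static final int MAX_KIDS = 10;

    /** A page tree node: holds either pages (leaves) or sub-nodes, never both. */
    static final class Node {
        final List<String> pages = new ArrayList<>();   // page references, e.g. "23 0 R"
        final List<Node> subNodes = new ArrayList<>();
        final long limit;                               // most pages this node may hold
        long count;                                     // pages at or below this node

        Node(long limit) { this.limit = limit; }

        boolean isFull() { return count >= limit; }

        void add(String pageRef) {
            count++;
            if (subNodes.isEmpty()) {
                if (pages.size() < MAX_KIDS) {
                    pages.add(pageRef);
                    return;
                }
                // Ten pages already here: move them down into a new sub-node.
                Node moved = new Node(MAX_KIDS);
                moved.pages.addAll(pages);
                moved.count = pages.size();
                pages.clear();
                subNodes.add(moved);
            }
            Node last = subNodes.get(subNodes.size() - 1);
            if (last.isFull()) {
                if (subNodes.size() == MAX_KIDS) {
                    // Ten full sub-nodes: move them down below one new sub-node.
                    Node moved = new Node(last.limit * MAX_KIDS);
                    moved.subNodes.addAll(subNodes);
                    moved.count = count - 1;            // everything except the new page
                    subNodes.clear();
                    subNodes.add(moved);
                    last = moved;
                }
                // Start a new sibling with the same capacity as the existing ones.
                last = new Node(last.limit);
                subNodes.add(last);
            }
            last.add(pageRef);
        }
    }

    /** The root has no practical upper limit on its page count. */
    final Node root = new Node(Long.MAX_VALUE);

    void addPage(String pageRef) { root.add(pageRef); }
}
```

Adding 101 pages to this sketch gives a root with two sub-nodes: the first with a count of 100 spread over 10 sub-nodes of 10 pages each, the second holding the single remaining page.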

Next task is to write a JUnit test since one appears not to exist... I guess remaining thoughts currently are:

- Wondering if keeping references to a page tree object's sub-nodes or leaves is the best way or can I improve it further? (Bearing in mind memory usage and performance.)
- Was wondering if the trailer objects list is the right place to write the new sub-node PDFPages objects. (But if writing an object to the objects list - addObject() instead of addTrailerObject() - it gets written out too soon before I have added all the pages.) But given how it writes the objects out before writing the xref and trailer it seems OK and parses and shows fine in PDFBox/PDFDebugger and the evince PDF Reader in Ubuntu.
- When registering the pages themselves via notifyKidsRegistered() method it extracts the page index number and puts the reference at that index in the kids list, filling empty spaces ahead of it with nulls. So when counting kids and writing out the pdf code text I had to ignore nulls and 'gaps' in the kids list since not all the kids are in the same list any more (spread across multiple page tree nodes). I was wondering why this method was written like this, and doesn't simply append new pages to the end of the list all the time.
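For the last point, the behaviour described can be sketched roughly as below. This is only an illustration and not the FOP source: the class KidsListSketch and its method names are invented, and references are plain strings. It shows index-based registration with null padding, plus counting and writing that skip the gaps.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of an index-addressed kids list; not FOP's actual code. */
final class KidsListSketch {
    private final List<String> kids = new ArrayList<>();

    /** Stores the reference at pageIndex, padding any gap before it with nulls. */
    void register(int pageIndex, String ref) {
        while (kids.size() <= pageIndex) {
            kids.add(null);
        }
        kids.set(pageIndex, ref);
    }

    /** Number of kids actually present, ignoring null gaps. */
    int count() {
        int n = 0;
        for (String ref : kids) {
            if (ref != null) {
                n++;
            }
        }
        return n;
    }

    /** Renders the kids as a PDF array, skipping null gaps. */
    String toKidsArray() {
        StringBuilder sb = new StringBuilder("[");
        for (String ref : kids) {
            if (ref == null) {
                continue;
            }
            if (sb.length() > 1) {
                sb.append(' ');
            }
            sb.append(ref);
        }
        return sb.append(']').toString();
    }
}
```

One property of placing by index in this sketch is that the output order follows the page index even if pages are registered out of order, which a plain append would not guarantee. Whether that is the reason the real method is written this way is a guess.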

Once testing is complete I'll submit the code internally for the in-team committers to review as I did with the 128 bit encryption work last month...


Thanks!

-Mike

On 25/05/11 21:57, Andreas L. Delmelle wrote:

On 25 May 2011, at 09:45, Michael Rubin wrote:

Hi Mike

Hello there. In the PDFPages class the kids are stored as reference
strings (e.g. "23 0 R"). Each of these objects is a PDFPage object. Do
you know if there is a method somewhere that lets me retrieve the PDF Java
object based on the reference string?

Not really, AFAIK. What you do have is various Collections of different 
subtypes of PDFObject, available by means of accessors on PDFDocument.
I guess the closest you would get without too much effort is to obtain the one 
you're interested in, then iterate over its elements and check 
PDFObject.referencePDF() against the lookup string. You do have to know the 
type(s) of object you need in advance, though...
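The lookup suggested here amounts to a linear scan. A minimal sketch follows, assuming only that each candidate exposes referencePDF() as mentioned above; the Referenced interface and the findByReference helper are hypothetical stand-ins, not part of FOP.

```java
/** Hypothetical sketch of a lookup by reference string; not part of FOP. */
final class ReferenceLookupSketch {

    /** Stand-in for the one PDFObject method the lookup needs. */
    interface Referenced {
        String referencePDF();   // e.g. "23 0 R"
    }

    /** Returns the first object whose reference equals ref, or null if none does. */
    static <T extends Referenced> T findByReference(Iterable<T> objects, String ref) {
        for (T obj : objects) {
            if (ref.equals(obj.referencePDF())) {
                return obj;
            }
        }
        return null;
    }
}
```

Each call walks the whole collection in the worst case, so repeating it for every kid gets expensive on large documents. Holding on to the object itself when it is created avoids the scan.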

(I am aiming to add support for some of those kids being other PDFPages
nodes to create a more balanced page tree.)

Interesting. Looking forward to seeing more.


Regards

Andreas
---






Michael Rubin
Developer

T: +44 20 8238 7400
F: +44 20 8238 7401

[email protected]

The contents of this e-mail are intended for the named addressee only. It contains information that may be confidential. Unless you are the named addressee or an authorized designee, you may not copy or use it, or disclose it to anyone else. If you received it in error please notify us immediately and then destroy it.
