Thanks a lot for your reply last week, Andreas. Sorry for the delay; I
have been away and offline. FYI, to follow up on the work I was doing:
In the end I saw that references are indeed kept by the PDFDocument, so
I decided it wouldn't do any harm (or take up any significant extra
memory) to keep references to the objects themselves while I am
constructing the balanced page tree. I have since modified PDFPages
(plus a small change in PDFPage), and the first working draft was
completed late yesterday. Each PDFPages object now keeps a list of its
sub-nodes (PDFPages, managed internally via a recursive algorithm;
external methods work as before to avoid regressions) or its leaves
(PDFPage), as well as the original kids list (each kid may be a PDFPage
or a sub PDFPages object) holding PDF references to all children. This
eliminates the overhead of looking up each object (potentially many
times).

I have successfully run it with test .fo files of up to 10001 pages
(each page just showing 'Page x/y', where x is the current page and y is
the total page count; it takes a while with that many pages, which is
not surprising), verifying that a balanced tree gets produced (and not a
flat tree of one page tree object containing 10001 pages!).

When each sub-node is created, the PDFFactory.makePages() method stores
it in the trailer. That way the objects are all written out at the end,
after I have added all the pages to the right places, just before the
cross-reference table and trailer themselves are written.

So now there are never more than 10 pages or 10 PDFPages (sub-nodes) per
PDFPages object (I never mix sub-nodes and leaves on the same node). The
result is a structure similar to the page tree of the PDF 1.4 Reference
document, generated automatically on the fly.
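
To illustrate why storing the sub-nodes with the trailer objects helps,
here is a minimal Java sketch of deferred write-out. It is not FOP code:
the class DeferredWriteSketch, its methods flushObjects() and finish(),
and the use of Supplier<String> for serialization are all assumptions
made for illustration. Only the names addObject() and addTrailerObject()
echo the methods mentioned in this mail. The point it shows is that an
object queued for the trailer is serialized only at the very end, so
changes made after registration still appear in the output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch of early vs. deferred object output; not FOP's PDFDocument. */
class DeferredWriteSketch {
    // Objects written at the next flush (serialized early).
    private final List<Supplier<String>> objects = new ArrayList<>();
    // Objects held back until just before the xref table and trailer.
    private final List<Supplier<String>> trailerObjects = new ArrayList<>();
    private final StringBuilder out = new StringBuilder();

    void addObject(Supplier<String> obj) {
        objects.add(obj);
    }

    void addTrailerObject(Supplier<String> obj) {
        trailerObjects.add(obj);
    }

    /** Serializes and writes everything queued via addObject(). */
    void flushObjects() {
        for (Supplier<String> obj : objects) {
            out.append(obj.get()).append('\n');
        }
        objects.clear();
    }

    /** Writes remaining objects, then the trailer objects, then a stand-in xref marker. */
    String finish() {
        flushObjects();
        for (Supplier<String> obj : trailerObjects) {
            out.append(obj.get()).append('\n');
        }
        trailerObjects.clear();
        out.append("xref\n"); // placeholder for the real cross-reference table
        return out.toString();
    }
}
```

In this sketch, a node registered with addTrailerObject() can keep
receiving kids until finish() is called, whereas one registered with
addObject() is frozen at the next flushObjects().
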
So, for example, a 101-page document will have a root PDFPages node with
two sub-nodes underneath. The first will have a count of 100 and contain
10 sub-nodes, each containing 10 pages. The second will simply contain
1 page. New pages get added to the second sub-node (moving pages down to
new sub-nodes to avoid more than 10 pages per node) until its count
reaches 100 too; then another node is created. Once 10 nodes exist under
the root (at 1000 pages), they get moved down below a new root-level
sub-node with a count of 1000, a new root-level sub-node is created
next to it, and so on.
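
The growth rule described above can be sketched in Java roughly as
follows. This is only an illustration under stated assumptions, not the
actual patch: PageTreeNode, addPage(), pushDown() and the field names are
hypothetical stand-ins for the modified PDFPages, and page references are
plain strings. A node holds either pages or sub-nodes, never both, and
when it is full its contents are moved down one level.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a page tree node; not FOP's PDFPages class. */
class PageTreeNode {
    static final int MAX_KIDS = 10;

    final List<PageTreeNode> subNodes = new ArrayList<>(); // used when height > 0
    final List<String> pages = new ArrayList<>();          // used when height == 0
    int height = 0; // 0 means this node holds pages directly
    int count = 0;  // total number of pages below this node

    /** Adds a page at the root; the root itself may grow without limit. */
    void addPage(String pageRef) {
        add(pageRef, Integer.MAX_VALUE);
    }

    /** Returns false if this node already holds 'limit' pages. */
    private boolean add(String pageRef, int limit) {
        if (count >= limit) {
            return false;
        }
        if (height == 0) {
            if (pages.size() < MAX_KIDS) {
                pages.add(pageRef);
                count++;
                return true;
            }
            pushDown(); // 10 pages already: move them into a new sub-node
        }
        // A sub-node of a node with height h may hold up to MAX_KIDS^h pages.
        int kidLimit = power(MAX_KIDS, height);
        PageTreeNode last = subNodes.get(subNodes.size() - 1);
        if (!last.add(pageRef, kidLimit)) {
            if (subNodes.size() == MAX_KIDS) {
                pushDown(); // 10 full sub-nodes: move them below one new sub-node
            }
            PageTreeNode fresh = new PageTreeNode();
            fresh.add(pageRef, MAX_KIDS);
            subNodes.add(fresh);
        }
        count++;
        return true;
    }

    /** Moves the current contents into a single new sub-node, one level down. */
    private void pushDown() {
        PageTreeNode kid = new PageTreeNode();
        kid.height = height;
        kid.count = count;
        kid.pages.addAll(pages);
        kid.subNodes.addAll(subNodes);
        pages.clear();
        subNodes.clear();
        subNodes.add(kid);
        height++;
    }

    private static int power(int base, int exp) {
        int result = 1;
        for (int i = 0; i < exp; i++) {
            result *= base;
        }
        return result;
    }

    public static void main(String[] args) {
        PageTreeNode root = new PageTreeNode();
        for (int i = 1; i <= 101; i++) {
            root.addPage(i + " 0 R");
        }
        System.out.println(root.subNodes.size());       // prints 2
        System.out.println(root.subNodes.get(0).count); // prints 100
        System.out.println(root.subNodes.get(1).count); // prints 1
    }
}
```

For 101 pages this reproduces the shape given in the example: a root
with two sub-nodes, the first with a count of 100 spread over 10
sub-nodes of 10 pages each, and the second holding the single remaining
page.
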
Next task is to write a JUnit test since one appears not to exist... I
guess remaining thoughts currently are:
- Wondering if keeping references to a page tree object's sub-nodes or
leaves is the best way or can I improve it further? (Bearing in mind
memory usage and performance.)
- Was wondering if the trailer objects list is the right place to put
the new sub-node PDFPages objects. (If I add an object to the normal
objects list instead, i.e. addObject() rather than addTrailerObject(),
it gets written out too soon, before I have added all the pages.) Given
that the trailer objects are written out before the xref and trailer, it
seems OK, and the output parses and displays fine in PDFBox/PDFDebugger
and in the evince PDF reader on Ubuntu.
- When registering the pages themselves via the notifyKidsRegistered()
method, it extracts the page index number and puts the reference at that
index in the kids list, filling empty spaces ahead of it with nulls. So
when counting kids and writing out the PDF code text I had to ignore
nulls and 'gaps' in the kids list, since not all the kids are in the
same list any more (they are spread across multiple page tree nodes). I
was wondering why this method was written like this, rather than simply
appending new pages to the end of the list every time.
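
The index-based registration described in the last point, and the
null-skipping it forces on the counting and output code, can be sketched
in Java as below. This is a simplified illustration based only on the
description in this mail: the class KidsListSketch and its methods
register(), count() and toKidsArray() are hypothetical names, not the
actual PDFPages implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of an index-addressed kids list with null gaps. */
class KidsListSketch {
    private final List<String> kids = new ArrayList<>();

    /** Stores the reference at the given page index, padding any gap with nulls. */
    void register(int pageIndex, String ref) {
        while (kids.size() <= pageIndex) {
            kids.add(null);
        }
        kids.set(pageIndex, ref);
    }

    /** Counts only the kids actually present, ignoring null gaps. */
    int count() {
        int n = 0;
        for (String ref : kids) {
            if (ref != null) {
                n++;
            }
        }
        return n;
    }

    /** Builds the /Kids array text, skipping null gaps. */
    String toKidsArray() {
        StringBuilder sb = new StringBuilder("[");
        for (String ref : kids) {
            if (ref == null) {
                continue;
            }
            if (sb.length() > 1) {
                sb.append(' ');
            }
            sb.append(ref);
        }
        return sb.append(']').toString();
    }
}
```

One property of placing references by index rather than appending is
that pages registered out of order still end up in page order in the
list; with plain appending, the list order would follow registration
order instead.
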
Once testing is complete I'll submit the code internally for the in-team
committers to review, as I did with the 128-bit encryption work last month...
Thanks!
-Mike
On 25/05/11 21:57, Andreas L. Delmelle wrote:
> On 25 May 2011, at 09:45, Michael Rubin wrote:
>
> Hi Mike
>
>> Hello there. In the PDFPages class the kids are stored as reference
>> strings (e.g. "23 0 R"). Each of these objects are PDFPage objects. Do
>> you know if there is a method somewhere that I can retrieve the PDF java
>> object based on the reference string?
>
> Not really, AFAIK. What you do have is various Collections of different
> subtypes of PDFObject, available by means of accessors on PDFDocument.
> I guess the closest you would get without too much effort is to obtain the one
> you're interested in, then iterate over its elements and check
> PDFObject.referencePDF() against the lookup string. You do have to know the
> type(s) of object you need in advance, though...
>
>> (I am aiming to add support for some of those kids being other PDFPages
>> nodes to create a more balanced page tree.)
>
> Interesting. Looking forward to seeing more.
>
> Regards
>
> Andreas
---
Michael Rubin
Developer
T: +44 20 8238 7400
F: +44 20 8238 7401
[email protected]