Re: Document and page callbacks for image handlers

Chris Bowditch Wed, 21 Dec 2011 01:08:15 -0800

On 19/12/2011 07:20, Craig Ringer wrote:

Hi Craig,

Thanks for the detailed e-mail. This is indeed an interesting topic. As I mentioned in an earlier e-mail, this is something that we've recently been working on too.

- A clean way to associate data that's private to the image processing plugin with a particular rendering run so I can access it across multiple invocations of the plugin; and
For anyone else who needs this later: There doesn't appear to be any especially nice way to do this with FOP's current image handler API, as there's no general-purpose map on the user agent for image handlers to stash their data in and nothing like that is passed as a param to the image handler calls. The hints mechanism can pass data from a preloader to a loader for the same image, but it can't be used to pass data between image loaders.
What I've landed up doing is keying a WeakHashMap off the FOUserAgent for the rendering run, as obtained via the RenderingContext passed to ImageHandler.handleImage(...). So long as lookups and insertions on the WeakHashMap are synchronized this is safe and will release the image handler's per-render information when the FOUserAgent is discarded at the end of the rendering run.
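
The WeakHashMap approach described above could look roughly like the sketch below. It is only an illustration: PerRunState and forRun are hypothetical names, and the key is a plain Object standing in for the FOUserAgent so that the example has no FOP dependency.

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Supplier;

/**
 * Per-rendering-run storage, keyed weakly on a "run" object.
 * Assumption: in FOP the key would be the FOUserAgent for the run;
 * here it is any Object.
 */
final class PerRunState<T> {
    // The synchronized wrapper guards single calls; the compound
    // get-or-create below locks on the same wrapper object.
    private final Map<Object, T> states =
            Collections.synchronizedMap(new WeakHashMap<>());
    private final Supplier<T> factory;

    PerRunState(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Returns the state for this run, creating it on first use. */
    T forRun(Object runKey) {
        synchronized (states) {
            T state = states.get(runKey);
            if (state == null) {
                state = factory.get();
                states.put(runKey, state);
            }
            return state;
        }
    }
}
```

One caveat with any WeakHashMap: the stored value must not hold a strong reference back to its key, otherwise the entry is never released.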
I'm now able to accumulate font usage information from the PDFs I examine as I embed them and build a list of which fonts are used. I can combine width arrays and first/last char listings to determine which glyphs are required if the font is to be embedded as a subset.
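
A minimal sketch of that accumulation step, under stated assumptions: FontUsageAccumulator, record and usedCodes are hypothetical names; each input font is a simple font described by a first char code and a widths array; a zero width is treated as "not used"; and the subsets being merged share the same encoding. None of this is FOP or PDFBox API.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Collects, per base font name, the character codes seen in input PDFs. */
final class FontUsageAccumulator {
    private final Map<String, BitSet> usedCodesByFont = new HashMap<>();

    /** Strips a PDF subset tag such as "ABCDEF+" from a font name. */
    static String baseName(String fontName) {
        return fontName.replaceFirst("^[A-Z]{6}\\+", "");
    }

    /** Marks every code in firstChar..firstChar+widths.length-1 that has a non-zero width. */
    void record(String fontName, int firstChar, float[] widths) {
        BitSet used = usedCodesByFont.computeIfAbsent(
                baseName(fontName), k -> new BitSet(256));
        for (int i = 0; i < widths.length; i++) {
            if (widths[i] != 0f) { // assumption: zero width means unused
                used.set(firstChar + i);
            }
        }
    }

    /** Returns a copy of the codes recorded so far for this font. */
    BitSet usedCodes(String fontName) {
        BitSet used = usedCodesByFont.get(baseName(fontName));
        return used == null ? new BitSet() : (BitSet) used.clone();
    }
}
```

For example, recording "ABCDEF+Helvetica" with first char 65 and widths {667, 0, 722}, then "GHIJKL+Helvetica" with first char 66 and widths {667}, leaves codes 65, 66 and 67 marked for Helvetica.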

In our situation we had several large PDFs of around 50 pages each. What happens in this case is that the plug-in uses PDFBox to split the PDF into 50 separate PDFs, which results in the same font being duplicated 50 times. We have solved this particular problem by altering the way the cache works in the plug-in. However, this doesn't solve the wider problem of fonts repeated in different input PDFs. Mehdi developed the code and will be able to provide a patch after the holiday season.

- How to append some additional PDF objects after the last page is emitted but before the PDF document trailer and final xref table(s) are written out.
For anyone else looking at this now or later:
It's possible to allocate a PDFObject and request that it be written out at the end of the document. PDFDocument.outputTrailer(...) writes objects added to the trailer list. Those objects were allocated via the factory where they were given an object ID, but were then passed to addTrailerObject(...) to request that they be written out at the end of document production. If I ever start producing my own combined font subsets from the original subset fonts in the input PDFs, this is probably how I'd insert the combined font subset object.
If I'm restricting font combining to fonts where fop has an original font file and using fop's font subsystem the above would require too much duplication and make it hard to avoid embedding fonts twice (once for form xobjects, once for main content). Instead I need to mark a font as used in fop's FontInfo for the rendering run so fop writes it out, and I need to obtain the font object's PDF object ID so I can write forward references to it in the XObject forms' resource dictionaries.
The problem here is that fop doesn't assign fonts an object ID until very late in writing. The first reference to font objects is from the resource dictionary, and fop only writes one of those - it is shared between all pages and written out just before the trailer. Since fonts are written out with the resources dictionary and don't usually need object IDs until the resources dictionary has to reference them, there's no way to get their object IDs earlier in PDF production. This changes when we need to write private resource dictionaries for embedded form xobjects.
I'm looking at forcing early embedding of fonts with direct makeFont(...) calls. This'll work so long as I'm happy embedding whole fonts, but will prevent fop from subsetting the font for its own use and prevent me from subsetting it for xobject forms.
Alternately, I could defer the writing of the xobject form resource dictionaries till the end of the document so I didn't need to know the font object IDs early - but I'd still need a way to write them *after* the main fop resource dictionary. If I wanted to subset then I'd also need a hook for just before fonts were written out by fop to adjust the glyph width tables. I don't see any way around this without some kind of PDF renderer listener for image handlers etc to use.
I'll try to put together a proof of concept that embeds whole fonts if the font is found in a pdf form xobject, de-duplicating references so all pdf form xobjects that use that font reference the same one. Fop will use the same font since it knows about it and has stored it in the used fonts map, so the only problem is that the whole font is embedded rather than a subset.
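
The de-duplication part of that idea can be sketched on its own, independent of FOP. SharedFontRefs and referenceFor are hypothetical names, and the local counter is only a stand-in for however the PDF document really allocates object numbers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hands out one indirect reference per font, shared by every form xobject. */
final class SharedFontRefs {
    private final Map<String, Integer> objectNumberByFont = new LinkedHashMap<>();
    private int nextObjectNumber; // assumption: stand-in for the real allocator

    SharedFontRefs(int firstFreeObjectNumber) {
        this.nextObjectNumber = firstFreeObjectNumber;
    }

    /** Returns a reference such as "12 0 R", allocating a number on first use only. */
    String referenceFor(String fontKey) {
        Integer number = objectNumberByFont.get(fontKey);
        if (number == null) {
            number = nextObjectNumber++;
            objectNumberByFont.put(fontKey, number);
        }
        return number + " 0 R";
    }

    /** Number of distinct fonts that still have to be written out. */
    int fontCount() {
        return objectNumberByFont.size();
    }
}
```

Every form xobject resource dictionary that asks for the same font key gets the same reference back, so the font object itself only needs to be written once.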

FOP can't currently fully embed a font in PDF, so even if you had the source font available the code changes required could be extensive. For us, this approach isn't an option because we don't have the source font to register in fop.xconf and embed. Therefore I am interested in knowing what you've come up with in terms of merging subsets together to create 1 super subset. That in my view is the most difficult challenge in this problem. Resolving the problems with the cross references and the point at which IDs are assigned should be solvable with a little code refactoring. I'm sure one of the guys will speak up if that's not the case.


Anyone working on the same thing, please feel free to drop me a note.


Thanks,

Chris


--
Craig Ringer
