On 19/12/2011 07:20, Craig Ringer wrote:

Hi Craig,

Thanks for the detailed e-mail. This is indeed an interesting topic. As I mentioned in an earlier e-mail, this is something that we've recently been working on too.

- A clean way to associate data that's private to the image processing plugin with a particular rendering run so I can access it across multiple invocations of the plugin; and

For anyone else who needs this later: There doesn't appear to be any especially nice way to do this with FOP's current image handler API, as there's no general-purpose map on the user agent for image handlers to stash their data in and nothing like that is passed as a param to the image handler calls. The hints mechanism can pass data from a preloader to a loader for the same image, but it can't be used to pass data between image loaders.

What I've ended up doing is keying a WeakHashMap off the FOUserAgent for the rendering run, as obtained via the RenderingContext passed to ImageHandler.handleImage(...). So long as lookups and insertions on the WeakHashMap are synchronized, this is safe and will release the image handler's per-render information when the FOUserAgent is discarded at the end of the rendering run.
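For anyone who wants to try the WeakHashMap approach, here is a minimal sketch of the pattern in plain Java. It is not FOP code: PerRunState and its get(...) method are hypothetical names, and the key type is left generic so the example runs without FOP on the classpath. In a real image handler the key would presumably be the FOUserAgent obtained from the RenderingContext.

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Supplier;

/**
 * Per-rendering-run storage, keyed weakly on an object that lives exactly
 * as long as the run. Hypothetical helper, not part of FOP.
 */
final class PerRunState<K, V> {
    // WeakHashMap is not thread-safe, so all access goes through a synchronized view.
    private final Map<K, V> states =
            Collections.synchronizedMap(new WeakHashMap<>());

    /** Returns the state stored for this run, creating it on first use. */
    V get(K runKey, Supplier<V> factory) {
        // The synchronized view performs computeIfAbsent under its lock,
        // so the lookup and the insertion happen as one atomic step.
        return states.computeIfAbsent(runKey, k -> factory.get());
    }
}
```

One caveat with this pattern: the stored value must not hold a strong reference back to the key, otherwise the entry can never be collected and the per-run data outlives the run.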

I'm now able to accumulate font usage information from the PDFs I examine as I embed them and build a list of which fonts are used. I can combine width arrays and first/last char listings to determine which glyphs are required if the font is to be embedded as a subset.
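A rough sketch of how the width arrays and first/last char listings might be combined, for simple (non-CID) fonts only. GlyphUsage and its methods are hypothetical names. The sketch assumes that a zero entry in /Widths marks a code the subset does not use; that is a heuristic and will not hold for every producer. Widths are taken as ints for brevity, although a PDF may store them as reals.

```java
import java.util.BitSet;

/** Accumulates which character codes of one font are used across several subsets. */
final class GlyphUsage {
    private final BitSet usedCodes = new BitSet(256);

    /**
     * Merges one font dictionary's /FirstChar, /LastChar and /Widths entries.
     * ASSUMPTION: a zero width marks a code that this subset does not use.
     */
    void merge(int firstChar, int lastChar, int[] widths) {
        if (widths.length != lastChar - firstChar + 1) {
            throw new IllegalArgumentException(
                    "Widths length must equal LastChar - FirstChar + 1");
        }
        for (int i = 0; i < widths.length; i++) {
            if (widths[i] != 0) {
                usedCodes.set(firstChar + i);
            }
        }
    }

    /** Lowest used code, or -1 if nothing has been merged yet. */
    int firstChar() {
        return usedCodes.nextSetBit(0);
    }

    /** Highest used code, or -1 if nothing has been merged yet. */
    int lastChar() {
        return usedCodes.length() - 1;
    }

    boolean isUsed(int code) {
        return usedCodes.get(code);
    }
}
```

Calling merge(...) once per embedded PDF that uses the font leaves the union of required codes in the bit set, which is the input a combined subset would need.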

In our situation we had several large PDFs of around 50 pages each. What happens in this case is that the plug-in uses PDFBox to split the PDF into 50 separate PDFs, which results in the same font being duplicated 50 times. We have solved this particular problem by altering the way the cache works in the plug-in. However, this doesn't solve the wider problem of fonts repeated in different input PDFs. Mehdi developed the code and will be able to provide a patch after the holiday season.



- How to append some additional PDF objects after the last page is emitted but before the PDF document trailer and final xref table(s) are written out.

For anyone else looking at this now or later:

It's possible to allocate a PDFObject and request that it be written out at the end of the document. PDFDocument.outputTrailer(...) writes objects added to the trailer list. Those objects were allocated via the factory where they were given an object ID, but were then passed to addTrailerObject(...) to request that they be written out at the end of document production. If I ever start producing my own combined font subsets from the original subset fonts in the input PDFs, this is probably how I'd insert the combined font subset object.
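To illustrate why this works, here is a toy model of the idea. It is not FOP's implementation: ToyPdfWriter, Obj, allocate(), writeNow() and finish() are all hypothetical, and addTrailerObject is named only to mirror the FOP method mentioned above. The point it shows is that an object which gets its number at allocation time can be referenced by objects written earlier, even though its own body is only written just before the trailer.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model, not FOP code: numbers are assigned at allocation, bodies are written later. */
final class ToyPdfWriter {
    /** A numbered object whose body may be filled in any time before finish(). */
    static final class Obj {
        final int number;
        String body = "null";

        Obj(int number) {
            this.number = number;
        }

        /** Indirect reference, usable before the object itself has been written. */
        String ref() {
            return number + " 0 R";
        }
    }

    private int nextNumber = 1;
    private final List<Obj> trailerObjects = new ArrayList<>();
    private final StringBuilder out = new StringBuilder();

    /** Allocates an object and gives it its number immediately. */
    Obj allocate() {
        return new Obj(nextNumber++);
    }

    /** Writes the object's current body to the output straight away. */
    void writeNow(Obj o) {
        out.append(o.number).append(" 0 obj\n").append(o.body).append("\nendobj\n");
    }

    /** Requests that the object be written at the end of the document. */
    void addTrailerObject(Obj o) {
        trailerObjects.add(o);
    }

    /** Writes all deferred objects, then a stand-in for the trailer. */
    String finish() {
        for (Obj o : trailerObjects) {
            writeNow(o);
        }
        out.append("trailer\n");
        return out.toString();
    }
}
```

In this model a form XObject's resource dictionary can contain font.ref() well before the font body exists, which is the forward-reference behaviour the paragraph above relies on.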

If I'm restricting font combining to fonts where fop has an original font file and using fop's font subsystem, the above would require too much duplication and make it hard to avoid embedding fonts twice (once for form xobjects, once for main content). Instead, I need to mark a font as used in fop's FontInfo for the rendering run so fop writes it out, and I need to obtain the font object's PDF object ID so I can write forward references to it in the XObject forms' resource dictionaries.

The problem here is that fop doesn't assign fonts an object ID until very late in writing. The first reference to font objects is from the resource dictionary, and fop only writes one of those: it is shared between all pages and written out just before the trailer. Since fonts are written out with the resources dictionary and don't usually need object IDs until the resources dictionary has to reference them, there's no way to get their object IDs earlier in PDF production. This changes when we need to write private resource dictionaries for embedded form xobjects.

I'm looking at forcing early embedding of fonts with direct makeFont(...) calls. This'll work so long as I'm happy embedding whole fonts, but will prevent fop from subsetting the font for its own use and prevent me from subsetting it for xobject forms.

Alternatively, I could defer the writing of the xobject form resource dictionaries until the end of the document so I didn't need to know the font object IDs early, but I'd still need a way to write them *after* the main fop resource dictionary. If I wanted to subset, I'd also need a hook just before fop writes out the fonts, so that I could adjust the glyph width tables. I don't see any way around this without some kind of PDF renderer listener for image handlers etc. to use.

I'll try to put together a proof of concept that embeds whole fonts if the font is found in a pdf form xobject, de-duplicating references so all pdf form xobjects that use that font reference the same one. Fop will use the same font since it knows about it and has stored it in the used fonts map, so the only problem is that the whole font is embedded rather than a subset.

FOP can't currently fully embed a font in PDF, so even if you had the source font available, the code changes required could be extensive. For us, this approach isn't an option because we don't have the source font to register in fop.xconf and embed. Therefore I am interested in knowing what you've come up with in terms of merging subsets together to create one super subset. That, in my view, is the most difficult challenge in this problem. Resolving the problems with the cross references and the point at which IDs are assigned should be solvable with a little code refactoring. I'm sure one of the guys will speak up if that's not the case.


Anyone working on the same thing, please feel free to drop me a note.

Thanks,

Chris


--
Craig Ringer


