On 19/12/2011 07:20, Craig Ringer wrote:
Hi Craig,
Thanks for the detailed e-mail. This is indeed an interesting topic. As
I mentioned in an earlier e-mail, this is something that we've recently
been working on too.
- A clean way to associate data that's private to the image
processing plugin with a particular rendering run so I can access it
across multiple invocations of the plugin; and
For anyone else who needs this later: There doesn't appear to be any
especially nice way to do this with FOP's current image handler API,
as there's no general-purpose map on the user agent for image handlers
to stash their data in and nothing like that is passed as a param to
the image handler calls. The hints mechanism can pass data from a
preloader to a loader for the same image, but it can't be used to pass
data between image loaders.
What I've landed up doing is keying a WeakHashMap off the FOUserAgent
for the rendering run, as obtained via the RenderingContext passed to
ImageHandler.handleImage(...). So long as lookups and insertions on
the WeakHashMap are synchronized this is safe and will release the
image handler's per-render information when the FOUserAgent is
discarded at the end of the rendering run.
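A minimal sketch of the WeakHashMap approach described above. The class name PerRunRegistry and its method are hypothetical, not part of FOP, and the key type is left generic so the block runs without FOP on the classpath. In the plugin the key would be the FOUserAgent obtained from the RenderingContext.

```java
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Supplier;

/**
 * Hypothetical helper: holds per-render state keyed weakly on a "run" object.
 * In the FOP plugin the key would be the FOUserAgent; here it is a generic K.
 */
final class PerRunRegistry<K, V> {
    // Entries become eligible for removal once the key is no longer strongly
    // reachable. The stored value must not hold a strong reference to the key,
    // otherwise the entry can never be collected.
    private final Map<K, V> states = new WeakHashMap<>();

    /** Returns the state for this run, creating it on first use. */
    synchronized V stateFor(K runKey, Supplier<V> factory) {
        V state = states.get(runKey);
        if (state == null) {
            state = factory.get();
            states.put(runKey, state);
        }
        return state;
    }
}
```

Both the lookup and the insertion happen under one lock, so two image handler calls for the same run cannot each create their own state object.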
I'm now able to accumulate font usage information from the PDFs I
examine as I embed them and build a list of which fonts are used. I
can combine width arrays and first/last char listings to determine
which glyphs are required if the font is to be embedded as a subset.
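One way the accumulation could look, as a sketch only. It assumes each font dictionary contributes a FirstChar value and a Widths array, and it treats a non-zero width as "this character code is used", which is a heuristic and not something every producer guarantees. The class name GlyphUsage is hypothetical. Note that this records character codes; mapping codes to actual glyphs still depends on the font's encoding.

```java
import java.util.BitSet;

/** Hypothetical accumulator for the character codes one font is seen to use. */
final class GlyphUsage {
    private final BitSet usedCodes = new BitSet(256);

    /**
     * Merges one font dictionary's listing into the running set.
     * widths[i] is the width for character code firstChar + i.
     * Assumption: a width of zero means the code is not used by that subset.
     */
    void addUsage(int firstChar, double[] widths) {
        for (int i = 0; i < widths.length; i++) {
            if (widths[i] != 0.0) {
                usedCodes.set(firstChar + i);
            }
        }
    }

    /** Returns a copy of the union of all codes seen so far. */
    BitSet usedCodes() {
        return (BitSet) usedCodes.clone();
    }
}
```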
In our situation we had several large PDFs of around 50 pages each. In
this case the plug-in uses PDFBox to split the PDF into 50 separate
PDFs, which results in the same font being duplicated 50 times. We have
solved this particular problem by altering the way the
cache works in the plug-in. However, this doesn't solve the wider
problem of fonts repeated in different input PDFs. Mehdi developed the
code and will be able to provide a patch after the holiday season.
- How to append some additional PDF objects after the last page is
emitted but before the PDF document trailer and final xref table(s)
are written out.
For anyone else looking at this now or later:
It's possible to allocate a PDFObject and request that it be written
out at the end of the document. PDFDocument.outputTrailer(...) writes
objects added to the trailer list. Those objects were allocated via
the factory where they were given an object ID, but were then passed
to addTrailerObject(...) to request that they be written out at the
end of document production. If I ever start producing my own combined
font subsets from the original subset fonts in the input PDFs, this is
probably how I'd insert the combined font subset object.
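The idea of allocating an object ID early and writing the object late can be illustrated with a toy model. This is not FOP's PDFDocument or PDFFactory; ToyPdfWriter and all of its methods are made up for illustration, and the output is not a valid PDF. It only shows that a forward reference is safe once the number is reserved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model only: reserves object numbers early, writes some objects late. */
final class ToyPdfWriter {
    private final StringBuilder out = new StringBuilder();
    // Objects queued to be written just before the trailer, in insertion order.
    private final Map<Integer, String> trailerObjects = new LinkedHashMap<>();
    private int nextObjectNumber = 1;

    /** Reserves an object number without writing anything. */
    int allocate() {
        return nextObjectNumber++;
    }

    /** Formats an indirect reference to an object that may not be written yet. */
    static String ref(int objectNumber) {
        return objectNumber + " 0 R";
    }

    /** Writes an object immediately. */
    void writeNow(int objectNumber, String body) {
        out.append(objectNumber).append(" 0 obj\n").append(body).append("\nendobj\n");
    }

    /** Queues an already-numbered object to be written at the end. */
    void addTrailerObject(int objectNumber, String body) {
        trailerObjects.put(objectNumber, body);
    }

    /** Writes the queued objects, then a placeholder trailer marker. */
    String finish() {
        trailerObjects.forEach(this::writeNow);
        out.append("trailer\n");
        return out.toString();
    }
}
```

A form's resource dictionary can be written with ref(fontNumber) long before the font body itself is queued and emitted by finish().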
If I'm restricting font combining to fonts where fop has an original
font file and using fop's font subsystem, the above would require too
much duplication and make it hard to avoid embedding fonts twice (once
for form xobjects, once for main content). Instead I need to mark a
font as used in fop's FontInfo for the rendering run so fop writes it
out, and I need to obtain the font object's PDF object ID so I can
write forward references to it in the XObject forms' resource
dictionaries.
The problem here is that fop doesn't assign fonts an object ID until
very late in writing. The first reference to font objects is from the
resource dictionary, and fop only writes one of those - it is shared
between all pages and written out just before the trailer. Since fonts
are written out with the resources dictionary and don't usually need
object IDs until the resources dictionary has to reference them
there's no way to get their object IDs earlier in PDF production. This
changes when we need to write private resource dictionaries for
embedded form xobjects.
I'm looking at forcing early embedding of fonts with direct
makeFont(...) calls. This'll work so long as I'm happy embedding whole
fonts, but will prevent fop from subsetting the font for its own use
and prevent me from subsetting it for xobject forms.
Alternatively, I could defer the writing of the xobject form resource
dictionaries till the end of the document so I don't need to know the
font object IDs early - but I'd still need a way to write them *after*
the main fop resource dictionary. If I wanted to subset then I'd also
need a hook for just before fonts were written out by fop to adjust
the glyph width tables. I don't see any way around this without some
kind of PDF renderer listener for image handlers etc to use.
I'll try to put together a proof of concept that embeds whole fonts if
the font is found in a pdf form xobject, de-duplicating references so
all pdf form xobjects that use that font reference the same one. Fop
will use the same font since it knows about it and has stored it in
the used fonts map, so the only problem is that the whole font is
embedded rather than a subset.
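The de-duplication step could be sketched as a small pool that hands every form xobject the same object number for a given font. FontRefPool and its methods are hypothetical names, and keying on the base font name is an assumption; it relies only on the PDF convention that subset fonts are named with a six-uppercase-letter tag and a plus sign in front of the real name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntSupplier;

/** Hypothetical pool mapping each base font name to one shared object number. */
final class FontRefPool {
    private final Map<String, Integer> objectNumbers = new LinkedHashMap<>();

    /** Strips a subset tag such as "ABCDEF+" from a PDF font name, if present. */
    static String baseName(String pdfFontName) {
        if (pdfFontName.matches("[A-Z]{6}\\+.+")) {
            return pdfFontName.substring(7);
        }
        return pdfFontName;
    }

    /**
     * Returns the object number for this font, calling the allocator only the
     * first time the base name is seen.
     */
    int objectNumberFor(String pdfFontName, IntSupplier allocator) {
        return objectNumbers.computeIfAbsent(baseName(pdfFontName), k -> allocator.getAsInt());
    }
}
```

Two different subsets of the same font then resolve to a single reference, at the cost noted above that the whole font has to be embedded.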
FOP can't currently fully embed a font in PDF, so even if you had the
source font available the code changes required could be extensive. For
us, this approach isn't an option because we don't have the source font
to register in fop.xconf and embed. Therefore I am interested in knowing
what you've come up with in terms of merging subsets together to create
one super subset. That, in my view, is the most difficult challenge in
this problem. Resolving the problems with the cross references and the point
at which IDs are assigned should be solvable with a little code
refactoring. I'm sure one of the guys will speak up if that's not the case.
Anyone working on the same thing, please feel free to drop me a note.
Thanks,
Chris
--
Craig Ringer