Hi

As requested by Mehdi Houshmand, I'm elaborating on the issue we've
been running into with fop-pdf-image. I've asked about aspects of it
on the list before, but I now have a better understanding of what's
going on.

Where input PDFs being used as form XObjects contain embedded subset
fonts, I'm seeing many copies of those fonts embedded in the output
document. This creates huge output files with lots of duplicate font
data, and in a few cases has even crashed the RIP used by my
employer's offset press printer. I believe they use a Fiery, but I've
struggled to get any more information than that out of them.

The issue is that fop-pdf-image copies PDFs into fop output PDFs by
copying the content stream and resources dictionary verbatim from the
page being extracted from the input PDF, translating it from PDFBox
into fop PDF structures in the process. This is extremely reliable,
ensuring that fop-pdf-image form XObjects don't conflict with or
interfere with the embedding page, or vice versa. Unfortunately it
also leads to massive duplication of data, including:

- Fonts, both subsets and fully embedded fonts
- Embedded ICC profiles, if present
- Images re-used across multiple pages or documents

In the case of images, ICC profiles, and fully embedded fonts, it'd
potentially be relatively easy to coalesce these so that all resource
dictionaries refer to the same object. It's a little hacky because fop
doesn't give image plugins any "official" way to store data about a
rendering run for later reference, but it's easy enough to do by
storing a WeakHashMap<FOUserAgent,...> that associates object type and
checksum data with a particular rendering run. I haven't implemented
coalescing of images and profiles because it's not part of my problem
space, but it shouldn't be too hard.
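To make the idea concrete, here's a minimal sketch of that per-run
cache. All names here (RunResourceCache, coalesce) are hypothetical,
and I've used a plain Object key in place of fop's FOUserAgent so the
example stands alone:

```java
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

/** Hypothetical per-rendering-run cache keyed on a digest of the resource bytes. */
class RunResourceCache {
    // One cache per rendering run; entries vanish when the run's key
    // (in practice the FOUserAgent) is garbage-collected.
    private static final Map<Object, Map<String, Object>> RUNS = new WeakHashMap<>();

    /** Return the previously stored object for identical bytes, or store and return newObj. */
    static synchronized Object coalesce(Object userAgent, String type,
                                        byte[] data, Object newObj) {
        final byte[] digest;
        try {
            digest = MessageDigest.getInstance("SHA-256").digest(data);
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
        Map<String, Object> perRun = RUNS.computeIfAbsent(userAgent, k -> new HashMap<>());
        String key = type + ":" + java.util.Base64.getEncoder().encodeToString(digest);
        return perRun.computeIfAbsent(key, k -> newObj);
    }
}
```

The first caller to store bytes with a given checksum wins; later
callers get back the already-registered object and can point their
resource dictionary entries at it.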

Unfortunately, the above approach doesn't work for our problem, which
is duplicated *subset* fonts. There are 20 or 30 copies of Helvetica
Regular alone in one of our typical runs, with a mixture of MacRoman,
Custom and WinAnsi encodings. They're drawn from the same two or three
copies of Helvetica from different sources, but each subset has a
different (though largely overlapping) glyph set. fop-pdf-image
correctly, but rather sub-optimally, copies each subset and references
it from the associated form XObject, creating working output but with
lots of wasted space and duplication. We can't just write the font out
the first time we see it and adjust all future references to point at
the copy we've already written, because, unlike with ICC profiles and
repeatedly used images, each copy is different.

I see two possible solutions to this problem. Both have the same
pre-requisites:

(1) A mechanism for image plugins to keep plugin-specific data
associated with a specific rendering run. A WeakHashMap<FOUserAgent,...>
works for this, though it isn't pretty.

(2) Code in the image plugin to record each use of each font and group
usages up into compatible groups so all font references in the group can
point to the same font in the output. This code can also collect up
glyph usage information, producing a map of which glyphs are required by
one or more content streams.

(3) A way to create a new embedded font in the output, either by
combining input subsets into a single new subset font object, or by
loading a whole font from disk and making a new subset containing just
the required glyphs.

(4) Some way to be notified, at minimum, just before the xref table is
going to be written out, so the new font can be written to the output
stream. The new font can't be written until we know the last embedded
PDF has been written out, because a future PDF might use additional
glyphs that must be added to the subset.

(5) [Optional but useful] Smarter font loading, where more than just
(family, weight, slant) 3-tuples are used to match fonts, so I can use
fop's font loading and cache code to see whether there's a whole font
available to fop that can be substituted for an embedded subset. For
example, I might need to match Myriad Pro Ultrabold Italic SemiCond,
a small-caps variant face, or similar, with no confusion between
different condensed/expanded versions of the same face, different
specialist variants, etc. Right now fop's font matching code simply
cannot do that, so I can't really create new font subsets as an
alternative for (3) and have to try to combine subsets from the input
instead.


I have (1) working, and I have a prototype of (2) that dumps font
usage data for a run, including a glyph usage map. I was trying to
avoid (3) for Base14 fonts by just replacing the Resources reference
to the font with a Base14 font reference, but PDF readers seem to
choke on this for reasons I haven't yet determined.
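For reference, the grouping and glyph-usage tracking in (2) amounts to
something like the following sketch. The class and key scheme are
hypothetical; the real prototype keys off more of the embedded font's
descriptor, but stripping the six-letter subset tag before the '+'
illustrates how compatible subsets end up in one group:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical tracker: groups compatible subsets and unions their glyph usage. */
class FontUsageTracker {
    private final Map<String, BitSet> usage = new HashMap<>();

    /** Key compatible subsets together, e.g. by base font name plus encoding. */
    private static String groupKey(String baseFont, String encoding) {
        // "ABCDEF+Helvetica" and "GHIJKL+Helvetica" belong to the same
        // group: strip the subset tag before the '+'.
        int plus = baseFont.indexOf('+');
        String stripped = plus >= 0 ? baseFont.substring(plus + 1) : baseFont;
        return stripped + "/" + encoding;
    }

    /** Record that a content stream used the given character codes from this font. */
    void recordUse(String baseFont, String encoding, int... codes) {
        BitSet glyphs = usage.computeIfAbsent(groupKey(baseFont, encoding),
                                              k -> new BitSet());
        for (int c : codes) {
            glyphs.set(c);
        }
    }

    /** Every code any stream needs from the group; a merged subset must cover these. */
    BitSet requiredGlyphs(String baseFont, String encoding) {
        return usage.getOrDefault(groupKey(baseFont, encoding), new BitSet());
    }
}
```

All references in a group can then be rewritten to point at one font
object whose glyph set is the union of the per-subset usage maps.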

(4) is the big problem. I can't do a proper implementation of (3)
without some way to write the produced font out at the end.

For (4) I'd really appreciate advice from the fop community. I need a
way for a plugin to hook into output just before the xref table is
written, so it can write new objects to the PDF stream. The object
numbers for the fonts to be written out will have been reserved the
first time each font was seen; I just need to write the data out and
record the offset for the xref entry. As the data to be written is not
known until the last embedded form XObject is known to have been
written, the hook must run before xref write-out.
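In other words, what I'm after for (4) is roughly this shape of hook.
All names here are hypothetical, and DeferredObjectWriter is just a
stand-in for fop's PDF document writer, not its actual API:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hook: called once, just before the xref table is written. */
interface PreXrefHook {
    /** Write any deferred objects (e.g. merged font subsets) to the output. */
    void beforeXref(DeferredObjectWriter out);
}

/** Minimal stand-in for the PDF document writer, to show the call order. */
class DeferredObjectWriter {
    private final List<PreXrefHook> hooks = new ArrayList<>();
    private final List<String> written = new ArrayList<>();
    private int nextObjNum = 1;

    /** Reserve an object number now; the object body is written later. */
    int reserveObjectNumber() { return nextObjNum++; }

    void addPreXrefHook(PreXrefHook hook) { hooks.add(hook); }

    void writeObject(int objNum, String body) {
        written.add(objNum + " 0 obj " + body);
    }

    /** Run the hooks, then "write" the xref; deferred objects land before it. */
    List<String> finish() {
        for (PreXrefHook h : hooks) {
            h.beforeXref(this);
        }
        written.add("xref");
        return written;
    }
}
```

The hook runs after all form XObjects have been written and before the
xref table, so the merged font subset is complete when it's serialized
and its reserved object number gets a valid offset.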

To resolve (5), the whole FontTriplet assumption must be ripped out of
the code and replaced with a more flexible representation of font info
that covers at minimum (family, weight, slant, expand/condense amount,
variant), and probably needs an extensible map of additional matching
characteristics for future-proofing too. This doesn't look like a fun
thing to do!
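As a rough illustration of the representation I mean (a hypothetical
class, not a proposed API), something like this would replace the
triplet as a matching key:

```java
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

/** Hypothetical richer replacement for fop's (family, weight, slant) FontTriplet. */
final class FontKey {
    final String family;
    final int weight;                // CSS-style 100..900
    final String slant;              // "normal", "italic", "oblique"
    final int stretch;               // percentage: 100 = normal, < 100 = condensed
    final String variant;            // "normal", "small-caps", ...
    final Map<String, String> extra; // extensible matching characteristics

    FontKey(String family, int weight, String slant, int stretch,
            String variant, Map<String, String> extra) {
        this.family = family;
        this.weight = weight;
        this.slant = slant;
        this.stretch = stretch;
        this.variant = variant;
        this.extra = new TreeMap<>(extra); // sorted, so equals/hashCode are stable
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof FontKey)) return false;
        FontKey k = (FontKey) o;
        return weight == k.weight && stretch == k.stretch
                && family.equals(k.family) && slant.equals(k.slant)
                && variant.equals(k.variant) && extra.equals(k.extra);
    }

    @Override public int hashCode() {
        return Objects.hash(family, weight, slant, stretch, variant, extra);
    }
}
```

With a key like this, a semi-condensed face can never be confused with
the normal-width version of the same family and weight.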


Right now, an answer to (4) would let me make progress on
de-duplicating fonts by attempting to combine subsets. I don't know if
I'll be able to combine subsets successfully, but I have more of a
chance of that than of making new subsets when I can't match fonts
reliably enough.

Ideas?
--
Craig Ringer
