Hi,

As requested by Mehdi Houshmand, I'm elaborating on the issue we've been running into with fop-pdf-image. I've asked about aspects of it on the list before, but now have a better understanding of what's going on.
Where input PDFs being used as form XObjects contain embedded subset fonts, I'm seeing many copies of those fonts embedded in the output document. This creates huge output files with lots of duplicate font data, and in a few cases has even crashed the RIP used by my work's offset press printer. I think they use a Fiery, but I struggle to get any more info than that out of them.

The issue is that fop-pdf-image copies PDFs into fop output PDFs by copying the content stream and resources dictionary verbatim from the page being extracted from the input PDF, translating it from PDFBox into fop PDF structures in the process. This is extremely reliable, ensuring that fop-pdf-image form XObjects don't conflict with or interfere with the embedding page, or vice versa. Unfortunately it also leads to massive duplication of data, including:

- Fonts, both subsets and fully embedded fonts
- Embedded ICC profiles, if present
- Images re-used across multiple pages or documents

In the case of images, ICC profiles, and fully embedded fonts it would be relatively easy to coalesce these so that all resources dictionaries refer to the same object. It's a little hacky because fop doesn't give image plugins any "official" way to store data about a rendering run for later reference, but it's easy enough to do by keeping a WeakHashMap<FOUserAgent,...> that associates object type and checksum data with a particular rendering run. I haven't implemented coalescing of images and profiles because it's not part of my problem space, but it shouldn't be too hard.

Unfortunately, the above approach doesn't work for our actual problem, which is duplicated *subset* fonts. There are 20 or 30 copies of Helvetica Regular alone in one of our typical runs, with a mixture of MacRoman, Custom and WinAnsi encodings. They're drawn from the same two or three copies of Helvetica from different sources, but each subset has a different (though largely overlapping) glyph set.
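To make the coalescing idea concrete, here's a rough sketch of the WeakHashMap trick described above. Plain Object stands in for FOUserAgent and for fop's PDF object types, and all the names (RunResourceCache, getOrPut) are illustrative, not real fop or fop-pdf-image API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.zip.CRC32;

// Hypothetical sketch: a per-rendering-run resource cache keyed weakly on
// the FOUserAgent (here just Object). Entries for a run disappear once fop
// drops its last reference to that FOUserAgent, so nothing leaks across runs.
class RunResourceCache {
    // One map of (type + checksum) -> already-written PDF object per run.
    private final Map<Object, Map<String, Object>> perRun = new WeakHashMap<>();

    /** Key a resource by its type tag plus a checksum of its raw bytes. */
    static String key(String type, byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return type + ":" + Long.toHexString(crc.getValue());
    }

    /**
     * Return the object previously written for this resource in this run,
     * or record {@code candidate} as the canonical copy and return it.
     */
    synchronized Object getOrPut(Object userAgent, String type, byte[] data,
                                 Object candidate) {
        return perRun.computeIfAbsent(userAgent, ua -> new HashMap<>())
                     .computeIfAbsent(key(type, data), k -> candidate);
    }
}
```

With something like this, the first time an ICC profile or image is seen its object is recorded, and every later occurrence with the same bytes gets the original reference back instead of a fresh copy. As noted above, this only works when repeated copies are byte-identical, which is exactly what fails for subset fonts.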
Fop-pdf-image correctly, but rather sub-optimally, copies each subset and references it from the associated form XObject, creating working output but lots of wasted space and duplication. We can't just write the font out the first time we see it and adjust all future references to point at the copy we've already written, because unlike repeatedly used ICC profiles and images, each font copy is different.

I see two possible solutions to this problem. Both have the same prerequisites:

(1) A mechanism for image plugins to keep plugin-specific data associated with a specific rendering run. A WeakHashMap<FOUserAgent,...> works for this, though it isn't pretty.

(2) Code in the image plugin to record each use of each font and group usages into compatible groups, so that all font references in a group can point to the same font in the output. This code can also collect glyph usage information, producing a map of which glyphs are required by one or more content streams.

(3) A way to create a new embedded font in the output, either by combining input subsets into a single new subset font object, or by loading a whole font off disk and making a new subset from it containing just the required glyphs.

(4) Some way to be notified, at minimum, just before the xref table is going to be written out, so the new font can be written to the output stream. The new font can't be written until we know the last embedded PDF has been written out, because a future PDF might use additional glyphs that must be added to the subset.

(5) [Optional but useful] Smarter font loading, where more than just (family, weight, slant) 3-tuples are used to match fonts, so I can use fop's font loading and cache code to see whether there's a whole font available to fop that can be substituted for an embedded subset.
For example, I might need to match Myriad Pro Ultrabold Italic SemiCond, a small-caps variant face, or similar, with no confusion between different condensed/expanded versions of the same face, different specialist variants, etc. Right now fop's font matching code simply cannot do that, so I can't really create new font subsets as an alternative for (3), and have to try to combine subsets from the input instead.

I have (1) working, and I have a prototype of (2) that dumps font usage data for a run, including a glyph usage map. I was trying to avoid (3) for Base14 fonts by just replacing the Resources reference to the font with a Base14 font reference, but PDF readers seem to choke on this for reasons I haven't yet determined.

(4) is the big problem. I can't do a proper implementation of (3) without some way to write the produced font out at the end. For (4) I'd really appreciate advice from the fop community. I need a way for a plugin to hook into output just before the xref table is written, so it can write new objects to the PDF stream. The object numbers for the fonts to be written out will have been reserved the first time each font was seen; I just need to write the data out and record the offset for the xref entry. As the data to be written is not known until the last embedded form XObject is known to have been written, the hook must run before xref write-out.

To resolve (5), the whole FontTriplet assumption would have to be ripped out of the code and replaced with a more flexible representation of font info that is at minimum (family, weight, slant, expand/condense amount, variant), and probably needs an extensible map of additional matching characteristics for future-proofing too. This doesn't look like a fun thing to do! Right now, an answer to (4) would give me a chance to progress on de-duplicating fonts by attempting to combine subsets.
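For what it's worth, the usage-recording side of (2) can be sketched quite simply. This is an illustrative mock-up, not my actual prototype: the grouping rule here (same base font name after stripping the six-letter subset tag, same encoding) and all class/method names are assumptions for the sake of the example:

```java
import java.util.*;

// Hypothetical sketch of prerequisite (2): record each use of an embedded
// subset font and merge uses of "compatible" subsets into one group that
// accumulates the union of glyph codes the content streams need.
class FontUsageCollector {

    static class Group {
        final String baseName;
        final String encoding;
        final Set<Integer> glyphsUsed = new TreeSet<>();  // union of glyph codes

        Group(String baseName, String encoding) {
            this.baseName = baseName;
            this.encoding = encoding;
        }
    }

    private final Map<String, Group> groups = new LinkedHashMap<>();

    /** Strip a subset tag such as "ABCDEF+Helvetica" down to "Helvetica". */
    static String stripSubsetTag(String psName) {
        return psName.matches("[A-Z]{6}\\+.+") ? psName.substring(7) : psName;
    }

    /** Record one usage: which font, which encoding, which glyphs it draws. */
    void recordUsage(String psName, String encoding, Collection<Integer> glyphs) {
        String base = stripSubsetTag(psName);
        groups.computeIfAbsent(base + "/" + encoding,
                               k -> new Group(base, encoding))
              .glyphsUsed.addAll(glyphs);
    }

    Collection<Group> groups() {
        return groups.values();
    }
}
```

Each Group's glyphsUsed is the glyph usage map for that family/encoding pair, which is what (3) would then need to turn into a single combined subset. Real compatibility checking would of course have to look at more than the name and encoding (e.g. FontDescriptor contents), but the bookkeeping shape is the same.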
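As a strawman for what replacing the FontTriplet assumption in (5) might look like, here's a richer match key along the lines described above. This is purely illustrative; the field names and the choice of string-valued stretch/variant fields are my assumptions, not anything in fop:

```java
import java.util.*;

// Hypothetical sketch for (5): a match key carrying at minimum
// (family, weight, slant, expand/condense amount, variant), plus an
// extensible map of extra characteristics for future-proofing.
final class FontKey {
    final String family;      // e.g. "Myriad Pro"
    final int weight;         // CSS-style 100..900
    final String slant;       // "normal" | "italic" | "oblique"
    final String stretch;     // e.g. "semi-condensed"
    final String variant;     // e.g. "small-caps"
    final Map<String, String> extras;  // e.g. optical size, specialist cuts

    FontKey(String family, int weight, String slant, String stretch,
            String variant, Map<String, String> extras) {
        this.family = family;
        this.weight = weight;
        this.slant = slant;
        this.stretch = stretch;
        this.variant = variant;
        this.extras = Collections.unmodifiableMap(new TreeMap<>(extras));
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FontKey)) {
            return false;
        }
        FontKey k = (FontKey) o;
        return family.equals(k.family) && weight == k.weight
                && slant.equals(k.slant) && stretch.equals(k.stretch)
                && variant.equals(k.variant) && extras.equals(k.extras);
    }

    @Override
    public int hashCode() {
        return Objects.hash(family, weight, slant, stretch, variant, extras);
    }
}
```

With a key like this, Myriad Pro Ultrabold Italic SemiCond and its small-caps variant compare as distinct, so a whole-font lookup can't silently return the wrong cut the way a (family, weight, slant) 3-tuple can.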
I don't know if I'll be able to successfully combine subsets, but I have more of a chance of that than of making new subsets when I can't match fonts reliably enough.

Ideas?

-- Craig Ringer
