On 21/12/2011 5:07 PM, Chris Bowditch wrote:
FOP can't currently fully embed a font in PDF, so even if you had the source font available the code changes required could be extensive. For us, this approach isn't an option because we don't have the source font to register in fop.xconf and embed. Therefore I am interested in knowing what you've come up with in terms of merging subsets together to create 1 super subset. That in my view is the most difficult challenge in this problem. Resolving the problems with the cross references and the point at which IDs are assigned should be solvable with a little code refactoring. I'm sure one of the guys will speak up if that's not the case.
As yet I haven't begun to tackle the actual merging of Type 1 or TrueType subsets into a single font. I've done the accumulation and merging of the widths arrays, but not the fonts themselves. I plan to make new minimum subsets from local fonts if they're available, and will try merging the actual embedded font files only if I can't get that to work or if I have time. I don't know font data structures well enough to want to merge embedded subset font files if I can possibly avoid it.
I've just finished writing and testing the code to accumulate information on each font as it's encountered in a source PDF and merge it into a collection of font information keyed by (FontName,SubType,Encoding). I compare the metrics to ensure that the fonts are really compatible, and if they are I merge the widths arrays and the start/end character codes (FirstChar/LastChar) to produce combined information. At the end of the run I can now produce a font dictionary and font descriptor for the minimum subset needed by all of the embedded documents that use that font.
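A minimal sketch of that accumulation step, written in Python for brevity rather than fop's Java. The names FontUsage, merge_usage and accumulate are hypothetical, and treating a zero width as "glyph not present in this subset" is an assumption made for illustration, not something the real code is known to do:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical key type: (FontName, SubType, Encoding)
FontKey = Tuple[str, str, str]


@dataclass
class FontUsage:
    first_char: int    # FirstChar from the font dictionary
    last_char: int     # LastChar from the font dictionary
    widths: List[int]  # one width per code in first_char..last_char

    def width_of(self, code: int) -> int:
        """Width recorded for a character code, or 0 if outside the range."""
        if self.first_char <= code <= self.last_char:
            return self.widths[code - self.first_char]
        return 0


def merge_usage(a: FontUsage, b: FontUsage) -> FontUsage:
    """Union two widths arrays, refusing to merge if any width disagrees."""
    first = min(a.first_char, b.first_char)
    last = max(a.last_char, b.last_char)
    widths = []
    for code in range(first, last + 1):
        wa, wb = a.width_of(code), b.width_of(code)
        # Assumption: 0 means "glyph absent from this subset".
        if wa and wb and wa != wb:
            raise ValueError(f"incompatible width for code {code}: {wa} vs {wb}")
        widths.append(wa or wb)
    return FontUsage(first, last, widths)


def accumulate(registry: Dict[FontKey, FontUsage], key: FontKey, usage: FontUsage) -> None:
    """Fold one font's usage into the per-key registry."""
    existing = registry.get(key)
    registry[key] = usage if existing is None else merge_usage(existing, usage)
```

Feeding it two overlapping subsets of the same font should leave one entry whose range and widths cover both, which is the information needed to build a single font dictionary at the end of the run.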
I can report on font usage, glyph usage within each font, and potential size savings, but I don't yet have it actually replacing the fonts. That's what I'll be working on today. First I'll try to use fop's font embedding mechanism to do it, which will require adding some callbacks to fop's PDF output that run just before the resource dictionary is written out, so I can inform fop of the required glyphs. I'll delay writing all the xobject resource dictionaries until after the fop resource dictionary is written, so I know the fop font oids and can embed them in the xobject resource dictionaries. With luck I'll be able to write the minimum subset, but I haven't looked into fop's font embedding code in enough detail to be sure exactly what I can do or how, so I'll be delving into it shortly.
If this approach works, the next step will be to allocate font object IDs early, so I don't need to waste memory on delaying xobject resource dictionary writes and so I can avoid writing keys for fonts fop itself never uses to fop's resource dictionary.
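To illustrate the idea of early ID allocation, here is a rough Python sketch. Nothing in it is fop API: FontIdAllocator and xobject_font_resources are hypothetical names, and the generation number is assumed to be 0. The point is only that once an object number is reserved for a font, an xobject's resource dictionary can refer to it immediately, before the font object itself has been written:

```python
from typing import Dict, Tuple

# Hypothetical key type: (FontName, SubType, Encoding)
FontKey = Tuple[str, str, str]


class FontIdAllocator:
    """Reserves a PDF object number for each font the first time it is seen."""

    def __init__(self, next_object_number: int) -> None:
        self._next = next_object_number
        self._reserved: Dict[FontKey, int] = {}

    def reference(self, key: FontKey) -> str:
        """Return an indirect reference such as '12 0 R' for the font."""
        if key not in self._reserved:
            self._reserved[key] = self._next
            self._next += 1
        return f"{self._reserved[key]} 0 R"

    def pending(self) -> Dict[FontKey, int]:
        """Fonts whose objects still have to be written at the end of the run."""
        return dict(self._reserved)


def xobject_font_resources(allocator: FontIdAllocator, fonts: Dict[str, FontKey]) -> str:
    """Build an xobject's /Font resource entry without waiting for the fonts."""
    entries = " ".join(
        f"/{name} {allocator.reference(key)}" for name, key in fonts.items()
    )
    return f"<< /Font << {entries} >> >>"
```

With this shape, each xobject resource dictionary can be written as soon as the xobject is imported, and the fonts listed by pending() are emitted once at the end under the object numbers already handed out.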
Yesterday I attempted to unembed base-14 fonts during import of PDF content, so I'd recognise embedded Type 1 fonts like Helvetica and replace the embedded font's dictionary with a font dictionary that references the base-14 font. Acrobat choked on the result even though it looked OK structurally. I'm not sure quite what was wrong, but I hope to have more luck with re-embedding than with replacement by a base-14 font.
On a side note, I also need to enhance the font info collection code so it keys on more of the font metrics. Currently the first font with a given (FontName,SubType,Encoding) tuple is registered for that key, and if subsequent fonts with the same key but incompatible metrics are encountered they're copied over verbatim, exactly as is currently the case. Expanding the key to cover the font bbox, ascent, descent, etc. will help solve that and won't be hard; I'm just leaving it until I have a proof-of-concept font re-embed working.
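A small sketch of what such an expanded key could look like, again in Python with hypothetical names (FontKey, register). Which metrics belong in the key is an assumption here; the bbox, ascent and descent fields are simply the ones mentioned above, and the values in the usage note are illustrative:

```python
from typing import Dict, List, NamedTuple, Tuple


class FontKey(NamedTuple):
    """Hypothetical expanded key: the original tuple plus descriptor metrics."""
    font_name: str
    subtype: str
    encoding: str
    bbox: Tuple[int, int, int, int]  # FontBBox from the font descriptor
    ascent: int
    descent: int


def register(registry: Dict[FontKey, List[str]], key: FontKey, source: str) -> bool:
    """Record that `source` uses the font described by `key`.

    Returns True if the font joined an existing entry, False if it
    started a new one because no registered font has identical metrics.
    """
    merged = key in registry
    registry.setdefault(key, []).append(source)
    return merged
```

Two fonts named Helvetica whose ascent differs, say 718 and 720, would then land under separate keys and each could still be merged with later fonts that match it, rather than every mismatch after the first being copied over untouched.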
-- Craig Ringer