My reply is interleaved below, but there's something important to cover
before reading on.
There's clearly a difference between what I mean by de-duplication and
what you think I mean by it. As far as I can tell you're looking at font
substitution and un/re-embedding, where (e.g.) Helvetica LT Std is
replaced with Helvetica Neue Sans, with a different version of Helvetica
LT Std, with the built-in Helvetica derived from Adobe's multi-master
fonts, or whatever. The replacement font might not have matching metrics
and certainly wouldn't be identical.
That's *not* what I'm talking about. I'm talking about the case where
multiple embedded subsets derived from the *exact* *same* *font* exist,
each containing partially overlapping sets of glyphs where each glyph is
*identical* to those in the other subsets.
This is best illustrated by example. Take three input PDFs that are
being placed as images (say, engineering diagrams, advertisements or
breakouts in a layout, or whatever), named "1.pdf", "2.pdf" and "3.pdf"
that will be written into "out.pdf". For the sake of this example,
presume that content in "out.pdf" uses "Arial Regular" for its own text
so that font must also be embedded.
1.pdf:
    Helvetica Neue Sans subset [a cde h]
    Utopia Black [abcd]
2.pdf:
    Helvetica Neue Sans subset [abcde ]
    Helvetica LT Std [ab def ijk]
3.pdf:
    Helvetica Neue Sans subset [ c efgh]
Desired output is:
out.pdf:
    Helvetica Neue Sans subset [abcdefgh]
    Utopia Black [abcd]
    Helvetica LT Std [ab def ijk]
    Arial Regular (whatever the text in out.pdf requires)
Fop and fop-pdf-image currently produce:
out.pdf:
    Helvetica Neue Sans subset [a cde h]
    Helvetica Neue Sans subset [abcde ]
    Helvetica Neue Sans subset [ c efgh]
    Utopia Black [abcd]
    Helvetica LT Std [ab def ijk]
    Arial Regular (whatever the text in out.pdf requires)
... meaning that there are 3 copies of h.n.s "c" and "e", plus 2 copies
of "a", "d" and "h", from *identical* fonts (presuming each input had
the same version of h.n.s, as verified by metrics or, for the truly
paranoid, even glyph data checksums). You appear to think I want to
produce:
out.pdf:
    Helvetica Neue Sans [abcdefghijk]
    Utopia Black [abcd]
    Arial Regular (whatever the text in out.pdf requires)
or even:
out.pdf:
    Arial Regular (out.pdf glyph usage plus [abcdefghijk])
    Utopia Black [abcd]
... where Helvetica Neue Sans and Helvetica LT Std are "de-duplicated"
despite not being true duplicates of each other, or in the latter case
both are replaced with the "equivalent" (approximately) Arial Regular.
That is *not* what I want; that would be completely incorrect to do
automatically.
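To make the merge I *do* want concrete, here's a rough sketch in plain
Java (invented names, nothing from fop's or fop-pdf-image's actual
APIs): key each embedded font on its exact identity - name, metrics
digest, and for the truly paranoid a glyph data digest - and union the
required glyph sets per key.

    import java.util.*;

    // Illustration only - not fop/fop-pdf-image code. A font is only
    // merge-eligible when name, metrics and (optionally) glyph data
    // checksum all match exactly.
    final class FontKey {
        final String postScriptName;   // e.g. "HelveticaNeueSans"
        final String metricsDigest;    // digest of /Widths + FontDescriptor
        final String glyphDataDigest;  // optional paranoid extra

        FontKey(String name, String metrics, String glyphs) {
            this.postScriptName = name;
            this.metricsDigest = metrics;
            this.glyphDataDigest = glyphs;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof FontKey)) return false;
            FontKey k = (FontKey) o;
            return postScriptName.equals(k.postScriptName)
                && metricsDigest.equals(k.metricsDigest)
                && glyphDataDigest.equals(k.glyphDataDigest);
        }
        @Override public int hashCode() {
            return Objects.hash(postScriptName, metricsDigest, glyphDataDigest);
        }
    }

    final class SubsetMerger {
        // Accumulated glyph requirements per *identical* font.
        private final Map<FontKey, SortedSet<Integer>> needed = new HashMap<>();

        // Called once per embedded subset found in an input PDF.
        void addSubset(FontKey font, Set<Integer> glyphIds) {
            needed.computeIfAbsent(font, k -> new TreeSet<>()).addAll(glyphIds);
        }

        // At end of output: one merged subset per distinct font.
        Map<FontKey, SortedSet<Integer>> mergedSubsets() {
            return needed;
        }
    }

Fed the three Helvetica Neue Sans subsets from 1.pdf, 2.pdf and 3.pdf, a
merger like that ends up with a single [abcdefgh] requirement for that
font; Utopia Black and Helvetica LT Std get different keys, so they pass
through untouched.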
On 03/06/2012 07:08 PM, mehdi houshmand wrote:
> Font de-duping is intrinsically a post-process action, you need the
> full document, with all fonts, before you can do any font de-duping.
> PostScript does this very thing (to a much lesser extent) with the
> <optimize-resources> tag, as a post-process action.
I absolutely disagree that font optimization must be done in a second pass.
Font de-duplication requires knowledge of all the fonts in the document,
yes. That doesn't make it necessarily a post-process operation. PDF is a
wonderfully non-linear format, and it's trivial to delay writing out
fonts until the end of the document. PDF simply doesn't care where the
fonts appear in the document. Once you know the last content stream has
been written out (say, just before you write the xref tables) you know
no more new glyphs will be used and no new fonts will be referenced, so
you can write out the fonts you need.
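To sketch what I mean in Java (a hand-rolled skeleton with invented
names, not fop's PDFDocument and nowhere near a complete writer): hand
out indirect object numbers for fonts early, let the pages reference
them, and only serialise the font objects after the last content
stream, just before the xref.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    // Sketch only. Pages can reference a font's object number long before
    // the font object itself is written; PDF doesn't care about order.
    final class OnePassPdfWriter {
        interface DeferredFont {
            int objectNumber();
            byte[] serialiseMergedSubset();  // knows its accumulated glyphs
        }

        private final OutputStream out;
        private long bytesWritten = 0;
        private final TreeMap<Integer, Long> xrefOffsets = new TreeMap<>();
        private final List<DeferredFont> deferredFonts = new ArrayList<>();
        private int nextObjectNumber = 1;

        OnePassPdfWriter(OutputStream out) throws IOException {
            this.out = out;
            write("%PDF-1.5\n".getBytes(StandardCharsets.US_ASCII));
        }

        int allocateObjectNumber() { return nextObjectNumber++; }

        void deferFont(DeferredFont font) { deferredFonts.add(font); }

        // Pages, content streams etc. are written as they're produced.
        void writeObject(int objectNumber, byte[] body) throws IOException {
            xrefOffsets.put(objectNumber, bytesWritten);
            write(body);
        }

        // Called once the last content stream is out: no new glyphs can
        // appear, so the (merged) font subsets can finally be emitted.
        void finish() throws IOException {
            for (DeferredFont font : deferredFonts) {
                writeObject(font.objectNumber(), font.serialiseMergedSubset());
            }
            long xrefStart = bytesWritten;
            // (a real writer would emit the xref table and trailer here,
            //  built from xrefOffsets)
            write(("startxref\n" + xrefStart + "\n%%EOF\n")
                    .getBytes(StandardCharsets.US_ASCII));
        }

        private void write(byte[] bytes) throws IOException {
            out.write(bytes);
            bytesWritten += bytes.length;
        }
    }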
The only operation in PDF that is (almost) forced to be done as a
post-process is writing out linearised ("fast web view" or "web
optimized") PDF. That's because web-optimized PDF must have a partial
xref table and the trailer dictionary near the *start* of the file. It's
actually still possible to create linearised PDF by streaming it out in
a single pass, but you need to know more in advance about what you'll be
writing out, so in practice it's much simpler to linearise by
post-processing.
> Also, the requirements aren't clear here, what is it we want here? Let
> me validate that, this shouldn't change the (I guess we can call it)
> "canonical" PDF document. By that I mean if you rasterized a PDF
> before and after this change they should be identical,
> pixel-for-pixel.
I agree.
> When Acrobat does the font de-duping (I don't
> remember how much control it gives you, but if there are levels of
> de-duping I would have chosen the most aggressive), the documents
> aren't identical.
That's because it's actually substituting fonts, replacing one font with
another with non-identical metrics. That's not what I want to do, I want
to *merge* overlapping subsets of fonts with identical metrics. Since
the font dictionary gives the metric information it's practical to do
this. If fonts don't have the same metrics, you don't de-dupe them
because they're not duplicates.
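Something along these lines would be enough to decide whether two
embedded simple fonts really are subsets of the exact same font (a
sketch only; the field names mirror the font dictionary and
FontDescriptor entries, the Java types and class names are mine):

    import java.util.Arrays;
    import java.util.Map;

    // Plain data holder for what we'd pull out of an embedded font's
    // dictionary; sketch only.
    final class EmbeddedFontInfo {
        String baseFontName;          // e.g. "ABCDEF+HelveticaNeueSans"
        Map<Integer, Double> widths;  // char code -> width from /Widths
        double ascent, descent, capHeight, italicAngle;
        double[] fontBBox;
        byte[] glyphDataDigest;       // optional: digest of the font program

        // Strip the "ABCDEF+" subset tag so subsets of the same font match.
        String canonicalName() {
            return baseFontName.matches("^[A-Z]{6}\\+.*")
                    ? baseFontName.substring(7) : baseFontName;
        }
    }

    final class DuplicateFontCheck {
        // True only when the two look like subsets of the exact same font:
        // same name, same descriptor metrics, and the same width for every
        // character code they both cover.
        static boolean sameUnderlyingFont(EmbeddedFontInfo a, EmbeddedFontInfo b) {
            if (!a.canonicalName().equals(b.canonicalName())) return false;
            if (a.ascent != b.ascent || a.descent != b.descent
                    || a.capHeight != b.capHeight
                    || a.italicAngle != b.italicAngle) return false;
            if (!Arrays.equals(a.fontBBox, b.fontBBox)) return false;
            for (Map.Entry<Integer, Double> e : a.widths.entrySet()) {
                Double other = b.widths.get(e.getKey());
                if (other != null && !other.equals(e.getValue())) return false;
            }
            // For the truly paranoid: also compare glyph data checksums.
            return a.glyphDataDigest == null || b.glyphDataDigest == null
                    || Arrays.equals(a.glyphDataDigest, b.glyphDataDigest);
        }
    }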
"Optimizing" a PDF by substituting one font for another is a completely
different and much bigger job. Replacement of one font with another
non-identical font is a different job that may require rewriting of
content streams (for encoding differences), the production of multiple
font dictionaries with different encodings to remap different content
streams to use one font file, etc. It's hairy and complicated and I
don't want to go there.
> There are aberrations caused by slight kerning
> differences between various versions of Arial. This may seem trivial
> when compared to bloated PDFs, but it looks tacky and lowers the high
> standard of documents.
If the metrics don't match, they're not the same font and they don't get
merged. The glyph metrics in the font dictionary should be sufficient to
handle this.
Having three partial subsets of Arial in a document, each slightly
different versions with slightly different metrics, is something I can
live with. The problem arises when you have 10 different
mostly-overlapping subsets of the *exact* *same* *glyph* *data* from
each of those, leaving you with *30* small-ish copies of Arial instead
of 3 slightly larger ones.
> The other issue is you have subset fonts created by FOP as well as
> those imported by the pdf-image-plugin. You'd have to create some
> bridge between the image loading framework and the font loading system
> *cough* HACK *cough*.
Only if you want to handle de-dupe between fop-loaded fonts and fonts
loaded from input PDFs. I don't think that's particularly vital, but it
might not be as bad as you think either.
The font matching and subset merging system required for pdf-image to
de-dupe fonts would have to track glyph metrics, font names, etc for
every font seen, and would need to accumulate information on needed
glyphs, etc until the end of output generation just before the xref is
written. Fop must maintain used-glyph information as it stands, and
already knows glyph metrics, so it's entirely practical for it to report
that into the same system. From there, it's not too much of a stretch to
see pdf-image recognising that fop is going to embed a font with the
same name and metrics already and just merging its required-glyph list
with fop's before fop generates the subset.
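If it helps, the bridge might be nothing more than a shared registry
along these lines (pure speculation on my part, none of these types
exist in fop today; FontKey as in the earlier sketch):

    import java.util.Collection;
    import java.util.Set;
    import java.util.SortedSet;

    // Speculative interface only.
    interface GlyphRequirementRegistry {

        // fop's own text layout reports the glyphs it has used.
        void requireForFopFont(FontKey font, Set<Integer> glyphIds);

        // fop-pdf-image reports the glyphs used by a subset it is
        // importing from an input PDF, along with the embedded font
        // program so identity can be verified.
        void requireForImportedSubset(FontKey font, Set<Integer> glyphIds,
                                      byte[] embeddedFontProgram);

        // Called once, just before the xref is written: every font and
        // every glyph requirement is now known, so requirements against
        // the *same* font can be merged into a single subset.
        Collection<MergedFontSubset> flush();
    }

    final class MergedFontSubset {
        final FontKey font;
        final SortedSet<Integer> glyphIds;

        MergedFontSubset(FontKey font, SortedSet<Integer> glyphIds) {
            this.font = font;
            this.glyphIds = glyphIds;
        }
    }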
That's a significantly bigger project, though. Just being able to merge
completely redundant glyph subsets where the glyph data and metrics are
exactly identical between partially overlapping subsets being loaded by
fop-pdf-image would be a nice start.
The best thing about all this is that it's practical to do it progressively.
> Alternatively, just thinking aloud here, if this
> was done as a post-process *wink* *wink* *wry smile*...
While it can be done in post-process, I'm really not convinced it's
necessary. FOP handles image scaling and resampling - why don't we do
that in post-process, too? Just generate a monstrously huge PDF full of
uncompressed images, then re-sample later?
The answer seems to be because it's practical to do it in one pass, it's
nicer for users, and it works well.
Why does fop have font subsetting support? Subsetting can be done in
post-process, all you have to do is read the content streams and
determine which glyphs are used, then rewrite the font. It's done in a
single pass because it's *much* easier to implement that way, when fop
already knows the glyphs it's used. Same deal: it could be done in a
post pass, but it isn't because it doesn't make sense to do so.
Font replacement and the substitution of non-identical fonts should be
done in post, because it's not practical to do them in a way that's
going to be easy, reliable and automatic, nor are there any obvious
correct choices. We don't know if the document designer wants to replace
their own copy of Helvetica with Adobe's multi-master version. On the
other hand, it's pretty bloody obvious that the user won't want 100
copies each of "abcdefg...." glyphs from Helvetica LT Std that are
*exactly* *the* *same* when they can have just one copy of each with no
effect on document display.
> Apologies if I may seem to be argumentative here, it's not my
> intention, but I feel this would be serious scope creep. I see the
> pdf-image-plugin as a plugin that treats PDFs as images, nothing more.
> If you want to stitch together PDFs, PDFBox is designed just for that.
The trouble is that fop-pdf-image exists because PDFs aren't just
images. If they were, it'd be much easier to just rasterise them and
import them in raster form.
FWIW, I'm not trying to use fop to "stitch together PDFs" - not in the
sense of trying to use it to append, n-up, impose, etc complex PDF
documents. I'm using small PDFs that are basically "images" - but
represented as a combination of raster, text and bitmap data that should
be included in the output document as efficiently as possible and
without loss of fidelity. IOW, exactly what fop-pdf-image is for.
--
Craig Ringer