My reply is interleaved below, but there's something important to cover
before reading on.
There's clearly a difference between what I mean by de-duplication and
what you think I mean by it. As far as I can tell you're looking at font
substitution and un/re-embedding, where (e.g.) Helvetica LT Std is
replaced with Helvetica Neue Sans, with a different version of Helvetica
LT Std, with the built-in Helvetica derived from Adobe's multi-master
fonts, or whatever. The replacement font might not have matching metrics
and certainly wouldn't be identical.
That's *not* what I'm talking about. I'm talking about the case where
multiple embedded subsets derived from the *exact* *same* *font* exist,
each containing partially overlapping sets of glyphs where each glyph is
*identical* to those in the other subsets.
This is best illustrated by example. Take three input PDFs that are
being placed as images (say, engineering diagrams, advertisements or
breakouts in a layout, or whatever), named "1.pdf", "2.pdf" and "3.pdf"
that will be written into "out.pdf". For the sake of this example,
presume that content in "out.pdf" uses "Arial Regular" for its own text
so that font must also be embedded.
1.pdf:
    Helvetica Neue Sans subset [a cde h]
    Utopia Black [abcd]
2.pdf:
    Helvetica Neue Sans subset [abcde ]
    Helvetica LT Std [ab def ijk]
3.pdf:
    Helvetica Neue Sans subset [ c efgh]
Desired output is:
out.pdf:
    Helvetica Neue Sans subset [abcdefgh]
    Utopia Black [abcd]
    Helvetica LT Std [ab def ijk]
    Arial Regular (whatever the text in out.pdf requires)
Fop and fop-pdf-image currently produce:
out.pdf:
    Helvetica Neue Sans subset [a cde h]
    Helvetica Neue Sans subset [abcde ]
    Helvetica Neue Sans subset [ c efgh]
    Utopia Black [abcd]
    Helvetica LT Std [ab def ijk]
    Arial Regular (whatever the text in out.pdf requires)
... meaning that there are 3 copies of h.n.s "c" and "e", plus 2 copies
of "a", "d" and "h", from *identical* fonts (presuming each input had
the same version of h.n.s, as verified by metrics or, for the truly
paranoid, even glyph data checksums). You appear to think I want to
produce:
out.pdf:
    Helvetica Neue Sans [abcdefghijk]
    Utopia Black [abcd]
    Arial Regular (whatever the text in out.pdf requires)
or even:
out.pdf:
    Arial Regular (out.pdf glyph usage plus [abcdefghijk])
    Utopia Black [abcd]
... where Helvetica Neue Sans and Helvetica LT Std are "de-duplicated"
despite not being true duplicates of each other, or in the latter case
both are replaced with the "equivalent" (approximately) Arial Regular.
That is *not* what I want; that would be completely incorrect to do
automatically.
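To make the merge I *do* want concrete, here's a rough sketch in plain
Java (invented names, nothing from fop's or fop-pdf-image's actual
APIs): key each embedded font on its exact identity - name, metrics
digest, and for the truly paranoid a glyph data digest - and union the
required glyph sets per key.

    import java.util.*;

    // Illustration only - not fop/fop-pdf-image code. A font is only
    // merge-eligible when name, metrics and (optionally) glyph data
    // checksum all match exactly.
    final class FontKey {
        final String postScriptName;   // e.g. "HelveticaNeueSans"
        final String metricsDigest;    // digest of /Widths + FontDescriptor
        final String glyphDataDigest;  // optional paranoid extra

        FontKey(String name, String metrics, String glyphs) {
            this.postScriptName = name;
            this.metricsDigest = metrics;
            this.glyphDataDigest = glyphs;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof FontKey)) return false;
            FontKey k = (FontKey) o;
            return postScriptName.equals(k.postScriptName)
                && metricsDigest.equals(k.metricsDigest)
                && glyphDataDigest.equals(k.glyphDataDigest);
        }
        @Override public int hashCode() {
            return Objects.hash(postScriptName, metricsDigest, glyphDataDigest);
        }
    }

    final class SubsetMerger {
        // Accumulated glyph requirements per *identical* font.
        private final Map<FontKey, SortedSet<Integer>> needed = new HashMap<>();

        // Called once per embedded subset found in an input PDF.
        void addSubset(FontKey font, Set<Integer> glyphIds) {
            needed.computeIfAbsent(font, k -> new TreeSet<>()).addAll(glyphIds);
        }

        // At end of output: one merged subset per distinct font.
        Map<FontKey, SortedSet<Integer>> mergedSubsets() {
            return needed;
        }
    }

Fed the three Helvetica Neue Sans subsets from 1.pdf, 2.pdf and 3.pdf, a
merger like that ends up with a single [abcdefgh] requirement for that
font; Utopia Black and Helvetica LT Std get different keys, so they pass
through untouched.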
On 03/06/2012 07:08 PM, mehdi houshmand wrote:
> Font de-duping is intrinsically a post-process action, you need the
> full document, with all fonts, before you can do any font de-duping.
> PostScript does this very thing (to a much lesser extent) with the
> <optimize-resources> tag, as a post-process action.
I absolutely disagree that font optimization must be done in a second pass.
Font de-duplication requires knowledge of all the fonts in the document,
yes. That doesn't make it necessarily a post-process operation. PDF is a
wonderfully non-linear format, and it's trivial to delay writing out
fonts until the end of the document. PDF simply doesn't care where the
fonts appear in the document. Once you know the last content stream has
been written out (say, just before you write the xref tables) you know
no more new glyphs will be used and no new fonts will be referenced, so
you can write out the fonts you need.
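To sketch what I mean in Java (a hand-rolled skeleton with invented
names, not fop's PDFDocument and nowhere near a complete writer): hand
out indirect object numbers for fonts early, let the pages reference
them, and only serialise the font objects after the last content
stream, just before the xref.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    // Sketch only. Pages can reference a font's object number long before
    // the font object itself is written; PDF doesn't care about order.
    final class OnePassPdfWriter {
        interface DeferredFont {
            int objectNumber();
            byte[] serialiseMergedSubset();  // knows its accumulated glyphs
        }

        private final OutputStream out;
        private long bytesWritten = 0;
        private final TreeMap<Integer, Long> xrefOffsets = new TreeMap<>();
        private final List<DeferredFont> deferredFonts = new ArrayList<>();
        private int nextObjectNumber = 1;

        OnePassPdfWriter(OutputStream out) throws IOException {
            this.out = out;
            write("%PDF-1.5\n".getBytes(StandardCharsets.US_ASCII));
        }

        int allocateObjectNumber() { return nextObjectNumber++; }

        void deferFont(DeferredFont font) { deferredFonts.add(font); }

        // Pages, content streams etc. are written as they're produced.
        void writeObject(int objectNumber, byte[] body) throws IOException {
            xrefOffsets.put(objectNumber, bytesWritten);
            write(body);
        }

        // Called once the last content stream is out: no new glyphs can
        // appear, so the (merged) font subsets can finally be emitted.
        void finish() throws IOException {
            for (DeferredFont font : deferredFonts) {
                writeObject(font.objectNumber(), font.serialiseMergedSubset());
            }
            long xrefStart = bytesWritten;
            // (a real writer would emit the xref table and trailer here,
            //  built from xrefOffsets)
            write(("startxref\n" + xrefStart + "\n%%EOF\n")
                    .getBytes(StandardCharsets.US_ASCII));
        }

        private void write(byte[] bytes) throws IOException {
            out.write(bytes);
            bytesWritten += bytes.length;
        }
    }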
The only operation in PDF that is (almost) forced to be done as a
post-process is writing out linearised ("fast web view" or "web
optimized") PDF. That's because web-optimized PDF must have a partial
xref table and the trailer dictionary near the *start* of the file. It's
actually still possible to create linearised PDF by streaming it out in
a single pass, but you need to know more in advance about what you'll be
writing out, so in practice it's much simpler to linearise by
post-processing.
> Also, the requirements aren't clear here, what is it we want here? Let
> me validate that, this shouldn't change the (I guess we can call it)
> "canonical" PDF document. By that I mean if you rasterized a PDF
> before and after this change they should be identical,
> pixel-for-pixel.
I agree.
> When Acrobat does the font de-duping (I don't
> remember how much control it gives you, but if there are levels of
> de-duping I would have chosen the most aggressive), the documents
> aren't identical.
That's because it's actually substituting fonts, replacing one font with
another with non-identical metrics. That's not what I want to do, I want
to *merge* overlapping subsets of fonts with identical metrics. Since
the font dictionary gives the metric information it's practical to do
this. If fonts don't have the same metrics, you don't de-dupe them
because they're not duplicates.
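Something along these lines would be enough to decide whether two
embedded simple fonts really are subsets of the exact same font (a
sketch only; the field names mirror the font dictionary and
FontDescriptor entries, the Java types and class names are mine):

    import java.util.Arrays;
    import java.util.Map;

    // Plain data holder for what we'd pull out of an embedded font's
    // dictionary; sketch only.
    final class EmbeddedFontInfo {
        String baseFontName;          // e.g. "ABCDEF+HelveticaNeueSans"
        Map<Integer, Double> widths;  // char code -> width from /Widths
        double ascent, descent, capHeight, italicAngle;
        double[] fontBBox;
        byte[] glyphDataDigest;       // optional: digest of the font program

        // Strip the "ABCDEF+" subset tag so subsets of the same font match.
        String canonicalName() {
            return baseFontName.matches("^[A-Z]{6}\\+.*")
                    ? baseFontName.substring(7) : baseFontName;
        }
    }

    final class DuplicateFontCheck {
        // True only when the two look like subsets of the exact same font:
        // same name, same descriptor metrics, and the same width for every
        // character code they both cover.
        static boolean sameUnderlyingFont(EmbeddedFontInfo a, EmbeddedFontInfo b) {
            if (!a.canonicalName().equals(b.canonicalName())) return false;
            if (a.ascent != b.ascent || a.descent != b.descent
                    || a.capHeight != b.capHeight
                    || a.italicAngle != b.italicAngle) return false;
            if (!Arrays.equals(a.fontBBox, b.fontBBox)) return false;
            for (Map.Entry<Integer, Double> e : a.widths.entrySet()) {
                Double other = b.widths.get(e.getKey());
                if (other != null && !other.equals(e.getValue())) return false;
            }
            // For the truly paranoid: also compare glyph data checksums.
            return a.glyphDataDigest == null || b.glyphDataDigest == null
                    || Arrays.equals(a.glyphDataDigest, b.glyphDataDigest);
        }
    }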
"Optimizing" a PDF by substituting one font for another is a completely
different and much bigger job. Replacement of one font with another
non-identical font is a different job that may require rewriting of
content streams (for encoding differences), the production of multiple
font dictionaries with different encodings to remap different content
streams to use one font file, etc. It's hairy and complicated and I
don't want to go there.
> There are aberrations caused by slight kerning
> differences between various versions of Arial. This may seem trivial
> when compared to bloated PDFs, but it looks tacky and lowers the high
> standard of documents.
If the metrics don't match, they're not the same font and they don't get
merged. The glyph metrics in the font dictionary should be sufficient to
handle this.
Having three partial subsets of Arial in a document, each slightly
different versions with slightly different metrics, is something I can
live with. The problem arises when you have 10 different
mostly-overlapping subsets of the *exact* *same* *glyph* *data* from
each of those, leaving you with *30* small-ish copies of Arial instead
of 3 slightly larger ones.
> The other issue is you have subset fonts created by FOP as well as
> those imported by the pdf-image-plugin. You'd have to create some
> bridge between the image loading framework and the font loading system
> *cough* HACK *cough*.
Only if you want to handle de-dupe between fop-loaded fonts and fonts
loaded from input PDFs. I don't think that's particularly vital, but it
might not be as bad as you think either.
The font matching and subset merging system required for pdf-image to
de-dupe fonts would have to track glyph metrics, font names, etc for
every font seen, and would need to accumulate information on needed
glyphs, etc until the end of output generation just before the xref is
written. Fop must maintain used-glyph information as it stands, and
already knows glyph metrics, so it's entirely practical for it to report
that into the same system. From there, it's not too much of a stretch to
see pdf-image recognising that fop is going to embed a font with the
same name and metrics already and just merging its required-glyph list
with fop's before fop generates the subset.
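If it helps, the bridge might be nothing more than a shared registry
along these lines (pure speculation on my part, none of these types
exist in fop today; FontKey as in the earlier sketch):

    import java.util.Collection;
    import java.util.Set;
    import java.util.SortedSet;

    // Speculative interface only.
    interface GlyphRequirementRegistry {

        // fop's own text layout reports the glyphs it has used.
        void requireForFopFont(FontKey font, Set<Integer> glyphIds);

        // fop-pdf-image reports the glyphs used by a subset it is
        // importing from an input PDF, along with the embedded font
        // program so identity can be verified.
        void requireForImportedSubset(FontKey font, Set<Integer> glyphIds,
                                      byte[] embeddedFontProgram);

        // Called once, just before the xref is written: every font and
        // every glyph requirement is now known, so requirements against
        // the *same* font can be merged into a single subset.
        Collection<MergedFontSubset> flush();
    }

    final class MergedFontSubset {
        final FontKey font;
        final SortedSet<Integer> glyphIds;

        MergedFontSubset(FontKey font, SortedSet<Integer> glyphIds) {
            this.font = font;
            this.glyphIds = glyphIds;
        }
    }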
That's a significantly bigger project, though. Just being able to merge
completely redundant glyph subsets where the glyph data and metrics are
exactly identical between partially overlapping subsets being loaded by
fop-pdf-image would be a nice start.
The best thing about all this is that it's practical to do it progressively.
> Alternatively, just thinking aloud here, if this
> was done as a post-process *wink* *wink* *wry smile*...
While it can be done in post-process, I'm really not convinced it's
necessary. FOP handles image scaling and resampling - why don't we do
that in post-process, too? Just generate a monstrously huge PDF full of
uncompressed images, then re-sample later?
The answer seems to be because it's practical to do it in one pass, it's
nicer for users, and it works well.
Why does fop have font subsetting support? Subsetting can be done in
post-process, all you have to do is read the content streams and
determine which glyphs are used, then rewrite the font. It's done in a
single pass because it's *much* easier to implement that way, when fop
already knows the glyphs it's used. Same deal: it could be done in a
post pass, but it isn't because it doesn't make sense to do so.
Font replacement and the substitution of non-identical fonts should be
done in post, because it's not practical to do them in a way that's
going to be easy, reliable and automatic, nor are there any obvious
correct choices. We don't know if the document designer wants to replace
their own copy of Helvetica with Adobe's multi-master version. On the
other hand, it's pretty bloody obvious that the user won't want 100
copies each of "abcdefg...." glyphs from Helvetica LT Std that are
*exactly* *the* *same* when they can have just one copy of each with no
effect on document display.
> Apologies if I may seem to be argumentative here, it's not my
> intention, but I feel this would be serious scope creep. I see the
> pdf-image-plugin as a plugin that treats PDFs as images, nothing more.
> If you want to stitch together PDFs, PDFBox is designed just for that.
The trouble is that fop-pdf-image exists because PDFs aren't just
images. If they were, it'd be much easier to just rasterise them and
import them in raster form.
FWIW, I'm not trying to use fop to "stitch together PDFs" - not in the
sense of trying to use it to append, n-up, impose, etc complex PDF
documents. I'm using small PDFs that are basically "images" - but
represented as a combination of raster, text and bitmap data that should
be included in the output document as efficiently as possible and
without loss of fidelity. IOW, exactly what fop-pdf-image is for.
--
Craig Ringer