On 11/11/2010 20:35, Jeremias Maerki wrote:
Thanks for the detailed explanation. I think I follow what you mean.
If I understand correctly, then even when we fully embedded the CID TTF it
would not have been extractable, in the same way that a subsetted font is
meaningless when extracted? If this is true then clearly there is little
value in making this configurable without also adding the extra tables
you mention above, which I am guessing is a lot of work and probably not
I fully understand the desire to install the font on a PostScript
printer to keep the PS files smaller. To answer your question: I did not
ask for the business use case. The problem I'm struggling with in this
context is how to know about the CID meaning of the font, i.e. the
multi-byte encoding of the font.
When we do subsets in FOP, we re-index the glyphs starting with index 1
(or 3) by occurrence in the document. Only FOP knows which Unicode
character is represented by which CID. That's why we need the ToUnicode
CMap in PDF. Without it, a reader could not map the re-indexed glyphs
back to Unicode, and text extraction would be unreliable at best.
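
To make the re-indexing concrete, here is a minimal sketch in Python (not
FOP's actual Java code). The function names build_subset and
to_unicode_bfchar are hypothetical, and the sketch assumes glyphs are
numbered purely by first occurrence, starting at 1.

```python
def build_subset(text, first_index=1):
    """Assign new glyph indices (CIDs) by order of first occurrence."""
    char_to_cid = {}
    for ch in text:
        if ch not in char_to_cid:
            char_to_cid[ch] = first_index + len(char_to_cid)
    return char_to_cid


def to_unicode_bfchar(char_to_cid):
    """Render the mapping as one bfchar block of a ToUnicode CMap.

    Assumption: at most 100 entries, so a single block is enough.
    A real writer would split larger mappings into several blocks.
    """
    lines = [f"{len(char_to_cid)} beginbfchar"]
    for ch, cid in sorted(char_to_cid.items(), key=lambda kv: kv[1]):
        # Destination values are UTF-16BE, written as hex.
        unicode_hex = ch.encode("utf-16-be").hex().upper()
        lines.append(f"<{cid:04X}> <{unicode_hex}>")
    lines.append("endbfchar")
    return "\n".join(lines)


mapping = build_subset("hello")
print(to_unicode_bfchar(mapping))
```

The point of the sketch is that the mapping exists only in the producer:
the same font subset built from a different document would assign
different indices to the same characters.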
In single-byte mode, the whole font is embedded (right now probably with
the same problems I've just fixed with rev1034094 for the TTF subset).
In this mode the Adobe character names map into the font, so 8-bit
encodings can be built to properly address the right characters even if
the font is not embedded. That's also how we currently do referenced TTF
fonts for PDF output.
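
As an illustration of the single-byte case, the following Python sketch
builds an 8-bit encoding from glyph names. It is not how FOP implements
it: GLYPH_NAMES is a tiny hand-picked excerpt of Adobe glyph names, and
build_encoding and the starting code 32 are assumptions made for the
example.

```python
# Small excerpt of character -> Adobe glyph name (illustration only).
GLYPH_NAMES = {
    " ": "space",
    "A": "A",
    "\u00e9": "eacute",
    "\u20ac": "Euro",
}


def build_encoding(chars, first_code=32):
    """Assign 8-bit codes to characters and return (encoding, codes).

    encoding is a 256-slot list of glyph names, codes maps each
    character to the byte value that selects it.
    """
    encoding = [".notdef"] * 256
    codes = {}
    next_code = first_code
    for ch in chars:
        if ch in codes:
            continue
        if next_code > 255:
            raise ValueError("more than one 8-bit encoding needed")
        encoding[next_code] = GLYPH_NAMES[ch]
        codes[ch] = next_code
        next_code += 1
    return encoding, codes


encoding, codes = build_encoding("A\u00e9A\u20ac")
print(codes)
```

Because each slot names a glyph rather than an index, the encoding stays
meaningful for any copy of the font that carries those glyph names,
embedded or not.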
If we fully embed the font as a CID font, we currently lose the
knowledge about which index represents which Unicode character.
Combining the font with a suitable CMap resolves the problem, but at the
moment we only use Identity-H, which maps each 2-byte code 1:1 to the
same CID. One solution would
be to turn the Unicode "cmap" table in the TrueType font into a custom PS
CMap and then use 16-bit Unicode characters directly. FOP currently
doesn't support that.
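
A rough sketch of that idea, in Python and under stated assumptions: the
"cmap" table has already been parsed into a dict from Unicode code point
to glyph index, only BMP code points are handled, and the function name
cidchar_blocks is hypothetical. It emits only the cidchar part of a CMap;
the header, codespacerange and trailer are left out.

```python
def cidchar_blocks(cmap, block_size=100):
    """Emit begincidchar/endcidchar blocks for a code point -> glyph map.

    Entries are split into blocks of at most block_size lines, since
    CMap blocks are limited to 100 entries each.
    """
    entries = sorted((cp, gid) for cp, gid in cmap.items() if cp <= 0xFFFF)
    out = []
    for start in range(0, len(entries), block_size):
        block = entries[start:start + block_size]
        out.append(f"{len(block)} begincidchar")
        # Source code is the 16-bit Unicode value, target is the CID.
        out.extend(f"<{cp:04X}> {gid}" for cp, gid in block)
        out.append("endcidchar")
    return "\n".join(out)


print(cidchar_blocks({0x41: 36, 0x20AC: 188}))
```

With such a CMap the PS stream could carry 16-bit Unicode values directly
and the full, unmodified font could be shared or preinstalled.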
What about Type1 fonts? Do we always embed them fully, and can they
be extracted for re-use?