At 18:50 20/09/2017 +0200, David Kastrup wrote:
> Did you get to see the PostScript files before conversion with pstopdf?
> Would being able to generate those differently make a difference?
I'm pretty sure Knut sent me everything, really everything. Not that I can
use it all, but it's nice to have the complete set just in case.
The problem (for my idea) is not the generation of the individual
PostScript files, or the individual PDF files. However, there is some more
information on the process at the end of this mail which is (slightly)
illuminating; feel free to skip ahead past this explanation.
---------------------------------------------------------------------------------
What I was hoping to do (and this works for my test cases with simpler
fonts) was create the PDF files from the PostScript with only font
references, no font data embedded. Then create the final PDF, still with no
font data. Finally run that back through Ghostscript with the font
available to it. Then the individual uses of the font would pick up the one
and only font available, referenced from Ghostscript, and embed it.
That would (and does for my tests) create a final PDF file with only one
instance of the font.
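As a rough sketch of that last step only, the command line could be assembled like this. The -sDEVICE=pdfwrite, -sFONTPATH and -sOutputFile switches are standard Ghostscript options; the function name, the file names and the font directory are hypothetical, and nothing here shows how the font-free intermediate PDF files are produced.

```python
from typing import List


def build_reembed_command(input_pdf: str, output_pdf: str, font_dir: str) -> List[str]:
    """Return a Ghostscript argv that rewrites input_pdf through pdfwrite.

    font_dir is a hypothetical directory holding the single font copy that
    Ghostscript should pick up for every unembedded font reference.
    """
    return [
        "gs",
        "-dBATCH", "-dNOPAUSE",          # run non-interactively
        "-sDEVICE=pdfwrite",             # write a new PDF file
        f"-sFONTPATH={font_dir}",        # extra directory searched for fonts
        f"-sOutputFile={output_pdf}",
        input_pdf,
    ]
```

The list can be handed to subprocess.run() unchanged; building it as a list avoids any shell quoting problems with file names.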
The problem is that supporting non-PostScript fonts from disk as
replacements for PostScript fonts is tricky: it involves a certain amount
of guesswork to fill in missing information. Our support for TrueType fonts
isn't bad, but our support for OTF fonts (those with CFF outlines) isn't as
good. Also, the nature of this font makes the guesswork rather more
difficult, since it is mostly a 'symbolic' font.
So basically that won't work, at least as things stand now.
---------------------------------------------------------------------------------
> Those 125GB files, I wager, are for one-time printing or further
> compression, not for public download from a website. So the comparison
> is not entirely fair.
Well that one's anomalous, certainly, but we do have people passing around
multi-gigabyte files for download. Also, the last game I picked up was
20GB, and that was a download only.
But, not important as I think I said.
Now, during the investigation of the files Knut sent me I did notice a few
things.
From what I understand of the process, the intention is that the entire
font is downloaded with each of the individual EPS files, and then the PDF
file which is created should contain the entire font (I'm fairly sure
someone said this). Then the individual PDF files are merged together in
TeX, presumably along with some other text, producing a PDF file where
there are multiple, identical, full copies of the font. You then take
advantage of the Ghostscript bug to treat all the copies of the font as
being the same.
I'm sorry to disappoint you, but that's not what is happening.
If the process were happening as described, then I believe mutool would be
quite able to detect the duplicate font streams in the final PDF file and
remove them. The reason that doesn't work is that the fonts embedded in
the individual PDF files are not complete; they are subsets. Worse still,
they don't have subset prefixes on the font name, so it's not even clear
they are subsets.
For example, Knut sent me a bunch of EPS files and the PDF files created
from them, called testa-1.eps to teste-1.eps. Looking at the EPS files I see:
%%IncludeResource: ProcSet (FontSetInit)
%%BeginResource: FontSet (Emmentaler-20)
/FontSetInit /ProcSet findresource begin
%%BeginData: 64933 Binary Bytes
The binary data that follows looks the same to me, though I haven't
bothered to check precisely. All the EPS files appear to contain the same
data, so I'll assume that's a complete copy of the font. Note the size,
just short of 65KB.
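Checking precisely would not take much. A minimal sketch, assuming the font data is the first block introduced by a '%%BeginData: N Binary Bytes' comment and that the N bytes start right after that comment's line ending; the function name is made up for illustration.

```python
import hashlib
import re

# Matches e.g. b"%%BeginData: 64933 Binary Bytes" plus its line ending.
_BEGIN_DATA = re.compile(
    rb"%%BeginData:\s*(\d+)\s+Binary\s+Bytes[^\r\n]*(\r\n|\r|\n)"
)


def font_data_digest(eps_bytes: bytes) -> str:
    """Return the SHA-256 hex digest of the first binary %%BeginData block."""
    match = _BEGIN_DATA.search(eps_bytes)
    if match is None:
        raise ValueError("no '%%BeginData: N Binary Bytes' comment found")
    length = int(match.group(1))
    start = match.end()                    # first byte after the comment line
    data = eps_bytes[start:start + length]
    if len(data) != length:
        raise ValueError("file is shorter than the declared data length")
    return hashlib.sha256(data).hexdigest()
```

Reading each of testa-1.eps to teste-1.eps in binary mode and comparing the digests would show whether the embedded data really is byte-for-byte identical.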
But, looking at the PDF files, I see quite different results.
testa-1.pdf:
9 0 obj
<<
/BaseFont /Emmentaler-20
/FontDescriptor 10 0 R
/Type /Font
/FirstChar 7
/LastChar 176
/Widths [ 641 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
490 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 424 ]
/Encoding 18 0 R
/Subtype /Type1
>>
endobj
10 0 obj
<<
/Type /FontDescriptor
/FontName /Emmentaler-20
/FontBBox [ 0 -635 645 1196 ]
/Flags 4
/Ascent 1196
/CapHeight 1196
/Descent -635
/ItalicAngle 0
/StemV 96
/MissingWidth 500
/FontFile3 17 0 R
>>
endobj
17 0 obj
<<
/Length 9653
/Subtype /Type1C
>>
stream
testb-1.pdf:
10 0 obj
<<
/BaseFont /Emmentaler-20
/FontDescriptor 11 0 R
/Type /Font
/FirstChar 7
/LastChar 176
/Widths [ 641 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 344 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 424 ]
/Encoding 20 0 R
/Subtype /Type1
>>
endobj
11 0 obj
<<
/Type /FontDescriptor
/FontName /Emmentaler-20
/FontBBox [ 0 -635 645 1196 ]
/Flags 4
/Ascent 1196
/CapHeight 1196
/Descent -635
/ItalicAngle 0
/StemV 96
/MissingWidth 500
/FontFile3 19 0 R
>>
endobj
19 0 obj
<<
/Length 9708
/Subtype /Type1C
>>
stream
As you can see, the two font streams (which have been decompressed, so
there are no compression differences) are different lengths, and both are
shorter than the original, *much* shorter. Also, although the FirstChar and
LastChar entries in the fonts are the same, the entries in the Widths array
are different.
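To make the Widths comparison concrete, here is a small sketch that turns FirstChar plus a Widths array into the set of character codes with a non-zero width. Treating a zero width as "glyph not in this subset" is an assumption that happens to fit these files, not a general PDF rule, and the arrays below are shortened, made-up stand-ins for the real listings.

```python
from typing import List, Set


def codes_with_width(first_char: int, widths: List[int]) -> Set[int]:
    """Return the character codes whose /Widths entry is non-zero."""
    return {first_char + i for i, w in enumerate(widths) if w != 0}


# Shortened, made-up arrays in the spirit of the two listings above.
font_a = codes_with_width(7, [641, 0, 490, 0, 424])
font_b = codes_with_width(7, [641, 0, 0, 344, 424])

only_in_a = font_a - font_b   # codes that only the first font covers
only_in_b = font_b - font_a   # codes that only the second font covers
```

Two complete copies of the same font would give empty difference sets; non-empty ones mean the two files carry different glyph selections.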
In short, the fonts are not complete, they have been subset. And, as I
noted above, even worse is the fact that the font names are not decorated
with a subset prefix.
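For reference, the PDF specification marks a subset by prefixing the font name with a tag of six uppercase letters and a plus sign, for example ABCDEF+Emmentaler-20. A minimal check for that convention (the function name is hypothetical) could look like:

```python
import re

# Six uppercase letters, a plus sign, then the real font name.
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+.")


def has_subset_prefix(font_name: str) -> bool:
    """Return True if font_name carries a PDF subset tag such as 'ABCDEF+'."""
    name = font_name.lstrip("/")   # accept the name with or without the slash
    return _SUBSET_TAG.match(name) is not None
```

Applied to the /BaseFont and /FontName values shown above, this returns False for both files, which is exactly why nothing in the names reveals the subsetting.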
Now, the lack of a prefix does mean that, in your particular case, you can
take advantage of the Ghostscript bug which treats fonts with the same name
as being the same.
Actually, even if we hadn't moved to using the PDF object number to
uniquely identify fonts, your approach was going to have a limited
lifespan. Here's the explanation, which you can skip over if you like; it's
not hugely important.
---------------------------------------------------------------------------------
The reason is that this is precisely the problem we've been working towards
solving for some time. Originally Ghostscript simply used the FontName as a
unique identifier (because in PostScript that's how it works). If we saw
two uses of the same FontName we could be sure they were the same font.
But for PDF that doesn't work. It's entirely possible to have multiple fonts
with the same name in PDF, because the PDF file doesn't reference them by
name, it references them by object number (it is, in fact, possible to have
fonts which don't have a name at all in PDF files).
The upshot of this is that pdfwrite was seeing two different fonts, and
erroneously assuming they were the same. That meant we never bothered with
the second font, and so carried on with the first one. The problem is that
if the two fonts had different glyphs at the same position, then the final
PDF output would be wrong. We've had a number of examples of this over the
years. It's never a problem with a single file input, or with input other
than PDF, but if people passed multiple PDF files as input to pdfwrite,
then this could occur. If need be I can dig up some of the bug reports.
Now we knew a long time back that the way to tackle this was to use the
object numbers, because in PDF these *are* unique, together with the input
filename (in case two files should have fonts with the same name *and* the
same object numbers). But that was always going to be a big job, so in the
interim we kept on adding more heuristics to look at the properties of two
fonts and decide whether they were the same or not.
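The difference between the two identification schemes can be illustrated with a toy model. This is only a sketch of the idea, not Ghostscript's actual data structures: the PdfFont type, the file names a.pdf and b.pdf, and the glyph names are all invented for the example.

```python
from typing import Callable, Dict, Hashable, List, NamedTuple


class PdfFont(NamedTuple):
    source_file: str
    object_number: int
    font_name: str
    glyphs: Dict[int, str]   # character code -> glyph name


def collect(fonts: List[PdfFont], key: Callable[[PdfFont], Hashable]) -> dict:
    """Keep the first font seen for each key; later ones count as duplicates."""
    seen: dict = {}
    for font in fonts:
        seen.setdefault(key(font), font)
    return seen


# Two different subsets sharing a name and, by coincidence, an object number.
fonts = [
    PdfFont("a.pdf", 9, "Emmentaler-20", {60: "glyph.one"}),
    PdfFont("b.pdf", 9, "Emmentaler-20", {60: "glyph.two"}),
]

by_name = collect(fonts, lambda f: f.font_name)
by_object = collect(fonts, lambda f: f.object_number)
by_file_and_object = collect(fonts, lambda f: (f.source_file, f.object_number))
```

Keyed by name, or by object number alone, the second font is silently dropped and code 60 would be drawn with the first font's glyph. Keyed by file name plus object number, both fonts survive.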
Sooner or later that process was going to trip you up, because we would
have added a heuristic which identified your fonts as being different,
which would have had the same effect as using the PDF object numbers does.
Not only that, but we really wouldn't have had any option to restore the
old behaviour because, as I've said, it's really a bug.
---------------------------------------------------------------------------------
So, what to do.....?
Well, it occurs to me that the *real* problem here is that the fonts in the
individual PDF files are subsets. If they were not, then I believe you
could safely and easily use MuPDF (specifically mutool clean) to remove the
duplicate fonts. Or at least, the duplicate FontFile streams, I'm not
certain if the Font and FontDescriptor objects would be possible to remove
as well. But that would certainly cover a good portion of the file size:
the fonts are running at about 9KB each, while the Font and FontDescriptor
objects are a few tens of bytes.
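Why subsets defeat this kind of clean-up can be shown with a content-hash sketch. This is an assumption about how duplicate detection could work in principle, not a description of what mutool does internally; the stream lengths are the ones quoted earlier, but the byte contents are dummies.

```python
import hashlib
from typing import Dict, List


def unique_streams(streams: List[bytes]) -> Dict[str, bytes]:
    """Map each distinct stream content (by SHA-256) to one retained copy."""
    unique: Dict[str, bytes] = {}
    for data in streams:
        unique.setdefault(hashlib.sha256(data).hexdigest(), data)
    return unique


# Five files each embedding the same complete font: one copy survives.
full_font = b"F" * 64933
full_copies = unique_streams([full_font] * 5)

# Two differing subsets (dummy bytes, real lengths): nothing can be merged.
subset_a = b"A" * 9653
subset_b = b"B" * 9708
subset_copies = unique_streams([subset_a, subset_b])
```

With complete fonts the five streams collapse to one; with subsets every stream is distinct, so each one has to stay in the file.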
So the question then becomes 'why are the fonts subset?' That's a really
good question, and the answer is that I don't know. It's possible that there
is a genuine pdfwrite bug here. The piece of information I'm missing is the
step used to create the PDF files from the EPS files; I don't know how you
are doing that.
My attempts to replicate the individual PDF files have been entirely
unsuccessful: I get files with three copies of the Emmentaler font embedded
instead of one, and none of the three fonts match the ones in the PDF files
Knut supplied.
Hmm, actually, going back to the 9.21 release does produce at least similar
behaviour, whereas the 9.22 release does not. In 9.22 I get three fonts
output instead of 1. I've no idea why currently, and right at the moment I
don't have time to look.
I'll try and remember to look at it when I am not drowning under support,
but it looks like there have been changes in this area unrelated to the
PDFDontUseObjectNum bug, and that in itself may mean that your process
doesn't work any more, or works less well.
Ken
_______________________________________________
lilypond-devel mailing list
lilypond-devel@gnu.org
https://lists.gnu.org/mailman/listinfo/lilypond-devel