On Nov 8, 2011 1:03 PM, "Frank Klotz" <[email protected]
<mailto:[email protected]>> wrote:
On 11/07/2011 10:55 AM, Justin Lebar wrote:
Looking at a ccache with about 40,000 .o files in it
(created with direct
mode turned on); of the 55 largest files, I found 11 pairs
and one triplet
of identical object files. That's almost 25% of redundant
storage that
could have been avoided by looking at the preprocessed
hash when there is no
hit in direct mode.
It's much more interesting to look at the whole cache, I think.
$ find -name '*.o' -type f | wc -l
39312
jlebar@turing:~/.ccache$ find -name '*.o' -type f | xargs -P16
sha1sum
| cut -f 1 -d ' ' | sort | uniq -d | wc -l
1230
So it looks like there's some duplication on my machine, but not a
ton. I'd be curious if you got significantly different numbers.
Hi Justin,
Here's what I got:
> find -name '*.o' -type f | wc -l
43507
> find -name '*.o' -type f | xargs -P16 sha1sum | cut -f 1 -d ' '
| sort | uniq -d | wc -l
13087
I would say the difference is significant - you got 3%, while I
got 30%.
This is in a project environment where we have fairly fast churn
in a number of widely-used header files, and I am guessing that
the changes in those files often consist of addition of or changes
to macros and constants that are not used in all the files where
the header files get included.
Now I am more than ready to agree that there is room for
improvement in the design and implementation of the header file
layouts here, and we are working on that; but at the same time it
looks to me like a fairly straightforward enhancement to ccache
could give us some performance boost here. Since ccache already
knows how to do preprocessed hashes, I think all I am looking to
do is to have it use that existing code when the direct mode hash
doesn't get a hit. Then put both hash names on the resulting
object file, and voila: better hit rates and more unique files in
the cache.
It would be interesting to know if many others see anything at all
similar to the numbers I have here (i.e., lots of people could use
the enhancement I am suggesting), of if there is something unique
about this environment that leads to this much duplication in the
cache.
Thanks,
Frank
On Mon, Nov 7, 2011 at 12:49 PM, Frank Klotz
<[email protected]
<mailto:[email protected]>> wrote:
Hi Martin,
Thanks for your responses.
s/index hash/direct mode hash/g
Apologies - I had a brain burp and was using the wrong
terminology.
That aside, however, with the advent of direct mode,
there ARE two hashes
possible for any given object file - the direct mode hash
(hashing all the
sources that go into the compilation) and the preprocessed
hash (hashing the
result of running all those sources through the
preprocessor). And any time
there is a cache miss, ccache has computed both those
hashes, hasn't it?
(Or maybe not - if not, see discussion below.) And it
appears to me that
in many cases, the resulting object file occurs twice in
the cache, once
under each hash. And currently, those two occurrences are
two separate
files, which could be combined into a single inode with
two hard-linked
directory entries.
Or am I confused about how direct mode interacts with
preprocessed mode? If
running in direct mode, does ccache never compute the
preprocessed hash? If
not, it obviously could, and I would recommend that it
should. Why?
Because when changes are made to a widely-used header
file, it very
commonly occurs that those changes only actually modify
the preprocessor
output of a small subset of the sources that include that
header file, while
many other sources don't use the changes (say, definition
of new macros or
new constants), so end up with the same preprocessed
output, and the same
object file, even though the input header files and direct
mode hash did
change). In that case, ccache could still find hits in
the cache with the
preprocessed mode, even if it's a miss with the direct
mode hash. If ccache
does not get a direct mode hit, it certainly will have to
RUN the
preprocessor to recompile the file - how much extra cost
to compute the
preprocessed hash, look it up (to avoid recompiling if it
is found with THAT
hash), and if a compile is still needed, store the
resulting object file
inode with 2 directory entries rather than just one?
The way I read the doc about how direct mode works, I
thought it would
compute the direct mode hash, and if no hit, "fall back to
preprocessed
mode". I thought that meant it would compute the
preprocessed hash and look
for that too. Is that incorrect - does it only compute
ONE hash in all
cases - a direct mode hash if running in direct mode and a
preprocessed hash
if not in direct mode? If so, then let's modify my
suggested enhancement to
be that in direct mode, calculate and use the preprocessed
hash whenever
there is no hit with direct mode, and create hard links
using all computed
hashes to the one single object file inode that eventually
exists in the
ccache. I don't think direct mode and preprocessed mode
HAVE to be mutually
exclusive - when direct mode gets a miss, preprocessed
mode can still often
provide a hit.
And if no preprocessed hash gets computed/stored when
running in direct
mode, then I suspect that the reason I see so many pairs
of identical object
files in my ccache is because of the situation I describe
above, where a
header file change has triggered a direct mode hash miss,
but preprocessing
the sources has resulted in an identical preprocessed file
which was then
passed to the compiler which produced an identical object
file. But ccache
didn't KNOW that they were identical, because it didn't
compute the
preprocessed hash.
Looking at a ccache with about 40,000 .o files in it
(created with direct
mode turned on); of the 55 largest files, I found 11 pairs
and one triplet
of identical object files. That's almost 25% of redundant
storage that
could have been avoided by looking at the preprocessed
hash when there is no
hit in direct mode.
Thanks,
Frank
On 11/07/2011 12:53 AM, Martin Pool wrote:
On 5 November 2011 11:12, Frank
Klotz<[email protected]
<mailto:[email protected]>>
wrote:
I used ccache at my previous employer, and was
very convinced of its
value.
Now that I have started a new job, I am in the
process of trying to
bring
the new shop on board with ccache, so I have been
doing lots of test runs
and looking at things. Here is one thing I am
thinking could add some
value.
Looking through the ccache, I find many pairs of
files which have
different
names (different hashes), but exactly identical
content. This actually
makes sense, as each file would have an index hash
and a preprocessed
hash,
and since ccache needs to be able to find a match
on either, then both
need
to be in the cache.
What is an index hash?
(Actually, thinking about it, I'm a little surprised
that there are any files in the ccache that DON'T
appear twice -
shouldn't
EVERY compilation have 2 hashes?)
I don't understand why you would expect that.
It seems like you expect there is another indirection
layer by which
ccache tries to find jobs that produce identical
output. I don't
think there is one at present. I don't think this
would happen very
often in reality, except perhaps for trivial cases
like compiling
empty files, and that's not so important to
accelerate, and it will
not use up much disk space.
If you're getting duplicated cache files due to for
instance doing
builds in different directories or from different
trees that produce
identical output you could change the ccache options
to make it less
stringent.
But it seems to me that it would make a lot of
sense to store the data of
these 2 files only once, by hard-linking the 2
names to the same inode.
(For filesystems that support hard links, of
course!) Every time ccache
does an actual compilation and stores a file in
the cache, it should
store
it under hard links for BOTH hashes - the indexed
hash and the
proprocessed
hash. And if it gets a hash miss on the indexed
hash but a hit on the
preprocessed hash, then it should add the missed
index hash as a hard
link
to the file found. So a given file (inode) in the
cache could actually
be
referenced by MANY directory entries: one
preprocessed hash, and multiple
index hashes for various different combinations of
source files and
header
files which end up producing the same output when
passed through the
preprocessor.
This mail is the first time google has heard of
"ccache indexed hash"...
This could increase the storage efficiency of the
ccache.
Of course, since not every filesystem supports
hard links, the simplest
solution was of course just to have multiple file
copies. So I guess
adding
code to do this would require some way to
determine if the filesystem the
cache is on can in fact support hardlinks.
If you think this sounds like a good idea, but
don't have bandwidth to do
it, I would be willing to give it a try. Any
hints on where to start
would
of course be welcome.
Thanks,
Frank Klotz
_______________________________________________
ccache mailing list
[email protected] <mailto:[email protected]>
https://lists.samba.org/mailman/listinfo/ccache
_______________________________________________
ccache mailing list
[email protected] <mailto:[email protected]>
https://lists.samba.org/mailman/listinfo/ccache