> Looking at a ccache with about 40,000 .o files in it (created with direct
> mode turned on); of the 55 largest files, I found 11 pairs and one triplet
> of identical object files.  That's almost 25% of redundant storage that
> could have been avoided by looking at the preprocessed hash when there is no
> hit in direct mode.

It's much more interesting to look at the whole cache, I think.

  $ find -name '*.o' -type f | wc -l
  39312
  jlebar@turing:~/.ccache$ find -name '*.o' -type f | xargs -P16 sha1sum | cut -f 1 -d ' ' | sort | uniq -d | wc -l
  1230

So it looks like there's some duplication on my machine, but not a ton.
I'd be curious if you got significantly different numbers.

On Mon, Nov 7, 2011 at 12:49 PM, Frank Klotz <[email protected]> wrote:
> Hi Martin,
>
> Thanks for your responses.
>
> s/index hash/direct mode hash/g
>
> Apologies - I had a brain burp and was using the wrong terminology.
>
> That aside, however, with the advent of direct mode, there ARE two hashes
> possible for any given object file - the direct mode hash (hashing all the
> sources that go into the compilation) and the preprocessed hash (hashing the
> result of running all those sources through the preprocessor).  And any time
> there is a cache miss, ccache has computed both those hashes, hasn't it?
> (Or maybe not - if not, see discussion below.)  And it appears to me that
> in many cases, the resulting object file occurs twice in the cache, once
> under each hash.  And currently, those two occurrences are two separate
> files, which could be combined into a single inode with two hard-linked
> directory entries.
>
> Or am I confused about how direct mode interacts with preprocessed mode?  If
> running in direct mode, does ccache never compute the preprocessed hash?  If
> not, it obviously could, and I would recommend that it should.  Why?
> Because when changes are made to a widely-used header file, it very
> commonly occurs that those changes only actually modify the preprocessor
> output of a small subset of the sources that include that header file, while
> many other sources don't use the changes (say, definition of new macros or
> new constants), so end up with the same preprocessed output, and the same
> object file, even though the input header files and direct mode hash did
> change.
> In that case, ccache could still find hits in the cache with the
> preprocessed mode, even if it's a miss with the direct mode hash.  If ccache
> does not get a direct mode hit, it certainly will have to RUN the
> preprocessor to recompile the file - how much extra cost to compute the
> preprocessed hash, look it up (to avoid recompiling if it is found with THAT
> hash), and if a compile is still needed, store the resulting object file
> inode with 2 directory entries rather than just one?
>
> The way I read the doc about how direct mode works, I thought it would
> compute the direct mode hash, and if no hit, "fall back to preprocessed
> mode".  I thought that meant it would compute the preprocessed hash and look
> for that too.  Is that incorrect - does it only compute ONE hash in all
> cases - a direct mode hash if running in direct mode and a preprocessed hash
> if not in direct mode?  If so, then let's modify my suggested enhancement to
> be that in direct mode, calculate and use the preprocessed hash whenever
> there is no hit with direct mode, and create hard links using all computed
> hashes to the one single object file inode that eventually exists in the
> ccache.  I don't think direct mode and preprocessed mode HAVE to be mutually
> exclusive - when direct mode gets a miss, preprocessed mode can still often
> provide a hit.
>
> And if no preprocessed hash gets computed/stored when running in direct
> mode, then I suspect that the reason I see so many pairs of identical object
> files in my ccache is because of the situation I describe above, where a
> header file change has triggered a direct mode hash miss, but preprocessing
> the sources has resulted in an identical preprocessed file which was then
> passed to the compiler which produced an identical object file.  But ccache
> didn't KNOW that they were identical, because it didn't compute the
> preprocessed hash.
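
To make the lookup order Frank describes concrete, here is a toy in-memory
model of it. It is only a sketch of the proposal, not ccache's actual code or
data layout: the class ToyCache, the digest() helper, and the preprocess and
compile_ callbacks are all hypothetical names invented for illustration.

```python
import hashlib


def digest(*parts: bytes) -> str:
    """Hash a sequence of byte strings (hypothetical helper, not ccache's hash)."""
    h = hashlib.sha1()
    for part in parts:
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


class ToyCache:
    """Toy model of the proposed scheme: try the direct key, then the preprocessed key."""

    def __init__(self):
        self.by_direct = {}  # direct mode key -> object bytes
        self.by_cpp = {}     # preprocessed key -> object bytes
        self.compiles = 0    # number of real compilations performed

    def build(self, source, headers, preprocess, compile_):
        # 1. Direct mode: hash the source and every header as they are on disk.
        direct_key = digest(source, *headers)
        if direct_key in self.by_direct:
            return self.by_direct[direct_key], "direct hit"

        # 2. Direct miss: the preprocessor has to run anyway, so hash its output.
        cpp_out = preprocess(source, headers)
        cpp_key = digest(cpp_out)
        if cpp_key in self.by_cpp:
            obj = self.by_cpp[cpp_key]
            # Record the new direct key as a second name for the same object.
            self.by_direct[direct_key] = obj
            return obj, "preprocessed hit"

        # 3. Miss on both keys: compile once and store under both names.
        self.compiles += 1
        obj = compile_(cpp_out)
        self.by_cpp[cpp_key] = obj
        self.by_direct[direct_key] = obj
        return obj, "miss"
```

With a preprocess callback that drops unused header definitions, adding an
unused macro to a header changes the direct key but not the preprocessed key,
so the second build is answered from the cache without a compile.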
>
> Looking at a ccache with about 40,000 .o files in it (created with direct
> mode turned on); of the 55 largest files, I found 11 pairs and one triplet
> of identical object files.  That's almost 25% of redundant storage that
> could have been avoided by looking at the preprocessed hash when there is no
> hit in direct mode.
>
> Thanks,
> Frank
>
>
> On 11/07/2011 12:53 AM, Martin Pool wrote:
>>
>> On 5 November 2011 11:12, Frank Klotz <[email protected]> wrote:
>>>
>>> I used ccache at my previous employer, and was very convinced of its
>>> value.  Now that I have started a new job, I am in the process of trying
>>> to bring the new shop on board with ccache, so I have been doing lots of
>>> test runs and looking at things.  Here is one thing I am thinking could
>>> add some value.
>>>
>>> Looking through the ccache, I find many pairs of files which have
>>> different names (different hashes), but exactly identical content.  This
>>> actually makes sense, as each file would have an index hash and a
>>> preprocessed hash, and since ccache needs to be able to find a match on
>>> either, then both need to be in the cache.
>>
>> What is an index hash?
>>
>>> (Actually, thinking about it, I'm a little surprised that there are any
>>> files in the ccache that DON'T appear twice - shouldn't EVERY compilation
>>> have 2 hashes?)
>>
>> I don't understand why you would expect that.
>>
>> It seems like you expect there is another indirection layer by which
>> ccache tries to find jobs that produce identical output.  I don't
>> think there is one at present.  I don't think this would happen very
>> often in reality, except perhaps for trivial cases like compiling
>> empty files, and that's not so important to accelerate, and it will
>> not use up much disk space.
>>
>> If you're getting duplicated cache files due to, for instance, doing
>> builds in different directories or from different trees that produce
>> identical output, you could change the ccache options to make it less
>> stringent.
>>
>>> But it seems to me that it would make a lot of sense to store the data of
>>> these 2 files only once, by hard-linking the 2 names to the same inode.
>>> (For filesystems that support hard links, of course!)  Every time ccache
>>> does an actual compilation and stores a file in the cache, it should
>>> store it under hard links for BOTH hashes - the indexed hash and the
>>> preprocessed hash.  And if it gets a hash miss on the indexed hash but a
>>> hit on the preprocessed hash, then it should add the missed index hash as
>>> a hard link to the file found.  So a given file (inode) in the cache
>>> could actually be referenced by MANY directory entries: one preprocessed
>>> hash, and multiple index hashes for various different combinations of
>>> source files and header files which end up producing the same output when
>>> passed through the preprocessor.
>>
>> This mail is the first time Google has heard of "ccache indexed hash"...
>>
>>> This could increase the storage efficiency of the ccache.
>>>
>>> Of course, since not every filesystem supports hard links, the simplest
>>> solution was of course just to have multiple file copies.  So I guess
>>> adding code to do this would require some way to determine if the
>>> filesystem the cache is on can in fact support hardlinks.
>>>
>>> If you think this sounds like a good idea, but don't have bandwidth to do
>>> it, I would be willing to give it a try.  Any hints on where to start
>>> would of course be welcome.
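
One way to handle the "does this filesystem support hard links" question is
not to detect it up front at all: attempt the link and fall back to a copy if
it fails. The sketch below illustrates that idea only. The function name
store_under_keys, the flat directory layout, and the "<key>.o" naming are
assumptions made for the example, not how ccache lays out its cache.

```python
import os
import shutil


def store_under_keys(cache_dir, object_path, keys):
    """Make one object file reachable under every key, sharing an inode when possible.

    Hypothetical helper: `keys` would be e.g. [direct_hash, preprocessed_hash].
    Returns the list of cache paths, one per key.
    """
    if not keys:
        return []
    os.makedirs(cache_dir, exist_ok=True)
    paths = [os.path.join(cache_dir, key + ".o") for key in keys]

    # Reuse an entry that is already cached (the "preprocessed hit" case);
    # otherwise copy the freshly compiled object in exactly once.
    primary = next((p for p in paths if os.path.exists(p)), None)
    if primary is None:
        primary = paths[0]
        shutil.copyfile(object_path, primary)

    for path in paths:
        if os.path.exists(path):
            continue
        try:
            os.link(primary, path)  # second directory entry, same inode
        except OSError:
            # No hard-link support here (or the link was refused): plain copy.
            shutil.copyfile(primary, path)
    return paths
```

Note that anything in the cache that counts sizes or evicts old entries would
have to take link counts into account, since removing one name no longer frees
the data. The sketch does not address that.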
>>>
>>> Thanks,
>>> Frank Klotz
>>> _______________________________________________
>>> ccache mailing list
>>> [email protected]
>>> https://lists.samba.org/mailman/listinfo/ccache
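
For anyone who wants to compare numbers across the whole cache in bytes as
well as in file counts, a rough sketch along the lines of the find/sha1sum
pipeline in this thread follows. The function name duplicate_report and the
keys of the returned dictionary are made up for the example; the
"duplicate_groups" value is meant to correspond to what `uniq -d | wc -l`
counts.

```python
import hashlib
import os
from collections import defaultdict


def duplicate_report(cache_dir, suffix=".o"):
    """Group files under cache_dir by content hash and summarise the duplication."""
    sizes_by_digest = defaultdict(list)  # content digest -> sizes of files with it
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            if not name.endswith(suffix):
                continue
            path = os.path.join(root, name)
            h = hashlib.sha1()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks so large objects are not loaded whole.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            sizes_by_digest[h.hexdigest()].append(os.path.getsize(path))

    groups = sizes_by_digest.values()
    return {
        "files": sum(len(g) for g in groups),
        "bytes": sum(sum(g) for g in groups),
        # Distinct contents that occur more than once.
        "duplicate_groups": sum(1 for g in groups if len(g) > 1),
        # Bytes that would be saved if each content were stored only once.
        "redundant_bytes": sum(sum(g[1:]) for g in groups),
    }
```

Calling duplicate_report(os.path.expanduser("~/.ccache")) would report on the
default cache location used in the commands above. Files that are already
hard-linked to each other are still counted as duplicates by this sketch.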
