Maybe you should try adding an option to check both hashes, and see how that performs.

On Nov 8, 2011 1:03 PM, "Frank Klotz" <[email protected]> wrote:
> On 11/07/2011 10:55 AM, Justin Lebar wrote:
>
>>> Looking at a ccache with about 40,000 .o files in it (created with
>>> direct mode turned on); of the 55 largest files, I found 11 pairs and
>>> one triplet of identical object files. That's almost 25% of redundant
>>> storage that could have been avoided by looking at the preprocessed
>>> hash when there is no hit in direct mode.
>>
>> It's much more interesting to look at the whole cache, I think.
>>
>> $ find -name '*.o' -type f | wc -l
>> 39312
>> jlebar@turing:~/.ccache$ find -name '*.o' -type f | xargs -P16 sha1sum | cut -f 1 -d ' ' | sort | uniq -d | wc -l
>> 1230
>>
>> So it looks like there's some duplication on my machine, but not a
>> ton. I'd be curious if you got significantly different numbers.
>
> Hi Justin,
>
> Here's what I got:
>
> find -name '*.o' -type f | wc -l
> 43507
>
> find -name '*.o' -type f | xargs -P16 sha1sum | cut -f 1 -d ' ' | sort | uniq -d | wc -l
> 13087
>
> I would say the difference is significant - you got 3%, while I got 30%.
>
> This is in a project environment where we have fairly fast churn in a
> number of widely-used header files, and I am guessing that the changes
> in those files often consist of addition of or changes to macros and
> constants that are not used in all the files where the header files get
> included.
>
> Now I am more than ready to agree that there is room for improvement in
> the design and implementation of the header file layouts here, and we
> are working on that; but at the same time it looks to me like a fairly
> straightforward enhancement to ccache could give us some performance
> boost here. Since ccache already knows how to do preprocessed hashes, I
> think all I am looking to do is to have it use that existing code when
> the direct mode hash doesn't get a hit. Then put both hash names on the
> resulting object file, and voila: better hit rates and more unique
> files in the cache.
>
> It would be interesting to know if many others see anything at all
> similar to the numbers I have here (i.e., lots of people could use the
> enhancement I am suggesting), or if there is something unique about
> this environment that leads to this much duplication in the cache.
>
> Thanks,
> Frank
>
>> On Mon, Nov 7, 2011 at 12:49 PM, Frank Klotz
>> <frank.klotz@alcatel-lucent.com> wrote:
>>
>>> Hi Martin,
>>>
>>> Thanks for your responses.
>>>
>>> s/index hash/direct mode hash/g
>>>
>>> Apologies - I had a brain burp and was using the wrong terminology.
>>>
>>> That aside, however, with the advent of direct mode, there ARE two
>>> hashes possible for any given object file - the direct mode hash
>>> (hashing all the sources that go into the compilation) and the
>>> preprocessed hash (hashing the result of running all those sources
>>> through the preprocessor). And any time there is a cache miss, ccache
>>> has computed both those hashes, hasn't it? (Or maybe not - if not, see
>>> discussion below.) And it appears to me that in many cases, the
>>> resulting object file occurs twice in the cache, once under each hash.
>>> And currently, those two occurrences are two separate files, which
>>> could be combined into a single inode with two hard-linked directory
>>> entries.
>>>
>>> Or am I confused about how direct mode interacts with preprocessed
>>> mode? If running in direct mode, does ccache never compute the
>>> preprocessed hash? If not, it obviously could, and I would recommend
>>> that it should. Why?
>>> Because when changes are made to a widely-used header file, it very
>>> commonly occurs that those changes only actually modify the
>>> preprocessor output of a small subset of the sources that include that
>>> header file, while many other sources don't use the changes (say,
>>> definition of new macros or new constants), so they end up with the
>>> same preprocessed output, and the same object file, even though the
>>> input header files and direct mode hash did change. In that case,
>>> ccache could still find hits in the cache with the preprocessed mode,
>>> even if it's a miss with the direct mode hash. If ccache does not get
>>> a direct mode hit, it certainly will have to RUN the preprocessor to
>>> recompile the file - how much extra would it cost to compute the
>>> preprocessed hash, look it up (to avoid recompiling if it is found
>>> with THAT hash), and if a compile is still needed, store the resulting
>>> object file inode with 2 directory entries rather than just one?
>>>
>>> The way I read the doc about how direct mode works, I thought it would
>>> compute the direct mode hash, and if no hit, "fall back to
>>> preprocessed mode". I thought that meant it would compute the
>>> preprocessed hash and look for that too. Is that incorrect - does it
>>> only compute ONE hash in all cases - a direct mode hash if running in
>>> direct mode and a preprocessed hash if not in direct mode? If so, then
>>> let's modify my suggested enhancement to be that in direct mode,
>>> calculate and use the preprocessed hash whenever there is no hit with
>>> direct mode, and create hard links using all computed hashes to the
>>> one single object file inode that eventually exists in the ccache. I
>>> don't think direct mode and preprocessed mode HAVE to be mutually
>>> exclusive - when direct mode gets a miss, preprocessed mode can still
>>> often provide a hit.
>>>
>>> And if no preprocessed hash gets computed/stored when running in
>>> direct mode, then I suspect that the reason I see so many pairs of
>>> identical object files in my ccache is because of the situation I
>>> describe above, where a header file change has triggered a direct mode
>>> hash miss, but preprocessing the sources has resulted in an identical
>>> preprocessed file which was then passed to the compiler which produced
>>> an identical object file. But ccache didn't KNOW that they were
>>> identical, because it didn't compute the preprocessed hash.
>>>
>>> Looking at a ccache with about 40,000 .o files in it (created with
>>> direct mode turned on); of the 55 largest files, I found 11 pairs and
>>> one triplet of identical object files. That's almost 25% of redundant
>>> storage that could have been avoided by looking at the preprocessed
>>> hash when there is no hit in direct mode.
>>>
>>> Thanks,
>>> Frank
>>>
>>>
>>> On 11/07/2011 12:53 AM, Martin Pool wrote:
>>>
>>>> On 5 November 2011 11:12, Frank Klotz
>>>> <frank.klotz@alcatel-lucent.com> wrote:
>>>>
>>>>> I used ccache at my previous employer, and was very convinced of its
>>>>> value. Now that I have started a new job, I am in the process of
>>>>> trying to bring the new shop on board with ccache, so I have been
>>>>> doing lots of test runs and looking at things. Here is one thing I
>>>>> am thinking could add some value.
>>>>>
>>>>> Looking through the ccache, I find many pairs of files which have
>>>>> different names (different hashes), but exactly identical content.
>>>>> This actually makes sense, as each file would have an index hash and
>>>>> a preprocessed hash, and since ccache needs to be able to find a
>>>>> match on either, then both need to be in the cache.
>>>>
>>>> What is an index hash?
>>>>
>>>>> (Actually, thinking about it, I'm a little surprised that there are
>>>>> any files in the ccache that DON'T appear twice - shouldn't EVERY
>>>>> compilation have 2 hashes?)
>>>>
>>>> I don't understand why you would expect that.
>>>>
>>>> It seems like you expect there is another indirection layer by which
>>>> ccache tries to find jobs that produce identical output. I don't
>>>> think there is one at present. I don't think this would happen very
>>>> often in reality, except perhaps for trivial cases like compiling
>>>> empty files, and that's not so important to accelerate, and it will
>>>> not use up much disk space.
>>>>
>>>> If you're getting duplicated cache files due to, for instance, doing
>>>> builds in different directories or from different trees that produce
>>>> identical output, you could change the ccache options to make it less
>>>> stringent.
>>>>
>>>>> But it seems to me that it would make a lot of sense to store the
>>>>> data of these 2 files only once, by hard-linking the 2 names to the
>>>>> same inode. (For filesystems that support hard links, of course!)
>>>>> Every time ccache does an actual compilation and stores a file in
>>>>> the cache, it should store it under hard links for BOTH hashes - the
>>>>> indexed hash and the preprocessed hash. And if it gets a hash miss
>>>>> on the indexed hash but a hit on the preprocessed hash, then it
>>>>> should add the missed index hash as a hard link to the file found.
>>>>> So a given file (inode) in the cache could actually be referenced by
>>>>> MANY directory entries: one preprocessed hash, and multiple index
>>>>> hashes for various different combinations of source files and header
>>>>> files which end up producing the same output when passed through the
>>>>> preprocessor.
>>>>
>>>> This mail is the first time google has heard of "ccache indexed hash"...
>>>>
>>>>> This could increase the storage efficiency of the ccache.
>>>>>
>>>>> Of course, since not every filesystem supports hard links, the
>>>>> simplest solution was of course just to have multiple file copies.
>>>>> So I guess adding code to do this would require some way to
>>>>> determine if the filesystem the cache is on can in fact support
>>>>> hardlinks.
>>>>>
>>>>> If you think this sounds like a good idea, but don't have bandwidth
>>>>> to do it, I would be willing to give it a try. Any hints on where to
>>>>> start would of course be welcome.
>>>>>
>>>>> Thanks,
>>>>> Frank Klotz
>>>>> _______________________________________________
>>>>> ccache mailing list
>>>>> [email protected]
>>>>> https://lists.samba.org/mailman/listinfo/ccache
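
For anyone who wants to experiment with the idea discussed in this thread, here is a rough Python sketch of "check the direct mode hash first, fall back to the preprocessed hash, and give the one stored object both names". It is only a toy model under stated assumptions: the function names (`lookup_or_compile`, `link_or_copy`, `demo`), the flat `<hash>.o` layout, and the use of SHA-1 are all made up for illustration and do not reflect ccache's actual source code or on-disk format.

```python
import hashlib
import os
import shutil
import tempfile


def digest(data: bytes) -> str:
    """Hex digest used as a cache file name (SHA-1 is an arbitrary choice here)."""
    return hashlib.sha1(data).hexdigest()


def link_or_copy(existing: str, new_name: str) -> None:
    """Give a cached object a second name, copying if hard links are unavailable."""
    if os.path.exists(new_name):
        return
    try:
        os.link(existing, new_name)
    except OSError:
        # Filesystem without hard-link support: fall back to a plain copy.
        shutil.copyfile(existing, new_name)


def lookup_or_compile(cache_dir, direct_hash, preprocess, compile_fn):
    """Return (object_path, outcome); outcome is 'direct', 'preprocessed' or 'miss'.

    All parameters are hypothetical: direct_hash is a precomputed string,
    preprocess() returns preprocessed source bytes, compile_fn(bytes) returns
    object-file bytes.
    """
    direct_path = os.path.join(cache_dir, direct_hash + ".o")
    if os.path.exists(direct_path):
        return direct_path, "direct"

    # Direct miss: the preprocessor has to run anyway, so hash its output too.
    preprocessed = preprocess()
    cpp_path = os.path.join(cache_dir, digest(preprocessed) + ".o")
    if os.path.exists(cpp_path):
        link_or_copy(cpp_path, direct_path)
        return direct_path, "preprocessed"

    # Miss on both hashes: compile once, store once, name it twice.
    with open(cpp_path, "wb") as f:
        f.write(compile_fn(preprocessed))
    link_or_copy(cpp_path, direct_path)
    return direct_path, "miss"


def demo():
    """Two 'header revisions' that preprocess to the same text share one compile."""
    compiles = []

    def compile_fn(text):
        compiles.append(text)
        return b"OBJ:" + text

    with tempfile.TemporaryDirectory() as cache:
        outcomes = []
        for direct_hash in ("rev1", "rev2", "rev2"):
            _, outcome = lookup_or_compile(
                cache, direct_hash, lambda: b"int main(){}", compile_fn)
            outcomes.append(outcome)
        return outcomes, len(compiles)
```

Calling `demo()` simulates a header change that alters the direct mode hash ("rev1" to "rev2") without changing the preprocessed output: the second lookup is served through the preprocessed hash, and the compiler stand-in runs only once. Whether the two names actually share an inode depends on the filesystem, which is why `link_or_copy` falls back to copying.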
