Re: [ccache] Using git file hashes for ccache

Justin Lebar Fri, 31 Dec 2010 10:07:27 -0800

> B) index based on hash of file contents, but have ccache maintain a
> database of (file name + attributes) -> (hash of file contents) pairs


> C) index based on hash of file contents, and use git index for looking
> up (file name + attributes) -> (hash of file contents) pairs

> C benefits people who frequently switch their git workspace between
> multiple branches. When switching back to a previously compiled
> branch, the file mtimes will be updated, but the git index shows that
> the contents haven't.

I expect that approach B would speed up ccache direct mode hits
significantly, since most of the time you'd hash only the source file
and reuse the cached hashes of the files it includes.  If ccache itself
runs faster, presumably there would be less to gain by invoking it less
often.
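
A minimal sketch of the lookup that option B implies, written in Python
rather than ccache's C. The class name HashCache, the in-memory dict, the
choice of size plus mtime as the "attributes", and SHA-256 as the hash are
all assumptions made for illustration; none of them is taken from ccache.

```python
import hashlib
import os


class HashCache:
    """Hypothetical (file name + attributes) -> (content hash) cache (option B)."""

    def __init__(self):
        self._entries = {}       # (path, size, mtime_ns) -> hex digest
        self.content_reads = 0   # how many times file contents were hashed

    def digest(self, path):
        st = os.stat(path)
        # Assumption: size and mtime stand in for "attributes".
        key = (os.path.abspath(path), st.st_size, st.st_mtime_ns)
        cached = self._entries.get(key)
        if cached is not None:
            return cached        # attributes unchanged: skip reading the file
        with open(path, "rb") as f:
            value = hashlib.sha256(f.read()).hexdigest()
        self.content_reads += 1
        self._entries[key] = value
        return value
```

Calling digest() repeatedly on an unchanged header reads its contents only
once; a header whose size or mtime has changed misses the cache and is
hashed again.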

I'm very skeptical that we want to add to ccache the kind of
complexity (and tight coupling!) that option C requires.  Furthermore,
it seems to me that some of this logic (e.g. "don't build me because,
although my mtime has changed, my contents haven't") belongs in the
build system.
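
For illustration, the build-system-side check could look roughly like the
Python sketch below. The function name needs_rebuild, the JSON state file,
and SHA-256 are hypothetical choices, and recording the digest at check
time (rather than after a successful build) is a simplification.

```python
import hashlib
import json


def needs_rebuild(source, state_file):
    """Return True if source's contents differ from the recorded digest."""
    with open(source, "rb") as f:
        current = hashlib.sha256(f.read()).hexdigest()
    try:
        with open(state_file) as f:
            recorded = json.load(f)
    except FileNotFoundError:
        recorded = {}            # first run: nothing recorded yet
    if recorded.get(source) == current:
        return False             # mtime may have changed, contents have not
    recorded[source] = current
    with open(state_file, "w") as f:
        json.dump(recorded, f)
    return True
```

A file whose mtime was bumped by a branch switch but whose contents are
identical is reported as not needing a rebuild.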

I'd also guess that C wouldn't be much faster than B, since in the
steady state, B hashes only the source file and has cached hashes of
most or all of the source file's includes.

On Fri, Dec 31, 2010 at 8:12 AM, Michel Lespinasse <wal...@google.com> wrote:
> On Fri, Dec 31, 2010 at 4:27 AM, Wilson Snyder <wsny...@wsnyder.org> wrote:
>> I also think this is a good approach, though having been
>> down the road before, mtime isn't always enough as you
>> noted, but including the size also makes it *almost*
>> perfect.  Most edits change the size.
>>
>> Note several tools like scons use this technique, and some
>> store the hashes in a single hash file inside each source
>> directory.  That has the nice advantage of allowing sharing,
>> though the downside of polluting the source areas so I don't
>> really like it.  I think putting it into the ccache
>> infrastructure is nicer; but you may still want multiple
>> hashes to be stored under a hash of the directory name,
>> instead of a hash of the filename, because that allows
>> reading fewer files.  (Otherwise reading the hundreds of
>> hash files will become the new bottleneck.)
>
> I actually see 3 different variants being discussed in this thread:
>
> A) index based on hash of file name + attributes instead of hash of
> file contents
> B) index based on hash of file contents, but have ccache maintain a
> database of (file name + attributes) -> (hash of file contents) pairs
> C) index based on hash of file contents, and use git index for looking
> up (file name + attributes) -> (hash of file contents) pairs
>
> A is simplest, and would probably work well enough for system include
> files. Not so much for project files though, especially if we want to
> support CCACHE_BASEDIR (ctime/mtime probably won't match across
> checked out versions).
>
> B could work pretty well, I think. There is the question of where to
> store that new database, but it's probably doable - the database is
> only a cache, so it's always OK to expire entries if it grows too
> much.
>
> C benefits people who frequently switch their git workspace between
> multiple branches. When switching back to a previously compiled
> branch, the file mtimes will be updated, but the git index shows that
> the contents haven't. This type of operation is the source of many
> ccache hits for me (after all, the compiler wouldn't even get invoked
> by make if no mtimes had changed).
>
> Making C work seems complicated, as we'd need to be able to read the
> git index. OTOH, this also nicely solves the problem of expiring
> database entries: git is in charge of maintaining the index so we
> don't need to care about it for project files, and out-of-project
> files such as system headers shouldn't change nearly as often so we'd
> hardly ever need to expire them from the ccache database. We could
> even avoid any problems of concurrent database updates by just never
> having ccache update any (file name + attributes) -> (hash of file
> contents) database - git would be in charge of updating its index for
> in-project files, and we could have an out-of-line ccache option to do
> it for infrequently-modified system files...
>
> --
> Michel "Walken" Lespinasse
> A program is never fully debugged until the last user dies.
>
_______________________________________________
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache
