Re: [ccache] Duplicate object files in the ccache - possible optimization?

Frank Klotz Tue, 08 Nov 2011 09:55:56 -0800

 On 11/08/2011 01:36 AM, Martin Pool wrote:

Maybe you should try adding an option to check both hashes, and seehow that performs.

Yes, it looks like that's where I'm headed. I was hoping I might findsomeone else who had already tried this, or thought about it; but itlooks like I'm the pioneer here, so if/as I get time to experiment withit I will try it out.


Thanks much,
Frank

On Nov 8, 2011 1:03 PM, "Frank Klotz" <[email protected]<mailto:[email protected]>> wrote:


     On 11/07/2011 10:55 AM, Justin Lebar wrote:

            Looking at a ccache with about 40,000 .o files in it
            (created with direct
            mode turned on); of the 55 largest files, I found 11 pairs
            and one triplet
            of identical object files.  That's almost 25% of redundant
            storage that
            could have been avoided by looking at the preprocessed
            hash when there is no
            hit in direct mode.

        It's much more interesting to look at the whole cache, I think.

        $ find -name '*.o' -type f | wc -l
        39312
        jlebar@turing:~/.ccache$ find -name '*.o' -type f | xargs -P16
        sha1sum
        | cut -f 1 -d ' ' | sort | uniq -d | wc -l
        1230

        So it looks like there's some duplication on my machine, but not a
        ton.  I'd be curious if you got significantly different numbers.

    Hi Justin,

    Here's what I got:

    > find -name '*.o' -type f | wc -l
    43507
    > find -name '*.o' -type f | xargs -P16 sha1sum | cut -f 1 -d ' '
    | sort | uniq -d | wc -l
    13087

    I would say the difference is significant - you got 3%, while I
    got 30%.

    This is in a project environment where we have fairly fast churn
    in a number of widely-used header files, and I am guessing that
    the changes in those files often consist of addition of or changes
    to macros and constants that are not used in all the files where
    the header files get included.

    Now I am more than ready to agree that there is room for
    improvement in the design and implementation of the header file
    layouts here, and we are working on that; but at the same time it
    looks to me like a fairly straightforward enhancement to ccache
    could give us some performance boost here.  Since ccache already
    knows how to do preprocessed hashes, I think all I am looking to
    do is to have it use that existing code when the direct mode hash
    doesn't get a hit.  Then put both hash names on the resulting
    object file, and voila: better hit rates and more unique files in
    the cache.

    It would be interesting to know if many others see anything at all
    similar to the numbers I have here (i.e., lots of people could use
    the enhancement I am suggesting), of if there is something unique
    about this environment that leads to this much duplication in the
    cache.

    Thanks,
    Frank

        On Mon, Nov 7, 2011 at 12:49 PM, Frank Klotz
        <[email protected]
        <mailto:[email protected]>>  wrote:

             Hi Martin,

            Thanks for your responses.

            s/index hash/direct mode hash/g

            Apologies - I had a brain burp and was using the wrong
            terminology.

            That aside, however, with  the advent of direct mode,
            there ARE two hashes
            possible for any given object file - the direct mode hash
            (hashing all the
            sources that go into the compilation) and the preprocessed
            hash (hashing the
            result of running all those sources through the
            preprocessor).  And any time
            there is a cache miss, ccache has computed both those
            hashes, hasn't it?
             (Or maybe not - if not, see discussion below.)  And it
            appears to me that
            in many cases, the resulting object file occurs twice in
            the cache, once
            under each hash.  And currently, those two occurrences are
            two separate
            files, which could be combined into a single inode with
            two hard-linked
            directory entries.

            Or am I confused about how direct mode interacts with
            preprocessed mode?  If
            running in direct mode, does ccache never compute the
            preprocessed hash?  If
            not, it obviously could, and I would recommend that it
            should.  Why?
             Because when changes are made to a widely-used header
            file, it very
            commonly occurs that those changes only actually modify
            the preprocessor
            output of a small subset of the sources that include that
            header file, while
            many other sources don't use the changes (say, definition
            of new macros or
            new constants), so end up with the same preprocessed
            output, and the same
            object file, even though the input header files and direct
            mode hash did
            change).  In that case, ccache could still find hits in
            the cache with the
            preprocessed mode, even if it's a miss with the direct
            mode hash.  If ccache
            does not get a direct mode hit, it certainly will have to
            RUN the
            preprocessor to recompile the file - how much extra cost
            to compute the
            preprocessed hash, look it up (to avoid recompiling if it
            is found with THAT
            hash), and if a compile is still needed, store the
            resulting object file
            inode with 2 directory entries rather than just one?

            The way I read the doc about how direct mode works, I
            thought it would
            compute the direct mode hash, and if no hit, "fall back to
            preprocessed
            mode".  I thought that meant it would compute the
            preprocessed hash and look
            for that too.   Is that incorrect - does it only compute
            ONE hash in all
            cases - a direct mode hash if running in direct mode and a
            preprocessed hash
            if not in direct mode?  If so, then let's modify my
            suggested enhancement to
            be that in direct mode, calculate and use the preprocessed
            hash whenever
            there is no hit with direct mode, and create hard links
            using all computed
            hashes to the one single object file inode that eventually
            exists in the
            ccache.  I don't think direct mode and preprocessed mode
            HAVE to be mutually
            exclusive - when direct mode gets a miss, preprocessed
            mode can still often
            provide a hit.

            And if no preprocessed hash gets computed/stored when
            running in direct
            mode, then I suspect that the reason I see so many pairs
            of identical object
            files in my ccache is because of the situation I describe
            above, where a
            header file change has triggered a direct mode hash miss,
            but preprocessing
            the sources has resulted in an identical preprocessed file
            which was then
            passed to the compiler which produced an identical object
            file.  But ccache
            didn't KNOW that they were identical, because it didn't
            compute the
            preprocessed hash.

            Looking at a ccache with about 40,000 .o files in it
            (created with direct
            mode turned on); of the 55 largest files, I found 11 pairs
            and one triplet
            of identical object files.  That's almost 25% of redundant
            storage that
            could have been avoided by looking at the preprocessed
            hash when there is no
            hit in direct mode.

            Thanks,
            Frank


            On 11/07/2011 12:53 AM, Martin Pool wrote:

                On 5 November 2011 11:12, Frank
                Klotz<[email protected]
                <mailto:[email protected]>>
                 wrote:

                     I used ccache at my previous employer, and was
                    very convinced of its
                    value.
                     Now that I have started a new job, I am in the
                    process of trying to
                    bring
                    the new shop on board with ccache, so I have been
                    doing lots of test runs
                    and looking at things.  Here is one thing I am
                    thinking could add some
                    value.

                    Looking through the ccache, I find many pairs of
                    files which have
                    different
                    names (different hashes), but exactly identical
                    content.  This actually
                    makes sense, as each file would have an index hash
                    and a preprocessed
                    hash,
                    and since ccache needs to be able to find a match
                    on either, then both
                    need
                    to be in the cache.

                What is an index hash?

                     (Actually, thinking about it, I'm a little surprised
                    that there are any files in the ccache that DON'T
                    appear twice -
                    shouldn't
                    EVERY compilation have 2 hashes?)

                I don't understand why you would expect that.

                It seems like you expect there is another indirection
                layer by which
                ccache tries to find jobs that produce identical
                output.  I don't
                think there is one at present.  I don't think this
                would happen very
                often in reality, except perhaps for trivial cases
                like compiling
                empty files, and that's not so important to
                accelerate, and it will
                not use up much disk space.

                If you're getting duplicated cache files due to for
                instance doing
                builds in different directories or from different
                trees that produce
                identical output you could change the ccache options
                to make it less
                stringent.

                    But it seems to me that it would make a lot of
                    sense to store the data of
                    these 2 files only once, by hard-linking the 2
                    names to the same inode.
                     (For filesystems that support hard links, of
                    course!)  Every time ccache
                    does an actual compilation and stores a file in
                    the cache, it should
                    store
                    it under hard links for BOTH hashes - the indexed
                    hash and the
                    proprocessed
                    hash.  And if it gets a hash miss on the indexed
                    hash but a hit on the
                    preprocessed hash, then it should add the missed
                    index hash as a hard
                    link
                    to the file found.  So a given file (inode) in the
                    cache could actually
                    be
                    referenced by MANY directory entries: one
                    preprocessed hash, and multiple
                    index hashes for various different combinations of
                    source files and
                    header
                    files which end up producing the same output when
                    passed through the
                    preprocessor.

                This mail is the first time google has heard of
                "ccache indexed hash"...

                    This could increase the storage efficiency of the
                    ccache.

                    Of course, since not every filesystem supports
                    hard links, the simplest
                    solution was of course just to have multiple file
                    copies.  So I guess
                    adding
                    code to do this would require some way to
                    determine if the filesystem the
                    cache is on can in fact support hardlinks.

                    If you think this sounds like a good idea, but
                    don't have bandwidth to do
                    it, I would be willing to give it a try.  Any
                    hints on where to start
                    would
                    of course be welcome.

                    Thanks,
                    Frank Klotz
                    _______________________________________________
                    ccache mailing list
                    [email protected] <mailto:[email protected]>
                    https://lists.samba.org/mailman/listinfo/ccache


            _______________________________________________
            ccache mailing list
            [email protected] <mailto:[email protected]>
            https://lists.samba.org/mailman/listinfo/ccache


_______________________________________________
ccache mailing list
[email protected]
https://lists.samba.org/mailman/listinfo/ccache

Re: [ccache] Duplicate object files in the ccache - possible optimization?

Reply via email to