Re: [ccache] direct mode design bug

2012-11-07 Thread Joel Rosdahl
Many thanks for the answer!

On 5 November 2012 14:53, Andrew Stubbs a...@codesourcery.com wrote:

 My first reaction to this issue, rightly or wrongly, is that it's more of
 a documentation issue than a real bug. I mean, it can only occur if two
 people share a cache, or if the user installs new software and then reuses
 an old cache.


It can happen in other cases as well. Contrived example, but still:

rm -rf subdir file.c config.h
echo '#include "config.h"' >file.c
mkdir subdir
echo '#warning subdir/config.h used' >subdir/config.h
sleep 1
ccache gcc -Isubdir -c file.c
# User: Oops, forgot to create ./config.h.
echo '#warning config.h used' >config.h
sleep 1
ccache gcc -Isubdir -c file.c
# User: Wat? Why isn't ./config.h used?


For a real life example, see
https://bugzilla.samba.org/show_bug.cgi?id=8424#c0.

 If the documentation simply says that you have to wipe your cache whenever
 you do that sort of thing then does that solve the problem?


It would be nice if ccache were only used and enabled by conscious users
who have read and understood the documentation, but in reality that doesn't
happen in many cases. For instance, Linux distributions like Fedora install
and enable ccache by default (masquerading the system compiler), at least
when installing the development environment or similar. That's not
surprising given that ccache works very well for most people and that it is
advertised as being very safe.

There are several other cases where ccache's behavior doesn't fully match
that of the real compiler - I'm just a bit worried that the direct mode
issue we're discussing perhaps is too much of a behavior mismatch.

Hm. Come to think of it, nothing stops Fedora et al. from disabling direct
mode by default even if ccache's own default is to enable it.

 A similar issue, albeit not so interesting, perhaps, is what happens when a
 user changes some part of the toolchain, but does not alter the gcc
 binary. Ccache won't notice a new back-end compiler, a new assembler, a new
 linker, a new default specs file or anything like that. Chances are that
 any differences in the output are harmless, but the cached objects are
 technically invalid.


Right. However, isn't the fact that ccache may be affected by toolchain
changes much less surprising than the fact that ccache may fail to pick up
header files correctly?


 [In fact, I have a use-case in which I have multiple users sharing a
 cache, and I wanted to be able to uniquely identify the same toolchain
 across all the installations. The mtime etc. varies from machine to
 machine, as might the exact tool mix, so I have some local patches to do a
 much deeper hash of the toolchain binaries, and include those in the object
 hashes. Even then, in the interests of performance, those toolchain IDs are
 cached according to the location and mtime, so changing the binutils will
 cause temporarily wrong toolchain hashes. Would you be interested in such a
 feature upstream?]


Perhaps. It depends on how intrusive it is and how toolchain-specific it is.

 3. ccache could try to imitate what the preprocessor does.


 Yuck. If you can program a faster preprocessor I'm sure the GCC folks
 would love to see it.


Thankfully, my suggestion wasn't to create a preprocessor substitute. :-)

 You wouldn't get to dumb much down unless you're fine with running both
 your own preprocessor and then the real one for the preprocessor mode cache
 check.


Yes, that's of course what I had in mind.


 Even if you only wanted to look for #include statements you'd still need
 to evaluate all the #if directives.


Not sure about that. I may be overlooking something, but ccache would only
have to follow all #include statements and note all header files that don't
exist in the include path list. (When #include is used with a #defined
token for the filename, fall back to the real compiler.) When considering a
potential cache hit, reject it if any of the header files that didn't exist
then exist now.
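
A minimal sketch of that idea, in Python rather than ccache's C and purely
for illustration. The names scan, hit_is_still_valid and UnsupportedInclude
are made up here and are not part of ccache. The sketch follows every
#include line textually, without evaluating #if directives, stripping
comments or handling #include_next, and records each path that was probed
but did not exist. An #include whose filename comes from a macro raises an
exception, which stands in for "fall back to the real compiler".

```python
import os
import re

# Matches '#include "name"' and '#include <name>'.
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*(?:"([^"]+)"|<([^>]+)>)')
# Matches any '#include' line, including ones with a macro as the filename.
ANY_INCLUDE_RE = re.compile(r'^\s*#\s*include\b')


class UnsupportedInclude(Exception):
    """Raised when an #include cannot be resolved textually."""


def scan(source, include_dirs, missing=None, seen=None):
    """Follow #include lines starting at `source`.

    Returns the set of candidate header paths that were probed but did
    not exist at scan time.
    """
    missing = set() if missing is None else missing
    seen = set() if seen is None else seen
    source = os.path.normpath(source)
    if source in seen:
        return missing
    seen.add(source)
    with open(source, errors="replace") as f:
        lines = f.readlines()
    for line in lines:
        if not ANY_INCLUDE_RE.match(line):
            continue
        m = INCLUDE_RE.match(line)
        if m is None:
            raise UnsupportedInclude(line.strip())
        quoted, angled = m.groups()
        name = quoted or angled
        # A quoted include searches the including file's directory first.
        dirs = [os.path.dirname(source) or "."] if quoted else []
        dirs += list(include_dirs)
        for d in dirs:
            candidate = os.path.normpath(os.path.join(d, name))
            if os.path.isfile(candidate):
                scan(candidate, include_dirs, missing, seen)
                break
            missing.add(candidate)
    return missing


def hit_is_still_valid(missing):
    """Reject a potential cache hit if a once-missing header now exists."""
    return not any(os.path.exists(p) for p in missing)
```

For the contrived example earlier in this message, scan("file.c",
["subdir"]) would record ./config.h as missing, and hit_is_still_valid
would return False once ./config.h has been created. A real implementation
would also need the compiler's default include directories in include_dirs.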

 Anybody got other ideas?


 Running the compiler with -v prints the header search directories. You
 could use that to do your own scan.


To use the directories from cpp -v (plus directories from the command
line) to do some optimistic validation was my first thought as well, but
after thinking more about it I came to the conclusion that it wouldn't buy
much safety because no subdirectories will be checked, and you can't tell
which subdirectories to check unless you have parsed the #include
statements. Also, it would trigger many false negatives.

 BTW, gcc has an option --trace-includes that might be faster than
 scanning the preprocessor output, although the compiler still has to do all
 the same work. Like this: gcc -E hello.c -o /dev/null.


How do you use --trace-includes? It doesn't seem to be documented and
nothing happens when I try it.

 Please leave it on. The difference is like night and day, and the bug is
 rare and avoidable.


OK, so far we have one vote for and zero against. Any others?

Re: [ccache] direct mode design bug

2012-11-07 Thread Joel Rosdahl
On 5 November 2012 16:31, Andrew Stubbs a...@codesourcery.com wrote:

 Incidentally, you appear to have committed a patch updating the
 documentation stating that direct mode is off by default, but in the code
 direct_mode is still true, by default.


Yes, I started sketching out disabling it by default but stopped halfway
because I couldn't make up my mind at the time. I'll fix it, thanks.

-- Joel
___
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache


Re: [ccache] direct mode design bug

2012-11-07 Thread Eitan Adler
On 7 November 2012 14:19, Joel Rosdahl j...@rosdahl.net wrote:
 Hm. Come to think of it, nothing stops Fedora et al. from disabling direct
 mode by default even if ccache's own default is to enable it.

As a package maintainer I would like to discourage this view.
Downstream maintainers shouldn't have to modify the upstream default
except in extreme cases. This makes things confusing for the users and
results in weird questions on the mailing lists.

-- 
Eitan Adler


Re: [ccache] direct mode design bug

2012-11-05 Thread Andrew Stubbs

On 04/11/12 19:10, Joel Rosdahl wrote:

The direct mode, which was introduced in version 3.0 almost three years
ago, has a design bug. The essence of the problem is that in the direct
mode, ccache records header files that were used by the compiler, but it
doesn't record header files that were not used but could have been used if
they existed. So, when ccache checks if a result could be taken from
the cache, it can't check if the existence of a new header file should
invalidate the result.


My first reaction to this issue, rightly or wrongly, is that it's more 
of a documentation issue than a real bug. I mean, it can only occur if 
two people share a cache, or if the user installs new software and then 
reuses an old cache. If the documentation simply says that you have to 
wipe your cache whenever you do that sort of thing then does that solve 
the problem?


A similar issue, albeit not so interesting, perhaps, is what happens 
when a user changes some part of the toolchain, but does not alter the 
gcc binary. Ccache won't notice a new back-end compiler, a new 
assembler, a new linker, a new default specs file or anything like that. 
Chances are that any differences in the output are harmless, but the 
cached objects are technically invalid.


Having said all that, if Ccache Just Worked, that would be no bad thing.

[In fact, I have a use-case in which I have multiple users sharing a 
cache, and I wanted to be able to uniquely identify the same toolchain 
across all the installations. The mtime etc. varies from machine to 
machine, as might the exact tool mix, so I have some local patches to do 
a much deeper hash of the toolchain binaries, and include those in the 
object hashes. Even then, in the interests of performance, those 
toolchain IDs are cached according to the location and mtime, so 
changing the binutils will cause temporarily wrong toolchain hashes. 
Would you be interested in such a feature upstream?]
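
One possible reading of that scheme, sketched in Python. The patches
themselves are not shown in this thread, so the function name, the cache
layout and the choice of SHA-256 are all assumptions made for illustration.
The ID is a deep hash over the contents of every tool, but it is cached
under the compiler's location and mtime only.

```python
import hashlib
import os


def toolchain_id(compiler, other_tools, cache):
    """Return a content hash of the whole toolchain.

    `cache` is a dict keyed by (compiler path, compiler mtime). This is a
    hypothetical layout: the lookup is cheap, but it does not notice
    changes to the other tools.
    """
    key = (os.path.abspath(compiler), os.stat(compiler).st_mtime_ns)
    if key not in cache:
        h = hashlib.sha256()
        for path in [compiler, *other_tools]:
            with open(path, "rb") as f:
                h.update(f.read())
        cache[key] = h.hexdigest()
    return cache[key]
```

Replacing the assembler or linker leaves the compiler's path and mtime
untouched, so under this reading the cached ID stays stale until the entry
is dropped, which would match the "temporarily wrong toolchain hashes"
described above.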



1. ccache could use strace or similar ways of monitoring the compiler and
tracing the performed system calls to find out where headers were probed. I
haven't measured, but I suspect that this would be slow.


ptrace is quite easy to use, but it would be slow, and not terribly 
portable, plus you'd have to ignore all the other stat gubbins that a 
toolchain indulges in.



2. ccache could override strategic functions using LD_PRELOAD, thus
snooping on system calls without involving the kernel. This should be
possible and quite fast, but it's tricky to get right, and it's not very
portable. (By the way: This is what
http://audited-objects.sourceforge.net does, although I don't know if
it monitors and acts on probes of nonexistent files.)


Faster, but more fragile, and I still don't like it.


3. ccache could try to imitate what the preprocessor does. That is, read
the source code file and follow #include statements instead of looking at
the preprocessor output. This essentially means implementing a dumbed down
version of a preprocessor, a task that doesn't sound trivial: It has to be
significantly faster than the real preprocessor to make a difference, it
will be more coupled to the behavior of the compiler and its various
options (-I, -idirafter, -isystem, etc), and it probably has to know the
compiler's default include directories.


Yuck. If you can program a faster preprocessor I'm sure the GCC folks 
would love to see it. You wouldn't get to dumb much down unless you're 
fine with running both your own preprocessor and then the real one for 
the preprocessor mode cache check. Even if you only wanted to look for 
#include statements you'd still need to evaluate all the #if directives. 
You could make it faster by ignoring the tokenization pass, but then 
you'd get other subtle bugs.



Anybody got other ideas?


Running the compiler with -v prints the header search directories. You 
could use that to do your own scan. It would be difficult to 
differentiate files specified by the user with absolute paths from files 
found by the compiler.


I suggest it would be better to do just the minimum to determine if a 
cached file is unsafe. Perhaps you could hash the directory stat for the 
include directories listed by gcc -v? (I've checked, and there doesn't 
seem to be a -print-... option for the include path.)


E.g. gcc -v -c hello.c gives:
.
ignoring nonexistent directory /usr/local/include/x86_64-linux-gnu
ignoring nonexistent directory /usr/lib/gcc/x86_64-linux-gnu/4.7/../../../../x86_64-linux-gnu/include

#include "..." search starts here:
#include <...> search starts here:
 /usr/lib/gcc/x86_64-linux-gnu/4.7/include
 /usr/local/include
 /usr/lib/gcc/x86_64-linux-gnu/4.7/include-fixed
 /usr/include/x86_64-linux-gnu
 /usr/include
End of search list.
..

so, you could stat the directories listed, and disallow direct mode if 
the mtime has changed since the manifest was last written. The paths to 
stat could be cached in the manifest.
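
A rough sketch of that check in Python, for illustration only. The function
names and the shape of the stored snapshot are invented here, and a plain
dict stands in for whatever the manifest would actually hold.

```python
import os


def parse_search_dirs(verbose_output):
    """Extract the include directories from the text printed by gcc -v."""
    dirs = []
    in_list = False
    for line in verbose_output.splitlines():
        if line.startswith("#include") and line.rstrip().endswith(
            "search starts here:"
        ):
            in_list = True
        elif line.startswith("End of search list."):
            in_list = False
        elif in_list and line.startswith(" "):
            dirs.append(line.strip())
    return dirs


def snapshot(dirs):
    """Map each directory to its mtime in nanoseconds, or None if missing."""
    result = {}
    for d in dirs:
        try:
            result[d] = os.stat(d).st_mtime_ns
        except OSError:
            result[d] = None
    return result


def direct_mode_allowed(stored):
    """Allow direct mode only if no recorded directory has changed."""
    return snapshot(stored.keys()) == stored
```

The snapshot would be taken when the manifest is written and compared again
before a direct mode hit is accepted. Note that a directory's mtime only
changes when entries are added to or removed from that directory itself, so
a header appearing in a subdirectory would go unnoticed.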


Extra points if direct mode only fails when 

Re: [ccache] direct mode design bug

2012-11-05 Thread Andrew Stubbs

On 04/11/12 19:10, Joel Rosdahl wrote:

Since a quick fix likely isn't possible in the short term, and I would like
to release ccache 3.2 soon, we have to decide whether the direct mode
should default to off or on. Please share any opinions!


Incidentally, you appear to have committed a patch updating the 
documentation stating that direct mode is off by default, but in the 
code direct_mode is still true, by default.


Andrew
