On 26 Feb 2013, Paul Eggert outgrape: > On 02/26/13 13:48, Nix wrote: > >> #2 0x00000000004021e0 in test_utimens (print=print@entry=true, >> func=0x401890 <do_utimensat>) at test-utimens.h:35 > > If that line number is right, the test program > did a creat (file, 0600) that succeeded, followed > by a stat (file, &st) that failed.
Looks like it. Running test-utimensat in a tight loop shows frequent assertion failures: running it under high load (e.g. while a 'make -j 9 check' is running) shows fewer: No load: % START=$(date +%s); while [[ $(($(date +%s)-$START)) -lt 10 ]]; do ./test-utimensat; done test-lutimens.h:162: assertion failed test-lutimens.h:162: assertion failed test-utimens.h:103: assertion failed test-utimens.h:41: assertion failed test-lutimens.h:162: assertion failed Parallel make -j 9 check: % START=$(date +%s); while [[ $(($(date +%s)-$START)) -lt 10 ]]; do ./test-utimensat; done test-utimens.h:113: assertion failed test-utimens.h:37: assertion failed test-utimens.h:130: assertion failed test-utimens.h:113: assertion failed test-utimens.h:103: assertion failed test-lutimens.h:64: assertion failed test-utimens.h:128: assertion failed test-utimens.h:103: assertion failed test-lutimens.h:64: assertion failed I note that this is not exactly healthy hardware: it is has a slowly failing fan and is going in for repair tomorrow. However it isn't overheating yet and *did* just complete a GCC LTO bootstrap-and-test with no FAILs I wasn't expecting so I don't think the fan failure is likely to be the cause of these assertions. As is typical of timing-related faults, running it under strace makes the frequency of the problems plunge. I've stuck a tarball of all the distinct failing straces I could get in a few hours of tight loops under strace up at <http://www.esperi.org.uk/~nix/bugs/gnulib/utimensat.tar.gz>. (I truncated the execve lines because dumping my entire process environment out in public is not something I'm entirely happy doing. Paranoia, maybe. :) ) Classifying the small set of assertion failures above (a set small enough that I felt I could attack it without dying of boredom), we see: test-utimens.h:37: assertion failed ASSERT (func (BASE "file", NULL) == 0); test-utimens.h:113: assertion failed ASSERT (st3.st_atime == Y2K); test-utimens.h:128: assertion failed ASSERT (get_stat_mtime_ns (&st3) == get_stat_mtime_ns (&st2)); test-lutimens.h:64: assertion failed test-lutimens.h:162: assertion failed test-utimens.h:41: assertion failed test-utimens.h:103: assertion failed test-utimens.h:130: assertion failed ASSERT (ctime_compare (&st1, &st2) < 0); ASSERT (ctime_compare (&st2, &st3) < 0); I thought it was worth my analyzing at least the largest set of failures, this one. This is simply a race. When these assertions are hit, I see things like this: st1.st_ctime: 1361977729; st2.st_ctime: 1361977729; compare: 0; get_stat_ctime_ns(st1): 613474299, get_stat_ctime_ns(st2): 613474299 i.e. the nsec values are identical. This is surely valid: it just means the two stats were carried out in the same 1ns interval. All these asserts should be <= 0. (I just saw an identical failure in test-lchown.h:112.)
