Re: _AM_FILESYSTEM_TIMESTAMP_RESOLUTION incorrect result (was Re: make check(s) pre-release problems)
Hi Zack,

    Date: Fri, 07 Oct 2022 11:35:41 -0400
    From: Zack Weinberg
    [...]
    the filesystem timestamp resolution was incorrectly detected:

Your analysis sounds plausible to me, but it's not obvious to me how
best to fix it. ls --full-time or stat may not be available.

Maybe just do the test again if the first ls -t "succeeds", i.e., if
the first 0.1sec straddles a second boundary, the second presumably
won't? Or is there a better way?

I wonder if you could easily make a patch for it (in m4/sanity.m4), if
you already have a setup on a second-granularity fs? (I don't.)
--thanks, karl.

[...]
> The filesystem I'm working in only records timestamps at second
> granularity. (I don't know why ls is printing .1 instead of .0 but
> it's always .1.)
>
>   $ touch A && sleep 0.1 && touch B && ls --full-time -t A B
>   -rw-r--r-- 1 zweinber users 0 2022-10-07 11:20:05.1 -0400 A
>   -rw-r--r-- 1 zweinber users 0 2022-10-07 11:20:05.1 -0400 B
>
> I *think* this is a bug in _AM_FILESYSTEM_TIMESTAMP_RESOLUTION where,
> if it starts looping at *just the wrong time*, the first 0.1-second
> sleep will straddle a second boundary and we'll break out of the loop
> immediately and think we have 0.1-second timestamp resolution.
>
> zw
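For reference, here is a rough shell sketch of the "do the probe twice
and keep the coarser answer" idea. It is only an illustration, not the
actual m4/sanity.m4 code: the probe_resolution helper and the conftest
file names are made up, and the real macro's logic may differ.

  # Hypothetical sketch of the "measure twice" idea, not the real
  # _AM_FILESYSTEM_TIMESTAMP_RESOLUTION implementation.
  probe_resolution () {
    # Report the smallest sleep after which ls -t sees conftest.b as
    # strictly newer than conftest.a; fall back to 1 second.
    for res in 0.01 0.1 1; do
      rm -f conftest.a conftest.b
      touch conftest.a
      sleep "$res"
      touch conftest.b
      # ls -t lists the newer file first only when the filesystem can
      # actually tell the two timestamps apart.
      set x `ls -t conftest.a conftest.b`
      if test "$2" = conftest.b; then
        echo "$res"
        return
      fi
    done
    echo 1
  }

  first=`probe_resolution`
  second=`probe_resolution`
  # Keep the coarser of the two answers, so a single probe that happens
  # to straddle a second boundary cannot win on its own.
  case $first:$second in
    *:1 | 1:*)     resolution=1    ;;
    *:0.1 | 0.1:*) resolution=0.1  ;;
    *)             resolution=0.01 ;;
  esac

A repeat probe of this kind costs only a fraction of a second in the
common case, which seems to fit the suggestion above better than relying
on ls --full-time or stat being available.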
Re: make check(s) pre-release problems
    I actually wonder if your sudden "parallelism" failure could be
    somehow linked to an update of bash, similar to mine ?

Good idea, but my bash hasn't changed ...

I don't doubt there would be plenty more failures with the new SHLVL
change (any such change seems like a terrible idea, but oh well).
Well, more fun in the future. -k
Re: make check(s) pre-release problems
Hello,

I don't know if that will help, or if that is completely unrelated, but
I'm currently stumbling into a weird issue while working on a new
package release for autoconf on Fedora: about 200 tests are now
failing, all related to aclocal checks.

My current investigation shows that it would be related to a bash
update from 5.1.x to 5.2.x, which seems to have changed the behavior of
the "SHLVL" shell variable.

I actually wonder if your sudden "parallelism" failure could be somehow
linked to an update of bash, similar to mine ?

Fred.

On Fri, Oct 7, 2022 at 6:11 PM Zack Weinberg wrote:
> On Thu, Oct 6, 2022, at 4:25 PM, Karl Berry wrote:
> > No errors on RHEL7+autoconf2.71
> >
> > Puzzling. Can you easily try RHEL8 or one of its derivatives?
> > It surprises me that that is the culprit, but it seems possible.
>
> Unfortunately, no. CMU is mostly an Ubuntu shop these days. It's only
> dumb luck that I happen to have access to a machine that hasn't had a
> major system upgrade since 2012 (and with my day-job hat on I'm
> actively trying to *get* it upgraded -- to Ubuntu.)
>
> zw
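(A quick way to compare the SHLVL behavior of two bash builds side by
side is something like the snippet below. It is only an illustration
for checking the suspicion locally; it is not the failing aclocal test,
and the binary paths are examples.)

  # Run the same probe under each bash build and compare the output.
  for sh in /bin/bash /usr/local/bin/bash; do   # example paths
    "$sh" --version | sed q
    "$sh" -c 'echo "outer SHLVL=$SHLVL"; bash -c "echo nested SHLVL=\$SHLVL"'
  done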
Re: make check(s) pre-release problems
Please don't top-post on this mailing list.

On Tue, Oct 11, 2022, at 12:15 PM, Frederic Berat wrote:
> On Fri, Oct 7, 2022 at 6:11 PM Zack Weinberg wrote:
>> On Thu, Oct 6, 2022, at 4:25 PM, Karl Berry wrote:
>>> No errors on RHEL7+autoconf2.71
>>>
>>> Puzzling. Can you easily try RHEL8 or one of its derivatives?
>>> It surprises me that that is the culprit, but it seems possible.
>>
>> Unfortunately, no.
>
> I don't know if that will help, or if that is completely unrelated,
> but I'm currently stumbling into a weird issue while working on a new
> package release for autoconf on Fedora: about 200 tests are now
> failing, all related to aclocal checks.
>
> My current investigation shows that it would be related to a bash
> update from 5.1.x to 5.2.x which seems to have changed the behavior
> of the "SHLVL" shell variable.

This is a known issue, fixed on Autoconf development trunk, see
https://git.savannah.gnu.org/cgit/autoconf.git/commit/?id=412166e185c00d6eacbe67dfcb0326f622ec4020
I intend to make a bugfix release of Autoconf in the near future.

> I actually wonder if your sudden "parallelism" failure could be
> somehow linked to an update of bash, similar to mine ?

It's certainly possible.

zw
_AM_FILESYSTEM_TIMESTAMP_RESOLUTION incorrect result (was Re: make check(s) pre-release problems)
On Thu, Oct 6, 2022, at 4:19 PM, Zack Weinberg wrote:
> On Thu, Oct 6, 2022, at 1:04 PM, Zack Weinberg wrote:
>> On 2022-10-04 6:58 PM, Karl Berry wrote:
>>> Perhaps easier to debug: there are two targets to be run before
>>> making a release, check-no-trailing-backslash-in-recipes and
>>> check-cc-no-c-o, to try to ensure no reversion wrt these features.
>>> A special shell and compiler are configured, respectively (shell
>>> scripts that check the behavior).
>>
>> I'm running these targets now and will report what I get.
>
> No errors on RHEL7+autoconf2.71 (same environment I used for the
> Python fixes) from a serial "make check-no-trailing-backslash-in-recipes".
> The other one is running now.

One failure from a serial "make check-cc-no-c-o":

  FAIL: t/aclocal-autoconf-version-check
  == Running from installcheck: no
  Test Protocol: none
  ...
  configure: error: newly created file is older than distributed files!
  Check your system clock
  make: *** [config.status] Error 1

This doesn't appear to have anything to do with "make check-cc-no-c-o"
mode, the problem is that the filesystem timestamp resolution was
incorrectly detected:

  configure:1965: checking whether sleep supports fractional seconds
  configure:1979: result: true
  configure:1982: checking the filesystem timestamp resolution
  configure:2020: result: 0.1
  configure:2024: checking whether build environment is sane
  configure:2079: error: newly created file is older than distributed files!
  Check your system clock

The filesystem I'm working in only records timestamps at second
granularity. (I don't know why ls is printing .1 instead of .0 but
it's always .1.)

  $ touch A && sleep 0.1 && touch B && ls --full-time -t A B
  -rw-r--r-- 1 zweinber users 0 2022-10-07 11:20:05.1 -0400 A
  -rw-r--r-- 1 zweinber users 0 2022-10-07 11:20:05.1 -0400 B

I *think* this is a bug in _AM_FILESYSTEM_TIMESTAMP_RESOLUTION where,
if it starts looping at *just the wrong time*, the first 0.1-second
sleep will straddle a second boundary and we'll break out of the loop
immediately and think we have 0.1-second timestamp resolution.

zw
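To make the suspected race concrete, here is a small shell loop (not
the macro itself; it only mimics the probe) showing how a single
0.1-second measurement can "succeed" on a filesystem with one-second
timestamps: run it long enough and one probe eventually straddles a
second boundary.

  # Assumes a filesystem with one-second timestamp granularity.
  i=0
  while :; do
    i=`expr $i + 1`
    rm -f conftest.a conftest.b
    touch conftest.a
    sleep 0.1
    touch conftest.b
    # Normally both files get the same whole-second mtime, so conftest.b
    # is not reported as strictly newer and the probe keeps failing ...
    if test conftest.b = "`ls -t conftest.a conftest.b | sed q`"; then
      # ... but when the 0.1 s sleep crosses a second boundary, conftest.b
      # looks newer, and a resolution probe would wrongly conclude 0.1 s.
      echo "false positive on try $i"
      break
    fi
  done

On average this should only take a handful of iterations, since roughly
one in ten 0.1-second probes straddles a boundary, which matches how
easily the configure run above went wrong.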
Re: make check(s) pre-release problems
On Thu, Oct 6, 2022, at 4:25 PM, Karl Berry wrote:
> No errors on RHEL7+autoconf2.71
>
> Puzzling. Can you easily try RHEL8 or one of its derivatives?
> It surprises me that that is the culprit, but it seems possible.

Unfortunately, no. CMU is mostly an Ubuntu shop these days. It's only
dumb luck that I happen to have access to a machine that hasn't had a
major system upgrade since 2012 (and with my day-job hat on I'm
actively trying to *get* it upgraded -- to Ubuntu.)

zw
Re: make check(s) pre-release problems
On Thu, Oct 6, 2022 at 1:28 PM Karl Berry wrote:
> > No errors on RHEL7+autoconf2.71
>
> Puzzling. Can you easily try RHEL8 or one of its derivatives?
> It surprises me that that is the culprit, but it seems possible.
>
> I'm using autoconf-2.71, make-4.3, etc., compiled from source, but am
> using the OS-provided coreutils. I think I'll try compiling that from
> source.

My problem, at least on F36, was that I'd been using a version of GNU
make I probably built from git around July 11, 2021(!) -- "-v" reports
4.3.90, which is not helpful - I would have preferred to know the
commit. Once I installed the official 4.3.90, that made it so all of
HACKING's pre-release commands pass for me:

  make bootstrap
  make -j12 check keep_testdirs=yes
  make maintainer-check
  make -j12 distcheck                                # regular distcheck
  make -j12 distcheck AM_TESTSUITE_MAKE="make -j$j"  # parallelize makes
  make -j12 check-no-trailing-backslash-in-recipes
  make -j12 check-cc-no-c-o

FTR, using autoconf (GNU Autoconf) 2.72a.57-8b5e2
Re: make check(s) pre-release problems
    No errors on RHEL7+autoconf2.71

Puzzling. Can you easily try RHEL8 or one of its derivatives?
It surprises me that that is the culprit, but it seems possible.

I'm using autoconf-2.71, make-4.3, etc., compiled from source, but am
using the OS-provided coreutils. I think I'll try compiling that from
source.

Thanks to everyone for the suggestions and help. -k
Re: make check(s) pre-release problems
On Thu, Oct 6, 2022, at 1:04 PM, Zack Weinberg wrote:
> On 2022-10-04 6:58 PM, Karl Berry wrote:
>> Perhaps easier to debug: there are two targets to be run before
>> making a release, check-no-trailing-backslash-in-recipes and
>> check-cc-no-c-o, to try to ensure no reversion wrt these features.
>> A special shell and compiler are configured, respectively (shell
>> scripts that check the behavior).
>
> I'm running these targets now and will report what I get.

No errors on RHEL7+autoconf2.71 (same environment I used for the Python
fixes) from a serial "make check-no-trailing-backslash-in-recipes".
The other one is running now.

zw
Re: make check(s) pre-release problems
On 2022-10-04 6:58 PM, Karl Berry wrote:
> With Zack's latest Python fixes, I was hoping to move towards an
> Automake release, but I find myself stymied by apparently random and
> unreproducible test failures. I haven't exhausted every conceivable
> avenue yet, but I thought I would write in hopes that others (Zack,
> past Automake developers, anyone else ...) could give it a try,
> and/or have some insights.
>
> For me, running a parallel make check (with or without parallelizing
> the "internal" makes), or make distcheck, fails some tests, e.g.,
> nodef, nodef2, testsuite-summary-reference-log. The exact tests that
> fail changes from run to run. Running the tests on their own
> succeeds. Ok, so it's something in the parallelism. But why? And how
> to debug?

I can't reproduce this problem myself, but my first thought is that
some of the tests, when run concurrently, could be overwriting each
other's files somehow. I can think of two ways to investigate that
hypothesis: look for tests that write files outside a directory
dedicated to that test, and, after a failed test run, look for files
that are corrupted, then try to figure out which tests would be
stomping on those files.

> Perhaps easier to debug: there are two targets to be run before
> making a release, check-no-trailing-backslash-in-recipes and
> check-cc-no-c-o, to try to ensure no reversion wrt these features.
> A special shell and compiler are configured, respectively (shell
> scripts that check the behavior).

I'm running these targets now and will report what I get.

zw
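One rough way to gather evidence for that hypothesis is to timestamp
the tree before a parallel run and then list everything the run touched
outside the per-test scratch directories. The t/*.dir pattern and the
.log/.trs exclusions below are assumptions about the harness layout,
not verified names; adjust to whatever the tests actually use.

  touch .check-stamp
  make -j12 check || true
  # Files written during the run that do not live in a per-test
  # directory are candidates for cross-test interference.
  find . -type f -newer .check-stamp \
       ! -path './t/*.dir/*' ! -name '*.log' ! -name '*.trs' \
       | sort > touched-outside-testdirs.txt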
Re: make check(s) pre-release problems
> On 4 Oct 2022, at 23:58, Karl Berry wrote:
>
> With Zack's latest Python fixes, I was hoping to move towards an
> Automake release, but I find myself stymied by apparently random and
> unreproducible test failures. I haven't exhausted every conceivable
> avenue yet, but I thought I would write in hopes that others (Zack, past
> Automake developers, anyone else ...) could give it a try, and/or have
> some insights.
>
> For me, running a parallel make check (with or without parallelizing the
> "internal" makes), or make distcheck, fails some tests, e.g., nodef,
> nodef2, testsuite-summary-reference-log. The exact tests that fail
> changes from run to run. Running the tests on their own succeeds. Ok, so
> it's something in the parallelism. But why? And how to debug?
>
> Nothing has changed in the tests. Nothing has changed in the automake
> infrastructure. Everything worked for me a few weeks ago. Furthermore,
> Jim ran make check with much more parallelism than my machine can
> muster, and everything succeeded for him. That was with:
>     make check TESTSUITEFLAGS=-j20
>
> Any ideas, directions, fixes, greatly appreciated. --thanks, karl.

Is there a way to ask your distribution's package manager which
upgrades/downgrades were done in the last N weeks?

It'd also be helpful to see the actual failures, although as Paul
notes, make --shuffle with latest non-released make could help
debugging.
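On the package-manager question, the usual ways to see what changed
recently look roughly like this (log locations vary by distribution;
these are common defaults, not checked against Karl's machine):

  # RPM-based systems (Fedora, RHEL, CentOS):
  rpm -qa --last | head -50     # packages sorted by install/upgrade time
  dnf history list              # or "yum history" on older releases

  # Debian/Ubuntu:
  grep -E ' (install|upgrade) ' /var/log/dpkg.log
  less /var/log/apt/history.log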
Re: make check(s) pre-release problems
On Wed, 2022-10-05 at 15:24 -0600, Karl Berry wrote:
> What troubles me most is that there's no obvious way to debug any
> test failure involving parallelism, since they go away with serial
> execution. Any ideas about how to determine what is going wrong in
> the parallel make? Any way to make parallel failures more
> reproducible?

I don't have any great ideas myself. In the new prerelease of GNU make
there's a --shuffle option which will randomize (or just reverse) the
order in which prerequisites are built. Often if you have a
timing-dependent failure, forcing the prerequisites to build in a
different order can make the failure more obvious.

In general, though, the best way to attack the issue is to try to
understand why the failure happens: what goes wrong that causes the
failure. If that can be understood then often we can envision a way
that parallel or "out of order" builds might cause that problem.

Alternatively since you seem to have relatively well-defined "good" and
"bad" commits you could use git bisect to figure out which commit
actually causes the problem (obviously you need to be able to force the
failure, if not every time then at least often enough to detect a "bad"
commit). Maybe that will shed some light.

But I expect there's nothing here you haven't already thought of
yourself :( :).
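For anyone who wants to try the new option, the usage in the current
prerelease looks like this (syntax as of the 4.3.9x snapshots; check
the release notes of the final release for the definitive form):

  make --shuffle -j12 check            # random prerequisite order each run
  make --shuffle=reverse -j12 check    # simply reverse the order
  make --shuffle=12345 -j12 check      # replay a specific random seed

Reversing the order is often the quickest way to expose an undeclared
dependency, and replaying a seed makes a random failure repeatable once
it has been seen.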
Re: make check(s) pre-release problems
On Wednesday 2022-10-05 23:24, Karl Berry wrote:
>
>What troubles me most is that there's no obvious way to debug any test
>failure involving parallelism, since they go away with serial execution.
>Any ideas about how to determine what is going wrong in the parallel
>make? Any way to make parallel failures more reproducible?

1. Throw more processes in the mix (make -jN with more-than-normal N)
   so that either
   - for each (single) process the "critical section" execution time
     goes up
   - for the whole job set, the total time spent in/around critical
     sections goes up

2. Determine which exact (sub-)program and syscall failed in what
   process in what job (strace), then construct a hypothesis around
   that failure.

3. Watch if any one job is somehow executed twice, or a file is
   written to concurrently:

     foo: foo.c foo.h
             ld -o foo ...
     foo.c foo.h: generate_from_somewhere

3b. Or a file is read and written to concurrently:

     %.o: %.c generate_version.h
             cc -o $@ $<
     foo: foo.o bar.o

    (and foo.c, bar.c, nongenerated, have a #include "version.h")

I've seen something like that in libtracefs commit
b64dc07ca44ccfed40eae8d345867fd938ce6e0e
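As a concrete starting point for item 2, one way to capture per-process
traces from a parallel run and then look for competing writers of a
shared file (the file names here are only examples):

  # -ff with -o writes one trace per process as make.trace.<pid>;
  # restricting to file-related syscalls keeps the output manageable.
  strace -f -ff -o make.trace -e trace=%file -s 256 make -j12 check
  # Which processes opened the suspect header for writing?
  grep -El 'open.*"version\.h".*(O_WRONLY|O_RDWR)' make.trace.*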
Re: make check(s) pre-release problems
    What version of GNU make are you using?

I've been using make 4.3 since its release in 2020. No changes, no
prereleases. I'm afraid the problem, whatever it is, is not that
simple :(.

What troubles me most is that there's no obvious way to debug any test
failure involving parallelism, since they go away with serial execution.
Any ideas about how to determine what is going wrong in the parallel
make? Any way to make parallel failures more reproducible?

Right now, all I know is "some random test(s) fail(s)". Not helpful.
The test logs show the set -x execution of the test, so the actual
command that fails can be seen. I have keep_testdirs=yes, of course,
but then running the command by hand in the shell after the failure
often does not reproduce the problem. Something else is going on, but
my imagination about what that might be has failed so far :(.

Argh. --thanks, karl.
Re: make check(s) pre-release problems
On Wed, 2022-10-05 at 05:27 +0200, Jan Engelhardt wrote:
> > So what the heck? [...] These always worked before. But now, Jim
> > gets hundreds of failures with the first
>
> Make was in the news recently, maybe that's the component to
> switch out for an earlier version?
>
> 7ad2593b Support implementing the jobserver using named pipes

What version of GNU make are you using? There has been no new release
of GNU make (yet). If you're running a prerelease version of GNU make,
you might consider trying it with the last official release (4.3) to
see if it helps.

Certainly if something is failing with the GNU make prerelease that
would be good to know, however.
Re: make check(s) pre-release problems
On Wednesday 2022-10-05 00:58, Karl Berry wrote:
>
>Nothing has changed in the tests. Nothing has changed in the automake
>infrastructure. Everything worked for me a few weeks ago. Furthermore,
>Jim ran make check with much more parallelism than my machine can
>muster, and everything succeeded for him. That was with:
>    make check TESTSUITEFLAGS=-j20
>
>So what the heck? [...] These always worked before. But now, Jim gets
>hundreds of failures with the first

Make was in the news recently, maybe that's the component to switch out
for an earlier version?

7ad2593b Support implementing the jobserver using named pipes
make check(s) pre-release problems
With Zack's latest Python fixes, I was hoping to move towards an
Automake release, but I find myself stymied by apparently random and
unreproducible test failures. I haven't exhausted every conceivable
avenue yet, but I thought I would write in hopes that others (Zack,
past Automake developers, anyone else ...) could give it a try, and/or
have some insights.

For me, running a parallel make check (with or without parallelizing
the "internal" makes), or make distcheck, fails some tests, e.g.,
nodef, nodef2, testsuite-summary-reference-log. The exact tests that
fail changes from run to run. Running the tests on their own succeeds.
Ok, so it's something in the parallelism. But why? And how to debug?

Nothing has changed in the tests. Nothing has changed in the automake
infrastructure. Everything worked for me a few weeks ago. Furthermore,
Jim ran make check with much more parallelism than my machine can
muster, and everything succeeded for him. That was with:
    make check TESTSUITEFLAGS=-j20

So what the heck?

Perhaps easier to debug: there are two targets to be run before making
a release, check-no-trailing-backslash-in-recipes and check-cc-no-c-o,
to try to ensure no reversion wrt these features. A special shell and
compiler are configured, respectively (shell scripts that check the
behavior). These always worked before. But now, Jim gets hundreds of
failures with the first (didn't have time to try the second). I get a
couple, with both, instead of hundreds. Again the failing tests vary.
In this case, they fail for me even without parallelism. So what the
heck x 2?

Any ideas, directions, fixes, greatly appreciated. --thanks, karl.
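One way to make the intermittent failures less random is to hammer the
affected tests repeatedly under the parallel harness and count how
often they fail. The test paths and -j level below are guesses about
the local setup, and it may turn out that the load of the full suite is
needed to trigger the problem, in which case drop the TESTS= override
and rerun everything each time.

  n=0 fails=0
  while test $n -lt 20; do
    n=`expr $n + 1`
    make check TESTS='t/nodef.sh t/nodef2.sh t/testsuite-summary-reference-log.sh' \
         TESTSUITEFLAGS=-j8 >check-run-$n.log 2>&1 || fails=`expr $fails + 1`
  done
  echo "$fails of $n runs failed; see check-run-*.log"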