Re: _AM_FILESYSTEM_TIMESTAMP_RESOLUTION incorrect result (was Re: make check(s) pre-release problems)

2023-06-30 Thread Karl Berry
Hi Zack,

Date: Fri, 07 Oct 2022 11:35:41 -0400
From: Zack Weinberg 

[...]
the filesystem timestamp resolution was incorrectly detected:

Your analysis sounds plausible to me, but it's not obvious to me how
best to fix it. ls --full-time or stat may not be available.

Maybe just do the test again if the first ls -t "succeeds"; i.e., if the
first 0.1 sec sleep straddles a second boundary, the second presumably won't?
Or is there a better way?
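
Something in this direction is what I'm imagining -- plain sh rather than
actual m4, and the candidate resolutions, the conftest names, and the ls -t
probe are just my guesses at what sanity.m4 does:

  # Only trust a candidate resolution if the probe succeeds twice in a
  # row; a single sleep that happens to straddle a second boundary then
  # can't fool the detection on its own.
  am_try_res () {
    rm -f conftest.ts1 conftest.ts2
    touch conftest.ts1
    sleep "$1"
    touch conftest.ts2
    test "`ls -t conftest.ts1 conftest.ts2 | head -n 1`" = conftest.ts2
  }
  for res in 0.1 1 2; do
    if am_try_res $res && am_try_res $res; then
      break
    fi
  done
  rm -f conftest.ts1 conftest.ts2
  echo "assuming filesystem timestamp resolution: $res"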

I wonder if you could easily make a patch for it (in m4/sanity.m4), if
you already have a setup on a second-granularity fs? (I don't.)
--thanks, karl.

[...]
The filesystem I'm working in only records timestamps at second
granularity. (I don't know why ls is printing .1 instead of
.0 but it's always .1.)

$ touch A && sleep 0.1 && touch B && ls --full-time -t A B
-rw-r--r-- 1 zweinber users 0 2022-10-07 11:20:05.1 -0400 A
-rw-r--r-- 1 zweinber users 0 2022-10-07 11:20:05.1 -0400 B

I *think* this is a bug in _AM_FILESYSTEM_TIMESTAMP_RESOLUTION where, if
it starts looping at *just the wrong time*, the first 0.1-second sleep
will straddle a second boundary and we'll break out of the loop
immediately and think we have 0.1-second timestamp resolution.
zw



Re: make check(s) pre-release problems

2022-10-11 Thread Karl Berry
    I actually wonder if your sudden "parallelism" failure could be somehow
    linked to an update of bash, similar to mine?

Good idea, but my bash hasn't changed ... I don't doubt there would be
plenty more failures with the new SHLVL change (any such change seems
like a terrible idea, but oh well). Well, more fun in the future. -k



Re: make check(s) pre-release problems

2022-10-11 Thread Frederic Berat
Hello,

I don't know if that will help, or if that is completely unrelated, but I'm
currently stumbling into a weird issue while working on a new package
release for autoconf on Fedora: about 200 tests are now failing, all
related to aclocal checks.
My current investigation shows that it would be related to a bash update
from 5.1.x to 5.2.x, which seems to have changed the behavior of the "SHLVL"
shell variable.
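
For what it's worth, a quick way to eyeball what a given bash does with SHLVL
across nested non-interactive shells (just an observation command, nothing
Fedora- or autoconf-specific):

  bash -c 'echo outer SHLVL=$SHLVL; bash -c "echo inner SHLVL=\$SHLVL"'

Comparing the numbers printed under bash 5.1.x and 5.2.x on the same machine
shows whether the numbering changed there.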

I actually wonder if your sudden "parallelism" failure could be somehow
linked to an update of bash, similar to mine?

Fred.

On Fri, Oct 7, 2022 at 6:11 PM Zack Weinberg  wrote:

> On Thu, Oct 6, 2022, at 4:25 PM, Karl Berry wrote:
> > No errors on RHEL7+autoconf2.71
> >
> > Puzzling. Can you easily try RHEL8 or one of its derivatives?
> > It surprises me that that is the culprit, but it seems possible.
>
> Unfortunately, no.  CMU is mostly an Ubuntu shop these days.  It's only
> dumb luck that I happen to have access to a machine that hasn't had a major
> system upgrade since 2012 (and with my day-job hat on I'm actively trying
> to *get* it upgraded -- to Ubuntu.)
>
> zw
>
>


Re: make check(s) pre-release problems

2022-10-11 Thread Zack Weinberg
Please don't top-post on this mailing list.

On Tue, Oct 11, 2022, at 12:15 PM, Frederic Berat wrote:
> On Fri, Oct 7, 2022 at 6:11 PM Zack Weinberg  wrote:
>> On Thu, Oct 6, 2022, at 4:25 PM, Karl Berry wrote:
>>> No errors on RHEL7+autoconf2.71
>>>
>>> Puzzling. Can you easily try RHEL8 or one of its derivatives?
>>> It surprises me that that is the culprit, but it seems possible.
>>
>> Unfortunately, no.
>
> I don't know if that will help, or if that is completely unrelated,
> but I'm currently stumbling into a weird issue while working on a
> new package release for autoconf on Fedora: about 200 tests are now
> failing, all related to aclocal checks.
>
> My current investigation shows that it would be related to a bash
> update from 5.1.x to 5.2.x, which seems to have changed the behavior
> of the "SHLVL" shell variable.

This is a known issue, fixed on Autoconf development trunk, see
https://git.savannah.gnu.org/cgit/autoconf.git/commit/?id=412166e185c00d6eacbe67dfcb0326f622ec4020

I intend to make a bugfix release of Autoconf in the near future.

> I actually wonder if your sudden "parallelism" failure could be
> somehow linked to an update of bash, similar to mine?

It's certainly possible.

zw



_AM_FILESYSTEM_TIMESTAMP_RESOLUTION incorrect result (was Re: make check(s) pre-release problems)

2022-10-07 Thread Zack Weinberg
On Thu, Oct 6, 2022, at 4:19 PM, Zack Weinberg wrote:
> On Thu, Oct 6, 2022, at 1:04 PM, Zack Weinberg wrote:
>> On 2022-10-04 6:58 PM, Karl Berry wrote:
>>> Perhaps easier to debug: there are two targets to be run before making a
>>> release, check-no-trailing-backslash-in-recipes and check-cc-no-c-o,
>>> to try to ensure no reversion wrt these features. A special shell and
>>> compiler are configured, respectively (shell scripts that check the
>>> behavior).
>>
>> I'm running these targets now and will report what I get.
>
> No errors on RHEL7+autoconf2.71 (same environment I used for the Python fixes)
> from a serial "make check-no-trailing-backslash-in-recipes".  The other one is
> running now.

One failure from a serial "make check-cc-no-c-o":

FAIL: t/aclocal-autoconf-version-check
==

Running from installcheck: no
Test Protocol: none
...
configure: error: newly created file is older than distributed files!
Check your system clock
make: *** [config.status] Error 1

This doesn't appear to have anything to do with "make check-cc-no-c-o" mode;
the problem is that the filesystem timestamp resolution was incorrectly
detected:

configure:1965: checking whether sleep supports fractional seconds
configure:1979: result: true
configure:1982: checking the filesystem timestamp resolution
configure:2020: result: 0.1
configure:2024: checking whether build environment is sane
configure:2079: error: newly created file is older than distributed files!
Check your system clock

The filesystem I'm working in only records timestamps at second granularity. (I 
don't know why ls is printing .1 instead of .0 but it's always 
.1.)

$ touch A && sleep 0.1 && touch B && ls --full-time -t A B
-rw-r--r-- 1 zweinber users 0 2022-10-07 11:20:05.1 -0400 A
-rw-r--r-- 1 zweinber users 0 2022-10-07 11:20:05.1 -0400 B
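
(Where GNU stat is available, the mtimes can also be read directly, which
sidesteps ls's formatting entirely; %y is the full-resolution modification
time:)

$ stat -c '%n %y' A B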

I *think* this is a bug in _AM_FILESYSTEM_TIMESTAMP_RESOLUTION where, if it 
starts looping at *just the wrong time*, the first 0.1-second sleep will 
straddle a second boundary and we'll break out of the loop immediately and 
think we have 0.1-second timestamp resolution.
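
Here's a crude way to see the race from the shell (a demonstration loop I'm
sketching, not the macro's code); on a 1-second filesystem roughly one probe
in ten should straddle a boundary:

  n=0 hits=0
  while test $n -lt 50; do
    touch A && sleep 0.1 && touch B
    # ls -t lists the genuinely newer file first; with equal timestamps
    # it has no reason to put B first, so a "hit" means the 0.1s sleep
    # crossed a second boundary.
    if test "`ls -t A B | head -n 1`" = B; then
      hits=`expr $hits + 1`
    fi
    n=`expr $n + 1`
  done
  echo "$hits of $n probes looked like 0.1s resolution"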

zw



Re: make check(s) pre-release problems

2022-10-07 Thread Zack Weinberg
On Thu, Oct 6, 2022, at 4:25 PM, Karl Berry wrote:
> No errors on RHEL7+autoconf2.71
>
> Puzzling. Can you easily try RHEL8 or one of its derivatives?
> It surprises me that that is the culprit, but it seems possible.

Unfortunately, no.  CMU is mostly an Ubuntu shop these days.  It's only dumb 
luck that I happen to have access to a machine that hasn't had a major system 
upgrade since 2012 (and with my day-job hat on I'm actively trying to *get* it 
upgraded -- to Ubuntu.)

zw



Re: make check(s) pre-release problems

2022-10-07 Thread Jim Meyering
On Thu, Oct 6, 2022 at 1:28 PM Karl Berry  wrote:
>
> No errors on RHEL7+autoconf2.71
>
> Puzzling. Can you easily try RHEL8 or one of its derivatives?
> It surprises me that that is the culprit, but it seems possible.
>
> I'm using autoconf-2.71, make-4.3, etc., compiled from source, but am
> using the OS-provided coreutils. I think I'll try compiling that from
> source.

My problem, at least on F36, was that I'd been using a version of GNU
make I probably built from git around July 11, 2021(!) -- "-v" reports
4.3.90, which is not helpful; I would have preferred to know the
commit.
Once I installed the official 4.3.90, all of HACKING's
pre-release commands pass for me:
make bootstrap
make -j12 check keep_testdirs=yes
make maintainer-check
make -j12 distcheck   # regular distcheck
make -j12 distcheck AM_TESTSUITE_MAKE="make -j$j"  # parallelize makes
make -j12 check-no-trailing-backslash-in-recipes
make -j12 check-cc-no-c-o

FTR, using autoconf (GNU Autoconf) 2.72a.57-8b5e2



Re: make check(s) pre-release problems

2022-10-06 Thread Karl Berry
    No errors on RHEL7+autoconf2.71

Puzzling. Can you easily try RHEL8 or one of its derivatives?
It surprises me that that is the culprit, but it seems possible.

I'm using autoconf-2.71, make-4.3, etc., compiled from source, but am
using the OS-provided coreutils. I think I'll try compiling that from
source.

Thanks to everyone for the suggestions and help. -k




Re: make check(s) pre-release problems

2022-10-06 Thread Zack Weinberg
On Thu, Oct 6, 2022, at 1:04 PM, Zack Weinberg wrote:
> On 2022-10-04 6:58 PM, Karl Berry wrote:
>> Perhaps easier to debug: there are two targets to be run before making a
>> release, check-no-trailing-backslash-in-recipes and check-cc-no-c-o,
>> to try to ensure no reversion wrt these features. A special shell and
>> compiler are configured, respectively (shell scripts that check the
>> behavior).
>
> I'm running these targets now and will report what I get.

No errors on RHEL7+autoconf2.71 (same environment I used for the Python fixes)
from a serial "make check-no-trailing-backslash-in-recipes".  The other one is
running now.

zw



Re: make check(s) pre-release problems

2022-10-06 Thread Zack Weinberg

On 2022-10-04 6:58 PM, Karl Berry wrote:

> With Zack's latest Python fixes, I was hoping to move towards an
> Automake release, but I find myself stymied by apparently random and
> unreproducible test failures. I haven't exhausted every conceivable
> avenue yet, but I thought I would write in hopes that others (Zack, past
> Automake developers, anyone else ...) could give it a try, and/or have
> some insights.
>
> For me, running a parallel make check (with or without parallelizing the
> "internal" makes), or make distcheck, fails some tests, e.g., nodef,
> nodef2, testsuite-summary-reference-log. The exact tests that fail
> changes from run to run. Running the tests on their own succeeds. Ok, so
> it's something in the parallelism. But why? And how to debug?


I can't reproduce this problem myself, but my first thought is that some
of the tests, when run concurrently, could be overwriting each other's
files somehow.  I can think of two ways to investigate that hypothesis:
look for tests that write files outside a directory dedicated to that
test, and, after a failed test run, look for files that are corrupted,
then try to figure out which tests would be stomping on those files.
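
A rough way to check the second idea after a failed parallel run might be
something like this (only a sketch; the '*.dir' pattern is my guess at where
each test keeps its work area):

  touch .check-stamp
  make -j12 check keep_testdirs=yes || true
  # anything modified during the run but outside a per-test directory is
  # a candidate for being shared (and therefore clobbered) between tests
  find . -newer .check-stamp -type f ! -path '*.dir/*' ! -name '*.log' -print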


> Perhaps easier to debug: there are two targets to be run before making a
> release, check-no-trailing-backslash-in-recipes and check-cc-no-c-o,
> to try to ensure no reversion wrt these features. A special shell and
> compiler are configured, respectively (shell scripts that check the
> behavior).


I'm running these targets now and will report what I get.

zw



Re: make check(s) pre-release problems

2022-10-06 Thread Sam James


> On 4 Oct 2022, at 23:58, Karl Berry  wrote:
> 
> With Zack's latest Python fixes, I was hoping to move towards an
> Automake release, but I find myself stymied by apparently random and
> unreproducible test failures. I haven't exhausted every conceivable
> avenue yet, but I thought I would write in hopes that others (Zack, past
> Automake developers, anyone else ...) could give it a try, and/or have
> some insights.
> 
> For me, running a parallel make check (with or without parallelizing the
> "internal" makes), or make distcheck, fails some tests, e.g., nodef,
> nodef2, testsuite-summary-reference-log. The exact tests that fail
> changes from run to run. Running the tests on their own succeeds. Ok, so
> it's something in the parallelism. But why? And how to debug?
> 
> Nothing has changed in the tests. Nothing has changed in the automake
> infrastructure. Everything worked for me a few weeks ago. Furthermore,
> Jim ran make check with much more parallelism than my machine can
> muster, and everything succeeded for him. That was with:
>  make check TESTSUITEFLAGS=-j20
> 
> Any ideas, directions, fixes, greatly appreciated. --thanks, karl.
> 

Is there a way to ask your distribution's package manager
which upgrades/downgrades were done in the last N weeks?

It'd also be helpful to see the actual failures, although
as Paul notes, make --shuffle with latest non-released
make could help debugging.




Re: make check(s) pre-release problems

2022-10-06 Thread Paul Smith
On Wed, 2022-10-05 at 15:24 -0600, Karl Berry wrote:
> What troubles me most is that there's no obvious way to debug any
> test failure involving parallelism, since they go away with serial
> execution.  Any ideas about how to determine what is going wrong in
> the parallel make?  Any way to make parallel failures more
> reproducible?

I don't have any great ideas myself.

In the new prerelease of GNU make there's a --shuffle option which will
randomize (or just reverse) the order in which prerequisites are built.
Often if you have a timing-dependent failure, forcing the prerequisites
to build in a different order can make the failure more obvious.
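
For example (option spellings as I remember them from the prerelease
documentation; none of this is in a released make yet):

  make --shuffle check             # random prerequisite order each run
  make --shuffle=reverse check     # simply reverse the stated order
  make --shuffle=1234 check        # replay a particular random ordering

When a shuffled run does fail, the seed it used is worth recording so the
same ordering can be replayed with --shuffle=SEED.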

In general, though, the best way to attack the issue is to try to
understand why the failure happens: what goes wrong that causes the
failure.  If that can be understood then often we can envision a way
that parallel or "out of order" builds might cause that problem.

Alternatively since you seem to have relatively well-defined "good" and
"bad" commits you could use git bisect to figure out which commit
actually causes the problem (obviously you need to be able to force the
failure, if not every time then at least often enough to detect a "bad"
commit).  Maybe that will shed some light.
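
Something along these lines, with placeholders for the commit ids; repeating
the check a few times per step guards against a flaky pass being taken as
"good":

  git bisect start <bad-commit> <good-commit>
  # any per-commit rebuild steps go before the loop
  git bisect run sh -c 'for i in 1 2 3; do make -j12 check || exit 1; done'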

But I expect there's nothing here you haven't already thought of
yourself :( :).



Re: make check(s) pre-release problems

2022-10-06 Thread Jan Engelhardt


On Wednesday 2022-10-05 23:24, Karl Berry wrote:
>
>What troubles me most is that there's no obvious way to debug any test
>failure involving parallelism, since they go away with serial execution.
>Any ideas about how to determine what is going wrong in the parallel
>make?  Any way to make parallel failures more reproducible?

1. Throw more processes in the mix (make -jN with more-than-normal N)
   so that either
   - for each (single) process, the "critical section" execution time goes up
   - for the whole job set, the total time spent in/around critical sections
     goes up

2. Determine which exact (sub-)program and syscall failed in what process in
   what job (strace; an example invocation follows at the end of this list),
   then construct a hypothesis around that failure

3. Watch if any one job is somehow executed twice, or a file is written to
   concurrently

   foo: foo.c foo.h
           ld -o foo ...
   foo.c foo.h:
           generate_from_somewhere

3b. or a file is read and written to concurrently

   %.o: %.c
           generate_version.h
           cc -c -o $@ $<

   foo: foo.o bar.o

(and foo.c, bar.c, both non-generated, have a #include "version.h")
I've seen something like that in libtracefs commit 
b64dc07ca44ccfed40eae8d345867fd938ce6e0e
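
For point 2, an invocation in this spirit keeps the trace volume manageable
(one output file per process, file-related syscalls only):

  strace -f -ff -o check-trace -e trace=%file make -j12 check
  # afterwards, grep the check-trace.<pid> files for the path that the
  # failing test complained about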



Re: make check(s) pre-release problems

2022-10-05 Thread Karl Berry
    What version of GNU make are you using?

I've been using make 4.3 since its release in 2020. No changes, no
prereleases.  I'm afraid the problem, whatever it is, is not that simple :(.

What troubles me most is that there's no obvious way to debug any test
failure involving parallelism, since they go away with serial execution.
Any ideas about how to determine what is going wrong in the parallel
make?  Any way to make parallel failures more reproducible?

Right now, all I know is "some random test(s) fail(s)". Not helpful. The
test logs show the set -x execution of the test, so the actual command
that fails can be seen.  I have keep_testdirs=yes, of course, but then
running the command by hand in the shell after the failure often does
not reproduce the problem.
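
For the record, this is how I've been re-running just the previously failing
tests under the parallel harness (the .sh paths are from memory and may be
slightly off):

  make -j12 check TESTS='t/nodef.sh t/nodef2.sh' keep_testdirs=yes VERBOSE=yes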

Something else is going on, but my imagination about what that might be
has failed so far :(.  Argh.  --thanks, karl.



Re: make check(s) pre-release problems

2022-10-05 Thread Paul Smith
On Wed, 2022-10-05 at 05:27 +0200, Jan Engelhardt wrote:
> > So what the heck? [...] These always worked before. But now, Jim
> > gets hundreds of failures with the first
> 
> Make was in the news recently, maybe that's the component to
> switch out for an earlier version?
> 
> 7ad2593b Support implementing the jobserver using named pipes

What version of GNU make are you using?  There has been no new release
of GNU make (yet).  If you're running a prerelease version of GNU make,
you might consider trying it with the last official release (4.3) to
see if it helps.

Certainly if something is failing with the GNU make prerelease that
would be good to know, however.



Re: make check(s) pre-release problems

2022-10-04 Thread Jan Engelhardt


On Wednesday 2022-10-05 00:58, Karl Berry wrote:
>
>Nothing has changed in the tests. Nothing has changed in the automake
>infrastructure. Everything worked for me a few weeks ago. Furthermore,
>Jim ran make check with much more parallelism than my machine can
>muster, and everything succeeded for him. That was with:
>  make check TESTSUITEFLAGS=-j20
>
>So what the heck? [...] These always worked before. But now, Jim gets
>hundreds of failures with the first

Make was in the news recently, maybe that's the component to
switch out for an earlier version?


7ad2593b Support implementing the jobserver using named pipes



make check(s) pre-release problems

2022-10-04 Thread Karl Berry
With Zack's latest Python fixes, I was hoping to move towards an
Automake release, but I find myself stymied by apparently random and
unreproducible test failures. I haven't exhausted every conceivable
avenue yet, but I thought I would write in hopes that others (Zack, past
Automake developers, anyone else ...) could give it a try, and/or have
some insights.

For me, running a parallel make check (with or without parallelizing the
"internal" makes), or make distcheck, fails some tests, e.g., nodef,
nodef2, testsuite-summary-reference-log. The exact tests that fail
changes from run to run. Running the tests on their own succeeds. Ok, so
it's something in the parallelism. But why? And how to debug?

Nothing has changed in the tests. Nothing has changed in the automake
infrastructure. Everything worked for me a few weeks ago. Furthermore,
Jim ran make check with much more parallelism than my machine can
muster, and everything succeeded for him. That was with:
  make check TESTSUITEFLAGS=-j20

So what the heck?

Perhaps easier to debug: there are two targets to be run before making a
release, check-no-trailing-backslash-in-recipes and check-cc-no-c-o,
to try to ensure no reversion wrt these features. A special shell and
compiler are configured, respectively (shell scripts that check the
behavior).
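
(For anyone who hasn't looked at those targets: the configured compiler is a
small wrapper script, roughly in this spirit -- a sketch from memory, not the
actual script in our tree:)

  #!/bin/sh
  # Reject any compile that passes -c and -o together, so Makefile rules
  # relying on "cc -c -o" get flagged; otherwise hand off to the real
  # compiler.  REAL_CC is a made-up variable for this sketch, and only
  # the separate-argument form of -o is handled, for brevity.
  seen_c=no seen_o=no
  for arg in "$@"; do
    case $arg in
      -c) seen_c=yes ;;
      -o) seen_o=yes ;;
    esac
  done
  if test $seen_c = yes && test $seen_o = yes; then
    echo "cc wrapper: -c and -o used together" >&2
    exit 1
  fi
  exec ${REAL_CC-gcc} "$@"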

These always worked before. But now, Jim gets hundreds of failures with
the first (didn't have time to try the second). I get a couple, with
both, instead of hundreds. Again the failing tests vary. In this case,
they fail for me even without parallelism.

So what the heck x 2?

Any ideas, directions, fixes, greatly appreciated. --thanks, karl.