Re: parallel autotest revisited

Eric Blake Tue, 07 Jul 2009 13:34:00 -0700

Ralf Wildenhues <Ralf.Wildenhues <at> gmx.de> writes:

Hi Ralf,


> Thank you very much for working on this!

And thanks for your review.  If nothing else, I've learned a lot more about 
named pipes through this ordeal.

> 
> > It doesn't reliably fix things for cygwin 1.5 [... but there is hope
> > for 1.7 ... ]
> 
> Thanks for investigating.  Do we try to cater to cygwin 1.5 users
> somehow, or just document that they shouldn't try parallel autotest?

Cygwin 1.5 development is dead, but it may be a couple years before it is no 
longer an actively used configuration (it is the last version of cygwin that 
runs on Windows 98, for example).  Part of me is okay with the idea of just 
documenting that cygwin 1.5 users should not expect parallel autotest to work.  
Or we could add this code, which so far appears to reliably sniff out the 
cygwin 1.5 bug without leaving a hung process on either cygwin or on proper 
systems (with cygwin, the echo occurs because the exec didn't block):

mkfifo fifo
test "x`(exec 5>fifo; echo oops)& sleep 1; (kill -9 $! ) 2>/dev/null`" = x

But it comes at the penalty of several forks and a non-parallel sleep all 
before starting the first test of any parallel test run (but at least we can 
factor it so that this test only occurs only if all other prerequisites of -j 
have passed).

I have not yet come up with any less-invasive test that can reliably flush out 
cygwin 1.5 fifo mishandling.

The verdict is still out on cgf being able to fix cygwin 1.7 to have named 
pipes obey POSIX, but I'm hoping it works out (his latest comment is quite 
telling:
| Just an update: I now have a headache.  Thanks again Microsoft for
| making it all so complicated.  This is YA example of a feature (Windows
| Named Pipes) which behave 98% of the way you'd expect it to work.  The
| 2% is the killer.
http://cygwin.com/ml/cygwin/2009-07/msg00274.html)

> 
> > Does this patch look correct?  Should I go ahead and apply it?
> 
> 1)
> AFAICS this has a race between the read open and the write open/exec of
> the fifo: the master can spawn 2 test group processes and read open/exec
> the fifo before any of the tests have write opened it.
> 
> This is no problem however: the scheduler should eventually start one of
> the tests, which then write opens the fifo, which allows the master to
> continue.

The POSIX rules for blocking when opening a named file are symmetric.  So the 
way I see it, either:

the test group(s) block waiting for the master

or:

the master blocks waiting for at least one test group

Either way, there is no loss in this scheduler race.  Both outcomes give the 
correct behavior (eventually, enough processes are scheduled so that both read 
and write sides are opened, and the block is removed).  However, it does mean 
that if the scheduler favors the test groups over the master, that it is 
possible for ALL of the test groups unblock at once (whereas if the master were 
to blindly read open the fifo every iteration of the loop, then the parallel 
test groups would be a bit more staggered, because the master would unblock 
only the first test group).

> 
> 2)
> However, if, say, all test groups are killed before any can write open
> the fifo, then the master may hang.  This is not very likely, of course:
> typically, the master will have received the same signal (shutdown, or
> terminal hangup, for example), and that will take it out of the hang.
> However, this is a newly introduced, albeit very minor, limitation.

Yes, I agree that there is a narrow window for a race here.  But thinking about 
it more, the only thing that I could come up with that might make a difference 
is to open yet another subshell background task prior to the 
$at_job_control_on, with the sole job of write opening the pipe before any of 
the test groups are created, and which is reaped later via $! rather than jobs, 
so that the master's process group will also own a write fifo fd.  On the other 
hand, like you say, if a signal has been sent to each process group to kill all 
the job-controlled children, it is likely that the same signal will also be 
sent to the master.  So I don't think it's worth worrying about.

> 
> 3)
> If OTOH the master never read opens/execs the fifo, then the test group
> processes will hang in their write open of the fifo.  This can happen
> with your patch if the test for $at_stop_file is true very early.  Also,
> the reading back of the remaining tokens will then try to read from a
> closed fd, leading to:
> | ./testsuite: line 2232: 6: Bad file descriptor
> 
> errors (you can reproduce this by removing `test -f "$at_stop_file" &&').
> A trivial fix for both issues is to move the test for the stop file
> after the block that read opens the fifo.  This move doesn't really
> matter for the other use that $at_first has (which, BTW, is anyway a bit
> not well defined in the parallel case, but that's really irrelevant).

I thought about opening the read end sooner when writing my patch.  In fact, as 
mentioned for item 1, I also thought about always blindly invoking 'exec 
AT_JOB_FIFO_FD<"$at_job_fifo"' immediately after spawning each test group 
rather than making it conditional on $at_first (which is reached only after n 
test groups had already been spawned).  The only reason I didn't was because 
there was already a check of $at_first that was easy to exploit, and it seemed 
a waste of system calls to make the shell repeatedly reopen the same fifo onto 
the same fd.  But your arguments about moving the open fifo to sooner are good, 
so I will gladly squash in your code motion.

> 
> I haven't tested this patch on other systems yet, but hope to get to
> that soon (that AIX box I still have feedback pending from was offline).
> This shouldn't keep you from applying it though, with above fix; the
> fact that we limit ourselves to bash and zsh is a big help here.

You can run the parallel tests with other shells, you just have to explicitly 
ask for them: make TESTSUITEFLAGS='TEST_PARALLEL_AUTOTEST=:'.

-- 
Eric Blake

Re: parallel autotest revisited

Reply via email to