Ralf Wildenhues <Ralf.Wildenhues <at> gmx.de> writes: Hi Ralf,
> Thank you very much for working on this! And thanks for your review. If nothing else, I've learned a lot more about named pipes through this ordeal. > > > It doesn't reliably fix things for cygwin 1.5 [... but there is hope > > for 1.7 ... ] > > Thanks for investigating. Do we try to cater to cygwin 1.5 users > somehow, or just document that they shouldn't try parallel autotest? Cygwin 1.5 development is dead, but it may be a couple years before it is no longer an actively used configuration (it is the last version of cygwin that runs on Windows 98, for example). Part of me is okay with the idea of just documenting that cygwin 1.5 users should not expect parallel autotest to work. Or we could add this code, which so far appears to reliably sniff out the cygwin 1.5 bug without leaving a hung process on either cygwin or on proper systems (with cygwin, the echo occurs because the exec didn't block): mkfifo fifo test "x`(exec 5>fifo; echo oops)& sleep 1; (kill -9 $! ) 2>/dev/null`" = x But it comes at the penalty of several forks and a non-parallel sleep all before starting the first test of any parallel test run (but at least we can factor it so that this test only occurs only if all other prerequisites of -j have passed). I have not yet come up with any less-invasive test that can reliably flush out cygwin 1.5 fifo mishandling. The verdict is still out on cgf being able to fix cygwin 1.7 to have named pipes obey POSIX, but I'm hoping it works out (his latest comment is quite telling: | Just an update: I now have a headache. Thanks again Microsoft for | making it all so complicated. This is YA example of a feature (Windows | Named Pipes) which behave 98% of the way you'd expect it to work. The | 2% is the killer. http://cygwin.com/ml/cygwin/2009-07/msg00274.html) > > > Does this patch look correct? Should I go ahead and apply it? > > 1) > AFAICS this has a race between the read open and the write open/exec of > the fifo: the master can spawn 2 test group processes and read open/exec > the fifo before any of the tests have write opened it. > > This is no problem however: the scheduler should eventually start one of > the tests, which then write opens the fifo, which allows the master to > continue. The POSIX rules for blocking when opening a named file are symmetric. So the way I see it, either: the test group(s) block waiting for the master or: the master blocks waiting for at least one test group Either way, there is no loss in this scheduler race. Both outcomes give the correct behavior (eventually, enough processes are scheduled so that both read and write sides are opened, and the block is removed). However, it does mean that if the scheduler favors the test groups over the master, that it is possible for ALL of the test groups unblock at once (whereas if the master were to blindly read open the fifo every iteration of the loop, then the parallel test groups would be a bit more staggered, because the master would unblock only the first test group). > > 2) > However, if, say, all test groups are killed before any can write open > the fifo, then the master may hang. This is not very likely, of course: > typically, the master will have received the same signal (shutdown, or > terminal hangup, for example), and that will take it out of the hang. > However, this is a newly introduced, albeit very minor, limitation. Yes, I agree that there is a narrow window for a race here. But thinking about it more, the only thing that I could come up with that might make a difference is to open yet another subshell background task prior to the $at_job_control_on, with the sole job of write opening the pipe before any of the test groups are created, and which is reaped later via $! rather than jobs, so that the master's process group will also own a write fifo fd. On the other hand, like you say, if a signal has been sent to each process group to kill all the job-controlled children, it is likely that the same signal will also be sent to the master. So I don't think it's worth worrying about. > > 3) > If OTOH the master never read opens/execs the fifo, then the test group > processes will hang in their write open of the fifo. This can happen > with your patch if the test for $at_stop_file is true very early. Also, > the reading back of the remaining tokens will then try to read from a > closed fd, leading to: > | ./testsuite: line 2232: 6: Bad file descriptor > > errors (you can reproduce this by removing `test -f "$at_stop_file" &&'). > A trivial fix for both issues is to move the test for the stop file > after the block that read opens the fifo. This move doesn't really > matter for the other use that $at_first has (which, BTW, is anyway a bit > not well defined in the parallel case, but that's really irrelevant). I thought about opening the read end sooner when writing my patch. In fact, as mentioned for item 1, I also thought about always blindly invoking 'exec AT_JOB_FIFO_FD<"$at_job_fifo"' immediately after spawning each test group rather than making it conditional on $at_first (which is reached only after n test groups had already been spawned). The only reason I didn't was because there was already a check of $at_first that was easy to exploit, and it seemed a waste of system calls to make the shell repeatedly reopen the same fifo onto the same fd. But your arguments about moving the open fifo to sooner are good, so I will gladly squash in your code motion. > > I haven't tested this patch on other systems yet, but hope to get to > that soon (that AIX box I still have feedback pending from was offline). > This shouldn't keep you from applying it though, with above fix; the > fact that we limit ourselves to bash and zsh is a big help here. You can run the parallel tests with other shells, you just have to explicitly ask for them: make TESTSUITEFLAGS='TEST_PARALLEL_AUTOTEST=:'. -- Eric Blake
