Paul Eggert wrote: > I found and fixed some bugs and simplified some code > while trying (and so far, failing) to fix the hang reported at the end of > <http://lists.gnu.org/archive/html/bug-coreutils/2010-12/msg00023.html>. > I installed the following, and plan to look into that hang some more. > > * src/sort.c (uintptr): New type. > (enum procstate, struct procnode, update_proc): Remove. > (proctab_hasher, proctab_comparator, register_proc, wait_proc): > (reap_some): The proctab is now simply a hash of process-IDs > rather than of pointers to objects with reference counts and > states; this is smaller and faster and easier to understand. > (nprocs): Now pid_t, not size_t, since one cannot have more than > PID_MAX children. > (reap): If the argument is -1, wait; if 0 (a new value), do not. > Delete pid from proctab as needed. Ignore children that are not > in proctab, as they are from the program that exec'ed us and are > irrelevant to our success or failure. > (delete_proc, reap_all): New functions. > (open_temp): Register the child. > (sort): Clean up all children afterwards; without this patch, > 'sort' sometimes missed failures in children due to race conditions. > * tests/Makefile.am (TESTS): Add misc/sort-compress-proc. > * tests/misc/sort-compress-proc: New file, to test for the > bugs fixed above.
Nice clean-up and fix. Thanks for the patch. I'm glad to know that you're still on it. ... > diff --git a/tests/misc/sort-compress-proc b/tests/misc/sort-compress-proc ... > +. "${srcdir=.}/init.sh"; path_prepend_ ../src > +print_ver_ sort > +expensive_ > + > +# Ensure that $TMPDIR is valid. > +if test -d /tmp && test -w /tmp > +then TMPDIR=/tmp > +else TMPDIR=. > +fi > +export TMPDIR You may remove the TMPDIR-manipulating code above since tests/check.mk already ensures that TMPDIR is a directory: TESTS_ENVIRONMENT = \ . $(srcdir)/lang-default; \ tmp__=$$TMPDIR; test -d "$$tmp__" || tmp__=.; \ . $(srcdir)/envvar-check; \ TMPDIR=$$tmp__; export TMPDIR; \ But as you can see, it does not ensure it's writable, while your code did, so moving your conjunct to check.mk would be an improvement. ... > +# Give compression subprocesses time to react when 'sort' exits. > +# Otherwise, under NFS, when 'sort' unlinks the temp files and they > +# are renamed to .nfsXXXX instead of being removed, the parent cleanup > +# of this directory will fail because the files are still open. > +sleep 1 It'd be a shame to make everyone sleep even one second for NFS. And on a heavily loaded system, 1 second might not be enough. Can you formulate that as a loop that sleeps say 0.1 seconds at a time for up to 5 seconds and that checks whether it's safe to exit? > +Exit $fail