Hi Pádraig, Pádraig Brady <[email protected]> writes:
> On 29/10/2025 04:17, Collin Funk wrote: >> Bruno Haible <[email protected]> writes: >> >>> The sort/sort-compress test fails in yesterday's CI (Oct 27) >>> and was OK a week ago (Oct 20). Most likely caused by commit >>> 2b3eb795c85633a4b808df44d19a4a9be5976c10. >>> >>> It fails on >>> - CentOS 7, >>> - macOS 13..26, >>> - FreeBSD 14.0, >>> - OpenBSD 7.7, >>> - Solaris 11.4. >>> >>> Find attached the log files. >> I think I see what is happening. Pádraig and I wanted to continue if >> the >> compress program could not be executed. However, we wrote it in a way >> that depends on the return value of posix_spawn being an error. This >> works on modern GNU/Linux because of clone, but not on other systems >> using fork + exec. The child process has no way of returning ENOENT to >> the parent process in posix_spawn if exec fails. If we wanted to do that >> portably we would have to wait for the child and check that it's exit >> status is 127, then fallback to an uncompressed file if so. > > Probably best to restrict the test to modern GLIBC systems so. > To do that have a look at the skip logic in tests/seq/seq-long-double.sh I don't think we should have 'sort' behave differently on GNU/Linux and OpenBSD, for example. This NEWS entry: 'sort' will continue without compressing temporary files if the program specified by --compress-program cannot be executed. is only true on systems with posix_spawn implemented through clone or something similar. On GNU/Linux: $ ./src/sort -S 1k --compress-program=missing Makefile | wc -l sort: could not run compress program 'missing': No such file or directory 30278 On OpenBSD: $ ./src/sort -S 1k --compress-program=missing Makefile sort: close failed: /tmp/sortVFCThm: Broken pipe One could fix that by checking WIFEXITED (pid) and WEXITSTATUS (status) == 127, but that check is not perfect because posix_spawn makes the child exit with 127 on any failure (*). The 3 ideas I had were the following: 1. Call wait () with WNOHANG on the child process and check that WIFEXITED() && WEXITSTATUS() == 127. This does not work since the child might be in between fork'ing and exec'ing. 2. Call wait () without WNOHANG. This is obviously bad since it blocks the parent process until each child finishes. 3. Wait for a write error and then check that WEXITSTATUS() == 127 before falling back to an uncompressed temporary file. This is bad since we require having 2x more file descriptors open. (*) For this reason, I am probably going to revert the 'timeout' change. It is impossible for 'timeout' to check if it should exit with a status of 126 or 127 (at least correctly). The call to posix_spawn will succeed if a program cannot be found or if the program can be found but is not executable. In both cases, the child exits with a status of 127. However, 'timeout' should exit with a status 126 if the program exists but is not executable. The only way to check for this is to search for the program in the parent and check it's permissions, if it exists. This is incorrect since the directory entry identified by 'timeout' might change, or have it's permissions change, in between the child process calling exec() and the time 'timeout' begins looking for it. It is unfortunate that POSIX didn't standardize better error reporting for posix_spawn. There is some discussion on their rational [2], but it limits it's usability quite a lot, in my opinion... Collin [1] https://lists.gnu.org/archive/html/bug-gnulib/2025-10/msg00122.html [2] https://pubs.opengroup.org/onlinepubs/9799919799/functions/posix_spawn.html
