On 30/10/2025 06:00, Collin Funk wrote:
Hi Pádraig,
Pádraig Brady <[email protected]> writes:
On 29/10/2025 04:17, Collin Funk wrote:
Bruno Haible <[email protected]> writes:
The sort/sort-compress test fails in yesterday's CI (Oct 27)
and was OK a week ago (Oct 20). Most likely caused by commit
2b3eb795c85633a4b808df44d19a4a9be5976c10.
It fails on
- CentOS 7,
- macOS 13..26,
- FreeBSD 14.0,
- OpenBSD 7.7,
- Solaris 11.4.
Find attached the log files.
I think I see what is happening. Pádraig and I wanted to continue if
the
compress program could not be executed. However, we wrote it in a way
that depends on the return value of posix_spawn being an error. This
works on modern GNU/Linux because of clone, but not on other systems
using fork + exec. The child process has no way of returning ENOENT to
the parent process in posix_spawn if exec fails. If we wanted to do that
portably we would have to wait for the child and check that it's exit
status is 127, then fallback to an uncompressed file if so.
Probably best to restrict the test to modern GLIBC systems so.
To do that have a look at the skip logic in tests/seq/seq-long-double.sh
I don't think we should have 'sort' behave differently on GNU/Linux and
OpenBSD, for example. This NEWS entry:
'sort' will continue without compressing temporary files if the
program specified by --compress-program cannot be executed.
is only true on systems with posix_spawn implemented through clone or
something similar.
On GNU/Linux:
$ ./src/sort -S 1k --compress-program=missing Makefile | wc -l
sort: could not run compress program 'missing': No such file or directory
30278
On OpenBSD:
$ ./src/sort -S 1k --compress-program=missing Makefile
sort: close failed: /tmp/sortVFCThm: Broken pipe
One could fix that by checking WIFEXITED (pid) and
WEXITSTATUS (status) == 127, but that check is not perfect because
posix_spawn makes the child exit with 127 on any failure (*).
The 3 ideas I had were the following:
1. Call wait () with WNOHANG on the child process and check that
WIFEXITED() && WEXITSTATUS() == 127. This does not work since the child
might be in between fork'ing and exec'ing.
2. Call wait () without WNOHANG. This is obviously bad since it
blocks the parent process until each child finishes.
3. Wait for a write error and then check that WEXITSTATUS() == 127
before falling back to an uncompressed temporary file. This is bad since
we require having 2x more file descriptors open.
(*) For this reason, I am probably going to revert the 'timeout' change.
It is impossible for 'timeout' to check if it should exit with a status
of 126 or 127 (at least correctly).
The call to posix_spawn will succeed if a program cannot be found or if
the program can be found but is not executable. In both cases, the child
exits with a status of 127. However, 'timeout' should exit with a status
126 if the program exists but is not executable. The only way to check
for this is to search for the program in the parent and check it's
permissions, if it exists. This is incorrect since the directory entry
identified by 'timeout' might change, or have it's permissions change,
in between the child process calling exec() and the time 'timeout'
begins looking for it.
It is unfortunate that POSIX didn't standardize better error reporting
for posix_spawn. There is some discussion on their rational [2], but it
limits it's usability quite a lot, in my opinion...
Collin
[1] https://lists.gnu.org/archive/html/bug-gnulib/2025-10/msg00122.html
[2] https://pubs.opengroup.org/onlinepubs/9799919799/functions/posix_spawn.html
Note the conflation of exit statuses to 127 is not the case on glibc,
as posix_spawn() returns ENOENT appropriately there.
But yes, that is in contrast to the description in the posix_spawn man page
which says 127 is returned for any child process issues.
Also the general cross platform / emulated case is that
the conflation of exit status can occur.
Another edge case is the different fallbacks of execvp and posix_spawnp.
execvp falls back to retrying the exec with /bin/sh ... in the case
the original exec fails with ENOEXEC, while posix_spawnp does not.
For example:
$ printf '%s\n' 'exec zstd -1 "$@"' > zstd1 # note no shebang line
$ chmod a+x zstd1
$ timeout 0 ./zstd1 --version
*** Zstandard CLI (64-bit) v1.5.7, by Yann Collet ***
$ src/timeout 0 ./zstd1 --version
timeout: failed to run command ‘./zstd1’: Exec format error
$ echo $?
126
So that would be another reason not to use posix_spawn() in timeout.
Since we get little to no benefit due to timeout(1) not using
significant memory itself, I'd also be inclined to revert the change,
replacing with a comment as to why we're not using posix_spawn.
Note the above ENOEXEC fallback issue also impacts sort --compress,
but that's a less general use case so I don't think it's worth
worrying about or changing anything for. We could document that
change in behavior though in NEWS.
cheers,
Padraig