On 25/10/2025 04:17, Collin Funk wrote:
Pádraig Brady <[email protected]> writes:
Here is the behavior without your proposed change:
$ ./src/sort -S 1K --compress-program=missing input | wc -l
sort: could not run compress program 'missing': No such file or directory
0
$ ./src/sort -S 1K --compress-program=missing input > /dev/null
$ echo $?
2
Here is the behavior with your proposed change:
$ ./src/sort -S 1K --compress-program=missing input | wc -l
sort: could not run compress program 'missing': No such file or directory
100000
$ ./src/sort -S 1K --compress-program=missing input > /dev/null
sort: could not run compress program 'missing': No such file or directory
$ echo $?
0
That seems reasonable to me. But should the exit status be changed?
If
the warning is printed to standard error before the sorted lines are
written to standard output, I worry it will never be seen.
Yes that's a fair point re writing to stderr while exiting success.
Though in this case posix_spawn() is conflating many possible errors,
essentially all of which are non fatal, though you would want some
indication of them since they're non usual.
Given we minimize the duplicate errors I'm OK with the warnings.
You want some indication if the use typos the --compress-program at least.
I suppose you could special case certain errnos, but in this case
I think it might be ok to just warn.
Technically this is a change in behavior, so I suppose one would need
to change coreutils.texi describe --compress-program more accurately.
Alternatively one could warn with EAGAIN/ENOMEM and fail otherwise,
but I'm inclined to keep the logic and change the docs,
so that we're consistent over all errnos.
I thought about setting *pid to -1 at the start of pipe_child and then
checking it in the calling function to see if fork succeeded. That was
my idea to preserve the current behavior. But that does not work for the
glibc implementation which uses clone, since there is no separate fork
and exec call.
The larger issue with that idea is that POSIX leaves the value of pid
passed as the first argument to posix_spawn undefined if the function
call fails [1]:
Otherwise [if posix_spawn fails], no child process shall be created,
the value stored into the variable pointed to by a non-null pid is
unspecified, ...
Therefore, I think a behavior change is unavoidable if we want to use
posix_spawn. Falling back to no compression on temporary files seems
reasonable to me.
It seems better to me TBH.
One could be running a very long sort and re-installing
compressors underneath or something, in which case
you'd want the sort to be immune and keep going.
The attached patch is the same as before, plus your changes. Ontop of
that I added another NEWS entry, documentation, and a test case for if
--compress-program cannot be executed.
It looks great.
I've one more suggestion attached to track the last_result
in all cases, so that periodic issues were diagnosed.
For e.g. if the system went into resource constraints occasionally,
it would be better to diagnose each time that happened.
I notice issues with valgrind which are not blocking but interesting...
where it's waiting for longer when interacting with compress program:
$ time valgrind /bin/sort -S 1K --compress-program=gzip Makefile | wc -l
real 0m28.762s
user 0m55.544s
sys 0m44.364s
$ time valgrind src/sort -S 1K --compress-program=gzip Makefile | wc -l
real 1m6.639s
user 0m45.288s
sys 0m35.937s
and also it fails completely in the --compress-program=missing case:
$ valgrind src/sort -S 1K --compress-program=missing Makefile | wc -l
Process terminating with default action of signal 13 (SIGPIPE)
at 0x48E377E: __internal_syscall_cancel (cancellation.c:64)
by 0x48E37A3: __syscall_cancel (cancellation.c:75)
by 0x495E2BD: write (write.c:26)
by 0x48DF1C4: _IO_file_write@@GLIBC_2.2.5 (fileops.c:1241)
by 0x48DD015: new_do_write (fileops.c:483)
by 0x48DE140: _IO_do_write@@GLIBC_2.2.5 (fileops.c:460)
by 0x48DD84F: _IO_file_close_it@@GLIBC_2.2.5 (fileops.c:136)
by 0x48D09C4: fclose@@GLIBC_2.2.5 (iofclose.c:53)
by 0x4046E4: xfclose (sort.c:992)
by 0x4029D0: sort (sort.c:4151)
by 0x4029D0: main (sort.c:4919)
Maybe valgrind's remapping of clone3 -> clone is triggering that?
https://bugs.kde.org/show_bug.cgi?id=420906
I might get time to look into it.
Anyway without valgrind perf is slightly better with posix_spawn:
$ time src/sort -S 1K --compress-program=gzip Makefile >/dev/null
real 0m1.608s
user 0m2.323s
sys 0m2.438s
$ time /bin/sort -S 1K --compress-program=gzip Makefile >/dev/null
real 0m1.844s
user 0m2.454s
sys 0m3.654s
thanks!
Padraigdiff --git a/src/sort.c b/src/sort.c
index 6a8004904..7127f671b 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -1115,6 +1115,8 @@ pipe_child (pid_t *pid, int pipefds[2], int tempfd, bool decompress,
break;
else
{
+ /* [v]fork/clone are indicating resource constraints,
+ so back-off for a while and retry. */
xnanosleep (wait_retry);
wait_retry *= 2;
reap_exited ();
@@ -1164,7 +1166,6 @@ maybe_create_temp (FILE **pfp, bool survive_fd_exhaustion)
if (result != last_result)
error (0, result, _("could not run compress program %s"),
quoteaf (compress_program));
- last_result = result;
}
else
{
@@ -1174,6 +1175,8 @@ maybe_create_temp (FILE **pfp, bool survive_fd_exhaustion)
register_proc (node);
}
+
+ last_result = result;
}
*pfp = fdopen (tempfd, "w");