On 25/10/2025 04:17, Collin Funk wrote:
Pádraig Brady <[email protected]> writes:

Here is the behavior without your proposed change:
      $ ./src/sort -S 1K --compress-program=missing input | wc -l
      sort: could not run compress program 'missing': No such file or directory
      0
      $ ./src/sort -S 1K --compress-program=missing input > /dev/null
      $ echo $?
      2
Here is the behavior with your proposed change:
      $ ./src/sort -S 1K --compress-program=missing input | wc -l
      sort: could not run compress program 'missing': No such file or directory
      100000
      $ ./src/sort -S 1K --compress-program=missing input > /dev/null
      sort: could not run compress program 'missing': No such file or directory
      $ echo $?
      0
That seems reasonable to me. But should the exit status be changed?
If
the warning is printed to standard error before the sorted lines are
written to standard output, I worry it will never be seen.

Yes that's a fair point re writing to stderr while exiting success.
Though in this case posix_spawn() is conflating many possible errors,
essentially all of which are non fatal, though you would want some
indication of them since they're non usual.
Given we minimize the duplicate errors I'm OK with the warnings.
You want some indication if the use typos the --compress-program at least.
I suppose you could special case certain errnos, but in this case
I think it might be ok to just warn.

Technically this is a change in behavior, so I suppose one would need
to change coreutils.texi describe --compress-program more accurately.
Alternatively one could warn with EAGAIN/ENOMEM and fail otherwise,
but I'm inclined to keep the logic and change the docs,
so that we're consistent over all errnos.

I thought about setting *pid to -1 at the start of pipe_child and then
checking it in the calling function to see if fork succeeded. That was
my idea to preserve the current behavior. But that does not work for the
glibc implementation which uses clone, since there is no separate fork
and exec call.

The larger issue with that idea is that POSIX leaves the value of pid
passed as the first argument to posix_spawn undefined if the function
call fails [1]:

     Otherwise [if posix_spawn fails], no child process shall be created,
     the value stored into the variable pointed to by a non-null pid is
     unspecified, ...

Therefore, I think a behavior change is unavoidable if we want to use
posix_spawn. Falling back to no compression on temporary files seems
reasonable to me.

It seems better to me TBH.
One could be running a very long sort and re-installing
compressors underneath or something, in which case
you'd want the sort to be immune and keep going.


The attached patch is the same as before, plus your changes. Ontop of
that I added another NEWS entry, documentation, and a test case for if
--compress-program cannot be executed.

It looks great.
I've one more suggestion attached to track the last_result
in all cases, so that periodic issues were diagnosed.
For e.g. if the system went into resource constraints occasionally,
it would be better to diagnose each time that happened.

I notice issues with valgrind which are not blocking but interesting...

where it's waiting for longer when interacting with compress program:

  $ time valgrind /bin/sort -S 1K --compress-program=gzip Makefile | wc -l
  real  0m28.762s
  user  0m55.544s
  sys   0m44.364s

  $ time valgrind src/sort -S 1K --compress-program=gzip Makefile | wc -l
  real  1m6.639s
  user  0m45.288s
  sys   0m35.937s

and also it fails completely in the --compress-program=missing case:

  $ valgrind src/sort -S 1K --compress-program=missing Makefile | wc -l
   Process terminating with default action of signal 13 (SIGPIPE)
      at 0x48E377E: __internal_syscall_cancel (cancellation.c:64)
      by 0x48E37A3: __syscall_cancel (cancellation.c:75)
      by 0x495E2BD: write (write.c:26)
      by 0x48DF1C4: _IO_file_write@@GLIBC_2.2.5 (fileops.c:1241)
      by 0x48DD015: new_do_write (fileops.c:483)
      by 0x48DE140: _IO_do_write@@GLIBC_2.2.5 (fileops.c:460)
      by 0x48DD84F: _IO_file_close_it@@GLIBC_2.2.5 (fileops.c:136)
      by 0x48D09C4: fclose@@GLIBC_2.2.5 (iofclose.c:53)
      by 0x4046E4: xfclose (sort.c:992)
      by 0x4029D0: sort (sort.c:4151)
      by 0x4029D0: main (sort.c:4919)

Maybe valgrind's remapping of clone3 -> clone is triggering that?
https://bugs.kde.org/show_bug.cgi?id=420906
I might get time to look into it.

Anyway without valgrind perf is slightly better with posix_spawn:

  $ time src/sort -S 1K --compress-program=gzip Makefile >/dev/null
  real  0m1.608s
  user  0m2.323s
  sys   0m2.438s

  $ time /bin/sort -S 1K --compress-program=gzip Makefile >/dev/null
  real  0m1.844s
  user  0m2.454s
  sys   0m3.654s

thanks!
Padraig
diff --git a/src/sort.c b/src/sort.c
index 6a8004904..7127f671b 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -1115,6 +1115,8 @@ pipe_child (pid_t *pid, int pipefds[2], int tempfd, bool decompress,
         break;
       else
         {
+          /* [v]fork/clone are indicating resource constraints,
+             so back-off for a while and retry.  */
           xnanosleep (wait_retry);
           wait_retry *= 2;
           reap_exited ();
@@ -1164,7 +1166,6 @@ maybe_create_temp (FILE **pfp, bool survive_fd_exhaustion)
           if (result != last_result)
             error (0, result, _("could not run compress program %s"),
                    quoteaf (compress_program));
-          last_result = result;
         }
       else
         {
@@ -1174,6 +1175,8 @@ maybe_create_temp (FILE **pfp, bool survive_fd_exhaustion)
 
           register_proc (node);
         }
+
+      last_result = result;
     }
 
   *pfp = fdopen (tempfd, "w");

Reply via email to