On 27/10/2025 20:21, Rob Landley wrote:
On 10/25/25 15:33, Collin Funk wrote:
Pádraig Brady <[email protected]> writes:

Therefore, I think a behavior change is unavoidable if we want to
use
posix_spawn. Falling back to no compression on temporary files seems
reasonable to me.

It seems better to me TBH.
One could be running a very long sort and re-installing
compressors underneath or something, in which case
you'd want the sort to be immune and keep going.

The attached patch is the same as before, plus your changes. Ontop
of
that I added another NEWS entry, documentation, and a test case for if
--compress-program cannot be executed.

It looks great.
I've one more suggestion attached to track the last_result
in all cases, so that periodic issues were diagnosed.
For e.g. if the system went into resource constraints occasionally,
it would be better to diagnose each time that happened.

Thanks, I pushed it with that change [1].

I notice issues with valgrind which are not blocking but interesting...

where it's waiting for longer when interacting with compress program:

    $ time valgrind /bin/sort -S 1K --compress-program=gzip Makefile | wc -l
    real  0m28.762s
    user  0m55.544s
    sys   0m44.364s

    $ time valgrind src/sort -S 1K --compress-program=gzip Makefile | wc -l
    real  1m6.639s
    user  0m45.288s
    sys   0m35.937s

and also it fails completely in the --compress-program=missing case:

    $ valgrind src/sort -S 1K --compress-program=missing Makefile | wc -l
     Process terminating with default action of signal 13 (SIGPIPE)
        at 0x48E377E: __internal_syscall_cancel (cancellation.c:64)
        by 0x48E37A3: __syscall_cancel (cancellation.c:75)
        by 0x495E2BD: write (write.c:26)
        by 0x48DF1C4: _IO_file_write@@GLIBC_2.2.5 (fileops.c:1241)
        by 0x48DD015: new_do_write (fileops.c:483)
        by 0x48DE140: _IO_do_write@@GLIBC_2.2.5 (fileops.c:460)
        by 0x48DD84F: _IO_file_close_it@@GLIBC_2.2.5 (fileops.c:136)
        by 0x48D09C4: fclose@@GLIBC_2.2.5 (iofclose.c:53)
        by 0x4046E4: xfclose (sort.c:992)
        by 0x4029D0: sort (sort.c:4151)
        by 0x4029D0: main (sort.c:4919)

Maybe valgrind's remapping of clone3 -> clone is triggering that?
https://bugs.kde.org/show_bug.cgi?id=420906
I might get time to look into it.

The SIGPIPE makes me think that posix_spawn is returning 0 as if it
suceeded, which is incorrect because 'missing' cannot be executed.

It's easy to get vfork()/exec() wrong in that case, but musl-libc does
it right: Have the parent create a CLOEXEC pipe() pair before the
vfork(), then have the child send a failure indicator through the pipe
after the exec fails. Otherwise the exec closes the pipe without sending
any data, which the parent treats as success.

This is a libc issue, not a caller issue.

Upon further investigation, this isn't a major bug in valgrind I think.
Its remapping is just triggering different behavior in sort.
valgrind does cause posix_spawn() to return success even when it can't exec(),
but that shouldn't matter too much.

Note the minor valgrind bug is already tracked at:
https://bugs.kde.org/show_bug.cgi?id=481679

The different behavior is actually indicating an issue with sort(1) I think.
With a failing compressor, sort(1) can exit in 3 ways:

1. directly from posix_spawn returning a failure
2. indirectly from wait(compressor) returning a failure
3. silently from write(failed_compressor) triggering a SIGPIPE

The last case is what worries me.
The failure isn't diagnosed. Also SIGPIPE is common and usually not diagnosed.
Whether 2 or 3 occurs depends on how much is written to the failed compressors.
I can trigger 3. 100% of the time here without valgrind
by increasing the buffer size (amount written) to 100K.

Consider for example:

  $ printf '%s\n' '#!/bin/sh' 'exit 1' > exit1 && chmod a+x exit1
  $ src/sort -S 100K --compress-program=./exit1 Makefile | head -n1
  $ echo $?
  0

No diagnostics or failure code etc.
We'll need to address that.

Reply via email to