Pádraig Brady <[email protected]> writes:

>> Therefore, I think a behavior change is unavoidable if we want to
>> use
>> posix_spawn. Falling back to no compression on temporary files seems
>> reasonable to me.
>
> It seems better to me TBH.
> One could be running a very long sort and re-installing
> compressors underneath or something, in which case
> you'd want the sort to be immune and keep going.
>
>> The attached patch is the same as before, plus your changes. Ontop
>> of
>> that I added another NEWS entry, documentation, and a test case for if
>> --compress-program cannot be executed.
>
> It looks great.
> I've one more suggestion attached to track the last_result
> in all cases, so that periodic issues were diagnosed.
> For e.g. if the system went into resource constraints occasionally,
> it would be better to diagnose each time that happened.

Thanks, I pushed it with that change [1].

> I notice issues with valgrind which are not blocking but interesting...
>
> where it's waiting for longer when interacting with compress program:
>
>   $ time valgrind /bin/sort -S 1K --compress-program=gzip Makefile | wc -l
>   real  0m28.762s
>   user  0m55.544s
>   sys   0m44.364s
>
>   $ time valgrind src/sort -S 1K --compress-program=gzip Makefile | wc -l
>   real  1m6.639s
>   user  0m45.288s
>   sys   0m35.937s
>
> and also it fails completely in the --compress-program=missing case:
>
>   $ valgrind src/sort -S 1K --compress-program=missing Makefile | wc -l
>    Process terminating with default action of signal 13 (SIGPIPE)
>       at 0x48E377E: __internal_syscall_cancel (cancellation.c:64)
>       by 0x48E37A3: __syscall_cancel (cancellation.c:75)
>       by 0x495E2BD: write (write.c:26)
>       by 0x48DF1C4: _IO_file_write@@GLIBC_2.2.5 (fileops.c:1241)
>       by 0x48DD015: new_do_write (fileops.c:483)
>       by 0x48DE140: _IO_do_write@@GLIBC_2.2.5 (fileops.c:460)
>       by 0x48DD84F: _IO_file_close_it@@GLIBC_2.2.5 (fileops.c:136)
>       by 0x48D09C4: fclose@@GLIBC_2.2.5 (iofclose.c:53)
>       by 0x4046E4: xfclose (sort.c:992)
>       by 0x4029D0: sort (sort.c:4151)
>       by 0x4029D0: main (sort.c:4919)
>
> Maybe valgrind's remapping of clone3 -> clone is triggering that?
> https://bugs.kde.org/show_bug.cgi?id=420906
> I might get time to look into it.

The SIGPIPE makes me think that posix_spawn is returning 0 as if it
suceeded, which is incorrect because 'missing' cannot be executed.

Here are the clone calls when using --compress-program=gzip:

    $ strace valgrind src/sort -S 1K --compress-program=gzip \
        Makefile 2>&1 | grep -c ^clone
    7338

Note that those are all clone and not clone3, as you mentioned.

When using --compress-program=missing, valgrind exits right after the
clone:

    $ strace valgrind src/sort -S 1K --compress-program=missing \
        Makefile 2>&1
    [...]
    clone(child_stack=NULL, flags=CLONE_VFORK|SIGCHLD==376076== 
    ==376076== HEAP SUMMARY:
    ==376076==     in use at exit: 2,032 bytes in 18 blocks
    ==376076==   total heap usage: 70 allocs, 52 frees, 40,144 bytes allocated
    [...]

I agree it isn't blocking since we can still use ASAN and UBSAN. But we
should certainly try to get it fixed.

Collin

[1] 
https://github.com/coreutils/coreutils/commit/2b3eb795c85633a4b808df44d19a4a9be5976c10

Reply via email to