Re: double-fork issue on Windows on ARM64
On Mon, 20 May 2024, Jeremy Drake wrote:

> Today, I was attempting to look at the TerminateThread situation. The
> call in question comes from the attempt to terminate the wait_thread of a
> chld_procs entry. I noticed elsewhere in cygwin code (flock.cc) that
> CancelSynchronousIo was being called, and that stood out to me because
> chances are that the wait thread (if running) is going to be blocked in
> ReadFile. I am testing with the following hack, and so far have not seen
> a hang

I left my reproducer running with this hack, and I did eventually get an
error exit from the intermediate subprocess, which seems to have been a
signal 11 (if I'm reading the status from waitpid correctly).

What I noticed today is that in pinfo.cc, near the end of proc_waiter, it
sets vchild.wait_thread = NULL;. If my reading of this is correct, that
does nothing useful, because vchild is a stack variable there and the
function returns soon after. I think that what that was *intended* to do
was to NULL out the wait_thread pointer that would be checked in
proc_terminate, but there's no guarantee that the entry in chld_procs is
in the same place at the end of proc_waiter as it was at the start (so arg
may point to some other pinfo entirely).

Does any of this make any sense, or am I barking up the wrong tree here?
Re: double-fork issue on Windows on ARM64
On Wed, 8 May 2024, Jeremy Drake wrote:

> (this is the same issue discussed in
> https://cygwin.com/pipermail/cygwin-patches/2024q1/012621.html)
>
> On MSYS2, running on Windows on ARM64 only, we've been plagued by issues
> with processes hanging up. Usually pacman, when it is trying to validate
> signatures with gpgme. When a process is hung in this way, no debugger
> seems to be able to attach properly.
>
> > anecdotally, the hang occurs when _exit() calls
> > proc_terminate() which is then blocked by a call to TerminateThread()
> > with an invalid thread handle (for more details, see
> > https://github.com/msys2/msys2-autobuild/issues/62#issuecomment-1951796327).

As a follow-up to this, that was from a proposed workaround of just
commenting out the double-fork behavior in gpgme. After reading a comment
in the code and doing some research online, it seems the double-fork is an
accepted idiom on posix to avoid having to wait for the (grand)child,
without creating zombie processes. I was unable to see zombie processes in
ps or /proc/, but I did see extra cygpid.* entries in
/proc/sys/BaseNamedObjects/cygwin* which seem to be much the same thing.

Today, I was attempting to look at the TerminateThread situation. The call
in question comes from the attempt to terminate the wait_thread of a
chld_procs entry. I noticed elsewhere in cygwin code (flock.cc) that
CancelSynchronousIo was being called, and that stood out to me because
chances are that the wait thread (if running) is going to be blocked in
ReadFile.
I am testing with the following hack, and so far have not seen a hang:

diff --git a/winsup/cygwin/sigproc.cc b/winsup/cygwin/sigproc.cc
index 86e4e607ab..020906d797 100644
--- a/winsup/cygwin/sigproc.cc
+++ b/winsup/cygwin/sigproc.cc
@@ -410,7 +410,7 @@ proc_terminate ()
           if (!have_execed || !have_execed_cygwin)
             chld_procs[i]->ppid = 1;
           if (chld_procs[i].wait_thread)
-            chld_procs[i].wait_thread->terminate_thread ();
+            CancelSynchronousIo (chld_procs[i].wait_thread->thread_handle ());
           /* Release memory associated with this process unless it is 'myself'.
              'myself' is only in the chld_procs table when we've execed. We
              reach here when the next process has finished initializing but we

As a disclaimer, I am having a hard time wrapping my head around this code,
so I don't know what kind of side-effects this may have, but it does seem
to help the hang, without resulting in "zombie" cygpid entries.

(Note that I first tried

+            if (CancelSynchronousIo (chld_procs[i].wait_thread->thread_handle ()))
+              chld_procs[i].wait_thread->detach ();
+            else
+              chld_procs[i].wait_thread->terminate_thread ();

but that resulted in a (debuggable) hang in detach, because the
cygthread::stub was waiting for thread_sync, while cygthread::detach was
waiting for *this. That appears to be because this is an auto-releasing
cygthread. It kind of bothers me that there is no synchronization to be
sure the wait_thread is done shutting down before moving on in
proc_terminate, but I don't see an obvious way in the current structure).
double-fork issue on Windows on ARM64
(this is the same issue discussed in
https://cygwin.com/pipermail/cygwin-patches/2024q1/012621.html)

On MSYS2, running on Windows on ARM64 only, we've been plagued by issues
with processes hanging up. Usually pacman, when it is trying to validate
signatures with gpgme. When a process is hung in this way, no debugger
seems to be able to attach properly.

After many months of off-and-on progress trying to debug this, we've
*finally* got an idea of what behavior is causing this, and a standalone
reproducer that runs on Cygwin.

> A common symptom is that the hanging process has a command-line that is
> identical to its parent process' command-line (indicating that it has
> been fork()ed), and anecdotally, the hang occurs when _exit() calls
> proc_terminate() which is then blocked by a call to TerminateThread()
> with an invalid thread handle (for more details, see
> https://github.com/msys2/msys2-autobuild/issues/62#issuecomment-1951796327).
>
> In my tests, I found that the hanging process is spawned from
> _gpgme_io_spawn() which lets the child process immediately spawn another
> child. That seems like a fantastic way to find timing-related bugs in
> the MSYS2/Cygwin runtime.
>
> As a work-around, it does seem to help if we avoid that double-fork.

That led me to make the attached reproducer, which is based on the code
from _gpgme_io_spawn. I originally expected that this would require some
timing adjustment, hence the defines to change the binary and argument (I
expected to use /bin/sleep and different values). It turns out, this
reproduces readily with /bin/true.

I build this with `gcc -ggdb -o testfork testfork.c`, and this reproduces:
* on a Raspberry PI 4 running Windows 10, with an i686 msys2 runtime
* on a QC710 running Windows 11 23H2, with x86_64 msys2 runtime (this
  seems to reproduce it most readily).
* on a hyper-v virtual machine on Dev Kit 2023 running Windows 11 23H2,
  with x86_64 msys2 runtime or Cygwin 3.5.3.
This seems to require running two instances of testfork.exe at the same
time. When attaching to the hung process, gdb shows

(gdb) i thr
  Id   Target Id                 Frame
  1    Thread 6516.0xbe8         error return /cygdrive/d/a/scallywag/gdb/gdb-13.2-1.x86_64/src/gdb-13.2/gdb/windows-nat.c:748 was 31: A device attached to the system is not functioning.
0x in ?? ()
  2    Thread 6516.0x1b28 "sig"  0x7ff8051a8a64 in ?? ()
* 3    Thread 6516.0x12b4        0x7ff8051b4374 in ?? ()

Let me know if I can provide any additional info, or anything else we can
try to help debug this.

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

#ifndef BINARY
#define BINARY "/bin/true"
#endif
#ifndef ARG
#define ARG "0.1"
#endif

int main(int argc, char ** argv)
{
  while (1)
  {
    int pid;
    printf("Starting group of 100x " BINARY " " ARG "\n");
    for (int i = 0; i < 100; ++i)
    {
      pid = fork();
      if (pid == -1)
      {
        perror("fork error");
        return 1;
      }
      else if (pid == 0)
      {
        /* intermediate child: fork again, then exit right away */
        if ((pid = fork()) == 0)
        {
          /* grandchild: run the target binary */
          char * const args[] = {BINARY, ARG, NULL};
          execv(BINARY, args);
          perror("execv failed");
          _exit(5);
        }
        if (pid == -1)
        {
          perror("inner fork error");
          _exit(1);
        }
        else
        {
          _exit(0);
        }
      }
      else
      {
        /* parent: only the intermediate child is waited for */
        int status;
        if (waitpid(pid, &status, 0) == -1)
        {
          perror("waitpid error");
          return 2;
        }
        else if (status != 0)
        {
          fprintf(stderr, "subprocess exited non-zero: %d\n", status);
          return WEXITSTATUS(status);
        }
      }
    }
  }
  return 0;
}