(this is the same issue discussed in
https://cygwin.com/pipermail/cygwin-patches/2024q1/012621.html)

On MSYS2, and only when running on Windows on ARM64, we've been plagued by
processes hanging.  The usual victim is pacman, while it is validating
signatures with gpgme.  When a process hangs in this way, no debugger
seems to be able to attach to it properly.

After many months of off-and-on progress trying to debug this, we've
*finally* got an idea of what behavior is causing this, and a standalone
reproducer that runs on Cygwin.

> A common symptom is that the hanging process has a command-line that is
> identical to its parent process' command-line (indicating that it has
> been fork()ed), and anecdotally, the hang occurs when _exit() calls
> proc_terminate() which is then blocked by a call to TerminateThread()
> with an invalid thread handle (for more details, see
> https://github.com/msys2/msys2-autobuild/issues/62#issuecomment-1951796327).
>
> In my tests, I found that the hanging process is spawned from
> _gpgme_io_spawn() which lets the child process immediately spawn another
> child. That seems like a fantastic way to find timing-related bugs in
> the MSYS2/Cygwin runtime.
>
> As a work-around, it does seem to help if we avoid that double-fork.

That led me to make the attached reproducer, which is based on the code
from _gpgme_io_spawn.  I originally expected that this would require some
timing adjustment, hence the defines to change the binary and argument (I
expected to use /bin/sleep and different values).  It turns out, this
reproduces readily with /bin/true.

I build this with `gcc -ggdb -o testfork testfork.c`, and it reproduces:
* on a Raspberry Pi 4 running Windows 10, with an i686 msys2 runtime
* on a QC710 running Windows 11 23H2, with an x86_64 msys2 runtime (this
  seems to reproduce it most readily)
* on a Hyper-V virtual machine on a Dev Kit 2023 running Windows 11 23H2,
  with an x86_64 msys2 runtime or Cygwin 3.5.3 (this seems to require
  running two instances of testfork.exe at the same time)
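
For contrast with the double fork that the reproducer exercises, the
work-around quoted above (avoiding the double fork) could look roughly like
the following.  This is only a sketch, not code from gpgme:
`spawn_single_fork` is a hypothetical name, and it assumes the caller is
willing to wait for the child itself.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not taken from gpgme: run `path` with a single
 * fork() + execv() and return the raw wait status, or -1 on error. */
static int spawn_single_fork(const char *path, char *const argv[])
{
	pid_t pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0)
	{
		/* Child: exec directly, with no intermediate process. */
		execv(path, argv);
		_exit(127); /* only reached if execv() failed */
	}

	/* Parent: reap the one and only child. */
	int status;
	if (waitpid(pid, &status, 0) == -1)
		return -1;
	return status;
}
```

Unlike the double fork, this leaves a child that the caller has to reap,
which is presumably what the extra fork is there to avoid.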

When attaching to the hung process, gdb shows:
(gdb) i thr
  Id   Target Id                Frame
  1    Thread 6516.0xbe8        error return
/cygdrive/d/a/scallywag/gdb/gdb-13.2-1.x86_64/src/gdb-13.2/gdb/windows-nat.c:748
was 31: A device attached to the system is not functioning.
0x0000000000000000 in ?? ()
  2    Thread 6516.0x1b28 "sig" 0x00007ff8051a8a64 in ?? ()
* 3    Thread 6516.0x12b4       0x00007ff8051b4374 in ?? ()


Let me know if I can provide any additional info, or anything else we can
try to help debug this.

testfork.c (attached reproducer):
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef BINARY
#define BINARY "/bin/true"
#endif

#ifndef ARG
#define ARG "0.1"
#endif

int main(void)
{
	/* Repeat until something fails; there is no normal exit. */
	while (1)
	{
		pid_t pid;
		printf("Starting group of 100x " BINARY " " ARG "\n");
		for (int i = 0; i < 100; ++i)
		{
			pid = fork();
			if (pid == -1)
			{
				perror("fork error");
				return 1;
			}
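			else if (pid == 0)
			{
				/* First child: fork again and exit right away,
				 * mirroring the double fork in _gpgme_io_spawn. */
				if ((pid = fork()) == 0)
				{
					/* Grandchild: replace this process with BINARY. */
					char * const args[] = {BINARY, ARG, NULL};
					execv(BINARY, args);
					perror("execv failed");
					_exit(5);
				}
				if (pid == -1)
				{
					perror("inner fork error");
					_exit(1);
				}
				else
				{
					/* Do not wait for the grandchild. */
					_exit(0);
				}
			}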
			else
			{
				int status;
				if (waitpid(pid, &status, 0) == -1)
				{
					perror("waitpid error");
					return 2;
				}
				else if (status != 0)
				{
					fprintf(stderr, "subprocess exited non-zero: %d\n", status);
					return WEXITSTATUS(status);
				}
			}
		}
	}
	return 0;
}
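
The reproducer prints the raw wait status, which packs the exit code and
any terminating signal into one integer.  If a decoded message is
preferred, something along these lines might do.  It is a sketch only:
`describe_wait_status` is a hypothetical helper and not part of the
attachment.

```c
#include <stdio.h>
#include <sys/wait.h>

/* Hypothetical helper: write a readable form of a raw wait status into
 * buf.  Returns whatever snprintf() returns. */
static int describe_wait_status(int status, char *buf, size_t len)
{
	if (WIFEXITED(status))
		return snprintf(buf, len, "exited with code %d",
				WEXITSTATUS(status));
	if (WIFSIGNALED(status))
		return snprintf(buf, len, "killed by signal %d",
				WTERMSIG(status));
	return snprintf(buf, len, "unrecognized wait status 0x%x",
			(unsigned)status);
}
```

Checking WIFEXITED() first also avoids reading WEXITSTATUS() from a status
that describes a signal death, where its value is not meaningful.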
