Re: double-fork issue on Windows on ARM64

2024-05-21 Thread Jeremy Drake via Cygwin-developers
On Mon, 20 May 2024, Jeremy Drake wrote:

> Today, I was attempting to look at the TerminateThread situation.  The
> call in question comes from the attempt to terminate the wait_thread of a
> chld_procs entry.  I noticed elsewhere in cygwin code (flock.cc) that
> CancelSynchronousIo was being called, and that stood out to me because
> chances are that the wait thread (if running) is going to be blocked in
> ReadFile.  I am testing with the following hack, and so far have not seen
> a hang


I left my reproducer running with this hack, and I did eventually get an
error exit from the intermediate subprocess, which seems to have been a
signal 11 (if I'm reading the status from waitpid correctly).
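
For reference, a minimal sketch of how a raw waitpid status can be decoded with the standard macros. The helper name describe_status is hypothetical. With the traditional encoding used by Linux and Cygwin, a raw status of 11 means "killed by signal 11", not "exited with code 11". WEXITSTATUS is only meaningful when WIFEXITED is true.

```c
#include <stdio.h>
#include <sys/wait.h>

/* Hypothetical helper: write a human-readable description of a
   waitpid() status into buf and return buf. */
static const char *describe_status(int status, char *buf, size_t len)
{
    if (WIFEXITED(status))
        snprintf(buf, len, "exited with code %d", WEXITSTATUS(status));
    else if (WIFSIGNALED(status))
        snprintf(buf, len, "killed by signal %d", WTERMSIG(status));
    else
        snprintf(buf, len, "unrecognized status 0x%x", (unsigned) status);
    return buf;
}
```

Calling this on the status filled in by waitpid, instead of printing the raw integer, removes the guesswork about whether a value is an exit code or a signal number.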

What I noticed today is that in pinfo.cc, near the end of proc_waiter, it
sets vchild.wait_thread = NULL;.  If my reading of this is correct, that
does nothing useful, because vchild is a stack variable there and the
function returns soon after.  I think that was *intended* to
NULL out the wait_thread pointer that would be checked in proc_terminate,
but there's no guarantee that the entry in chld_procs is in the same place
at the end of proc_waiter as it was at the start (so arg may point to
some other pinfo entirely).
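
The two concerns above can be sketched in plain C. Everything here is a hypothetical stand-in: struct entry is not Cygwin's pinfo, and removing an entry by moving the last one into its slot is only an assumption about how a table like chld_procs might be compacted.

```c
#include <stddef.h>

/* Hypothetical stand-in for a child table entry (not Cygwin's pinfo). */
struct entry {
    int pid;
    void *wait_thread;
};

/* Mirrors the suspected problem: the write lands in a stack copy and
   is lost when the function returns.  Returns the pid of the copy. */
static int clear_on_copy(struct entry *arg)
{
    struct entry vchild = *arg;   /* local copy on this function's stack */
    vchild.wait_thread = NULL;    /* does not touch the table entry */
    return vchild.pid;
}

/* Writes through the pointer, so whatever entry arg points at NOW is
   changed, which may no longer be the entry the thread started with. */
static void clear_through_pointer(struct entry *arg)
{
    arg->wait_thread = NULL;
}

/* Assumed compaction scheme: move the last entry into slot i.
   Returns the new entry count. */
static size_t remove_entry(struct entry *table, size_t count, size_t i)
{
    table[i] = table[count - 1];
    return count - 1;
}
```

If the table is compacted between the start and the end of the waiter, a pointer saved at the start silently refers to a different child, so even writing through the pointer would clear the wrong wait_thread.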

Does any of this make any sense, or am I barking up the wrong tree here?


Re: double-fork issue on Windows on ARM64

2024-05-20 Thread Jeremy Drake via Cygwin-developers
On Wed, 8 May 2024, Jeremy Drake wrote:

> (this is the same issue discussed in
> https://cygwin.com/pipermail/cygwin-patches/2024q1/012621.html)
>
> On MSYS2, running on Windows on ARM64 only, we've been plagued by issues
> with processes hanging up.  Usually pacman, when it is trying to validate
> signatures with gpgme.  When a process is hung in this way, no debugger
> seems to be able to attach properly.
>
> > anecdotally, the hang occurs when _exit() calls
> > proc_terminate() which is then blocked by a call to TerminateThread()
> > with an invalid thread handle (for more details, see
> > https://github.com/msys2/msys2-autobuild/issues/62#issuecomment-1951796327).


As a follow-up to this: the text quoted above came from a proposed
workaround that simply commented out the double-fork behavior in gpgme.
After reading a comment in the code and doing some research online, it
seems the double-fork is an accepted idiom on POSIX for avoiding having to
wait for the (grand)child without creating zombie processes.  I was unable
to see zombie processes in ps or /proc/, but I did see extra cygpid.*
entries in /proc/sys/BaseNamedObjects/cygwin* which seem to be much the
same thing.
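
For readers unfamiliar with the idiom, here is a minimal sketch of a double fork. The helper name spawn_detached is hypothetical and this is not gpgme's actual code. The parent only waits for the short-lived intermediate child; the grandchild is orphaned as soon as the intermediate child exits, so the original parent never has to reap it.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `path` detached via a double fork.
   Returns 0 if the intermediate child exited cleanly, -1 otherwise. */
static int spawn_detached(const char *path)
{
    pid_t child = fork();
    if (child == -1)
        return -1;

    if (child == 0) {
        /* Intermediate child: fork the real worker, then exit at once. */
        pid_t grandchild = fork();
        if (grandchild == 0) {
            char *const args[] = { (char *) path, NULL };
            execv(path, args);
            _exit(127);                     /* only reached if exec fails */
        }
        _exit(grandchild == -1 ? 1 : 0);
    }

    /* Parent: reap only the intermediate child.  The grandchild is
       reparented by the system and reaped there, so no zombie is left
       for this process to collect. */
    int status;
    if (waitpid(child, &status, 0) == -1)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

A call such as spawn_detached("/bin/true") returns as soon as the intermediate child has exited, which is exactly the point where the hang discussed in this thread shows up: the intermediate child calls _exit() immediately after forking.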

Today, I was attempting to look at the TerminateThread situation.  The
call in question comes from the attempt to terminate the wait_thread of a
chld_procs entry.  I noticed elsewhere in cygwin code (flock.cc) that
CancelSynchronousIo was being called, and that stood out to me because
chances are that the wait thread (if running) is going to be blocked in
ReadFile.  I am testing with the following hack, and so far have not seen
a hang:

diff --git a/winsup/cygwin/sigproc.cc b/winsup/cygwin/sigproc.cc
index 86e4e607ab..020906d797 100644
--- a/winsup/cygwin/sigproc.cc
+++ b/winsup/cygwin/sigproc.cc
@@ -410,7 +410,7 @@ proc_terminate ()
   if (!have_execed || !have_execed_cygwin)
     chld_procs[i]->ppid = 1;
   if (chld_procs[i].wait_thread)
-    chld_procs[i].wait_thread->terminate_thread ();
+    CancelSynchronousIo (chld_procs[i].wait_thread->thread_handle ());
   /* Release memory associated with this process unless it is 'myself'.
      'myself' is only in the chld_procs table when we've execed.  We
      reach here when the next process has finished initializing but we


As a disclaimer, I am having a hard time wrapping my head around this
code, so I don't know what kind of side-effects this may have, but it does
seem to help the hang, without resulting in "zombie" cygpid entries.

(Note that I first tried
+ if (CancelSynchronousIo (chld_procs[i].wait_thread->thread_handle ()))
+   chld_procs[i].wait_thread->detach ();
+ else
+   chld_procs[i].wait_thread->terminate_thread ();
but that resulted in a (debuggable) hang in detach, because the
cygthread::stub was waiting for thread_sync, while cygthread::detach was
waiting for *this.  That appears to be because this is an auto-releasing
cygthread.  It kind of bothers me that there is no synchronization to be
sure the wait_thread is done shutting down before moving on in
proc_terminate, but I don't see an obvious way in the current structure).


double-fork issue on Windows on ARM64

2024-05-08 Thread Jeremy Drake via Cygwin-developers
(this is the same issue discussed in
https://cygwin.com/pipermail/cygwin-patches/2024q1/012621.html)

On MSYS2, running on Windows on ARM64 only, we've been plagued by issues
with processes hanging up.  Usually pacman, when it is trying to validate
signatures with gpgme.  When a process is hung in this way, no debugger
seems to be able to attach properly.

After many months of off-and-on progress trying to debug this, we've
*finally* got an idea of what behavior is causing this, and a standalone
reproducer that runs on Cygwin.

> A common symptom is that the hanging process has a command-line that is
> identical to its parent process' command-line (indicating that it has
> been fork()ed), and anecdotally, the hang occurs when _exit() calls
> proc_terminate() which is then blocked by a call to TerminateThread()
> with an invalid thread handle (for more details, see
> https://github.com/msys2/msys2-autobuild/issues/62#issuecomment-1951796327).
>
> In my tests, I found that the hanging process is spawned from
> _gpgme_io_spawn() which lets the child process immediately spawn another
> child. That seems like a fantastic way to find timing-related bugs in
> the MSYS2/Cygwin runtime.
>
> As a work-around, it does seem to help if we avoid that double-fork.

That led me to make the attached reproducer, which is based on the code
from _gpgme_io_spawn.  I originally expected that this would require some
timing adjustment, hence the defines to change the binary and argument (I
expected to use /bin/sleep and different values).  It turns out, this
reproduces readily with /bin/true.

I build this with `gcc -ggdb -o testfork testfork.c`, and this reproduces:
* on a Raspberry Pi 4 running Windows 10, with an i686 msys2 runtime
* on a QC710 running Windows 11 23H2, with x86_64 msys2 runtime (this
  seems to reproduce it most readily).
* on a Hyper-V virtual machine on Dev Kit 2023 running Windows 11 23H2,
  with x86_64 msys2 runtime or Cygwin 3.5.3.  This seems to require running
  two instances of testfork.exe at the same time.

When attaching to the hung process, gdb shows
(gdb) i thr
  Id   Target Id                Frame
  1    Thread 6516.0xbe8        error return
/cygdrive/d/a/scallywag/gdb/gdb-13.2-1.x86_64/src/gdb-13.2/gdb/windows-nat.c:748
was 31: A device attached to the system is not functioning.
0x in ?? ()
  2    Thread 6516.0x1b28 "sig" 0x7ff8051a8a64 in ?? ()
* 3    Thread 6516.0x12b4       0x7ff8051b4374 in ?? ()


Let me know if I can provide any additional info, or anything else we can
try to help debug this.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Both can be overridden at build time, e.g. -DBINARY='"/bin/sleep"'. */
#ifndef BINARY
#define BINARY "/bin/true"
#endif

#ifndef ARG
#define ARG "0.1"
#endif

int main(int argc, char ** argv)
{
	while (1)
	{
		int pid;
		printf("Starting group of 100x " BINARY " " ARG "\n");
		for (int i = 0; i < 100; ++i)
		{
			pid = fork();
			if (pid == -1)
			{
				perror("fork error");
				return 1;
			}
			else if (pid == 0)
			{
				/* Intermediate child: fork again, then exit immediately. */
				if ((pid = fork()) == 0)
				{
					/* Grandchild: exec the target binary. */
					char * const args[] = {BINARY, ARG, NULL};
					execv(BINARY, args);
					perror("execv failed");
					_exit(5);
				}
				if (pid == -1)
				{
					perror("inner fork error");
					_exit(1);
				}
				else
				{
					_exit(0);
				}
			}
			else
			{
				/* Parent: wait only for the intermediate child. */
				int status;
				if (waitpid(pid, &status, 0) == -1)
				{
					perror("waitpid error");
					return 2;
				}
				else if (status != 0)
				{
					/* Prints the raw wait status, not a decoded exit code. */
					fprintf(stderr, "subprocess exited non-zero: %d\n", status);
					return WEXITSTATUS(status);
				}
			}
		}
	}
	return 0;
}