Hi,

On 2022-08-31 11:11:54 -0700, Andres Freund wrote:
> > If the above are addressed, I think this will be just about at the
> > point where the above patches can be committed.
>
> Woo!

There was a lot less progress over the last ~week than I had hoped. The
reason is that I was trying to figure out the reason for the occasional
failures of ecpg tests getting compiled when building on windows in CI, with
msbuild. I went into many layers of rabbitholes while investigating. Wasting
an absurd amount of time.


The problem: Occasionally ecpg test files would fail to compile, exiting with
-1073741819:

C:\BuildTools\MSBuild\Microsoft\VC\v160\Microsoft.CppCommon.targets(241,5): error MSB8066: Custom build for 'C:\cirrus\build\meson-private\custom_target.rule' exited with code -1073741819. [c:\cirrus\build\src\interfaces\ecpg\test\sql\3701597@@twophase.c@cus.vcxproj]

-1073741819 is 0xc0000005, which in turn is STATUS_ACCESS_VIOLATION, i.e. a
segfault. This happens in roughly 1/3 of the builds, but with "streaks" of
not happening and more frequently happening.

However, despite our CI images having a JIT debugger configured (~coredump
handler), no crash report was triggered. The problem never occurs in my
windows VM.

At first I thought that might be because it's an assertion failure or such,
which only causes a dump when a bunch of magic is done (see main.c). But
despite adding all the necessary magic to ecpg.exe, no dump. Unfortunately,
adding debug output reduces the frequency of the issue substantially.

Eventually I figured out that it's not actually ecpg.exe that is crashing. It
is meson's python wrapper around built binaries as part of the build (for
setting PATH, working directory, without running into cmd.exe issues). A
modified meson wrapper showed that ecpg.exe completes successfully.

The only thing the meson wrapper does after running the command is to call
sys.exit(returncode), and I had printed out the returncode, which is 0.

I looked through a lot of the python code, to see why no crashdump and no
details are forthcoming. There weren't any relevant
SetErrorMode(SEM_NOGPFAULTERRORBOX) calls.
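
As an aside on the exit code quoted above: the mapping from -1073741819 to
0xc0000005 is just the signed 32-bit value reinterpreted as unsigned. A
minimal Python sketch of that conversion follows. The helper name
decode_exit_code and the one-entry lookup table are hypothetical and exist
only for illustration; they are not part of meson or msbuild.

```python
# Illustrative lookup only; NOT an exhaustive NTSTATUS table.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def decode_exit_code(code):
    """Reinterpret a signed 32-bit exit code as an unsigned NTSTATUS value.

    Returns a (hex string, symbolic name) pair; the name is "unknown" for
    values missing from the small illustrative table above.
    """
    unsigned = code & 0xFFFFFFFF  # two's complement reinterpretation
    return "0x%08x" % unsigned, NTSTATUS_NAMES.get(unsigned, "unknown")


print(decode_exit_code(-1073741819))
```

For the value reported by msbuild this yields 0xc0000005 and
STATUS_ACCESS_VIOLATION, matching the reading given above.
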
I tried to set PYTHONFAULTHANDLER, but still no stack trace.

Next I suspected that cmd.exe might be crashing and causing the problem.
Modified meson to add 'echo %ERRORLEVEL%' to the msbuild custombuild. Which
indeed shows the STATUS_ACCESS_VIOLATION returncode after running python. So
it's not cmd.exe.

The problem even persisted when replacing meson's sys.exit() with
os._exit(), which indeed just calls _exit().

I tried to reproduce the problem using a python with debugging enabled. The
problem doesn't occur despite quite a few runs.

I found scattered other reports of this problem happening on windows. Went
down a few more rabbitholes. Too boring to repeat here.


At this point I finally figured out that the reason the crash reports don't
happen is that everything started by cirrus-ci on windows has an errormode of
SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX | SEM_NOOPENFILEERRORBOX.

A good bit later I figured out that while cirrus-ci isn't intentionally
setting that, golang does so *unconditionally* on windows:
https://github.com/golang/go/blob/54182ff54a687272dd7632c3a963e036ce03cb7c/src/runtime/signal_windows.go#L14
https://github.com/golang/go/blob/54182ff54a687272dd7632c3a963e036ce03cb7c/src/runtime/os_windows.go#L553

Argh. I should have checked what the error mode is earlier, but this is just
very sneaky.


So I modified meson to change the errormode and tried to reproduce the issue
again, to finally get a stackdump. And tried again. And again. Without a
single relevant failure (I saw tests fail in ways that are discussed on the
list, but that's irrelevant here).

I've run this through enough attempts by now that I'm quite confident that
the problem does not occur when the errormode does not include
SEM_NOOPENFILEERRORBOX. I'll want a few more runs to be certain, but...


Given that the problem appears to happen after _exit() is called, and only
when SEM_NOOPENFILEERRORBOX is set, it seems likely to be an OS / C runtime
bug.
Presumably it's related to something that python does first, but I
don't see how anything could justify crashing only if SEM_NOOPENFILEERRORBOX
is set (rather than the opposite).

I have no idea how to debug this further, given that the problem is quite
rare (can't attach a debugger and wait), only happens when crashdumps are
prevented from happening (so no idea where it crashes) and is made less
common by debug printfs.


So for now the best way forward I can see is to change the error mode for CI
runs. Which is likely a good idea anyway, so we can see crashdumps for
binaries other than postgres.exe (which does SetErrorMode() internally).

I managed to do so by setting CIRRUS_SHELL to a python wrapper around cmd.exe
that does SetErrorMode(). I'm sure there are easier ways, but I couldn't
figure out any.


I'd like to reclaim my time. But I'm afraid nobody will be listening to that
plea...


Greetings,

Andres Freund
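
For reference, a wrapper of the kind described above could look roughly like
the sketch below. This is an illustration under stated assumptions, not the
actual script used in CI: the function names, the choice of error mode 0
(the system default) and the way arguments are forwarded to cmd.exe are all
assumptions. SetErrorMode() is called through ctypes, and child processes
inherit the parent's error mode unless they are created with
CREATE_DEFAULT_ERROR_MODE.

```python
import subprocess
import sys

# Error mode flags as documented for the Win32 SetErrorMode() API.
ERROR_MODE_FLAGS = {
    "SEM_FAILCRITICALERRORS": 0x0001,
    "SEM_NOGPFAULTERRORBOX": 0x0002,
    "SEM_NOALIGNMENTFAULTEXCEPT": 0x0004,
    "SEM_NOOPENFILEERRORBOX": 0x8000,
}


def describe_error_mode(mode):
    """Return the names of the documented flags that are set in `mode`."""
    return [name for name, bit in ERROR_MODE_FLAGS.items() if mode & bit]


def set_error_mode(mode):
    """Set the process error mode and return the previous one.

    Returns None on non-Windows platforms, where the API does not exist.
    """
    if sys.platform != "win32":
        return None
    import ctypes
    return ctypes.windll.kernel32.SetErrorMode(mode)


def run_shell(argv, mode=0):
    """Hypothetical entry point: fix up the error mode, then run cmd.exe.

    mode=0 (system default) is an assumption; pick whatever mode is wanted.
    """
    previous = set_error_mode(mode)
    if previous is not None:
        print("previous error mode:", describe_error_mode(previous),
              file=sys.stderr)
    # cmd.exe, and everything it starts, inherits the error mode set above.
    return subprocess.call(["cmd.exe"] + list(argv))


# The wrapper is only meaningful on Windows, so only run it there.
if __name__ == "__main__" and sys.platform == "win32":
    sys.exit(run_shell(sys.argv[1:]))
```

Printing the previous mode is optional, but it makes an inherited mode such
as the 0x8003 combination described above visible in the CI log.
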