Re: cmake hang solution?
On Tue, Apr 05, 2022 at 03:31:33PM +0200, Joerg Sonnenberger wrote:
> On Tue, Apr 05, 2022 at 02:43:08PM +0200, Thomas Klausner wrote:
> > Hi!
> >
> > OpenBSD fixed a problem in libuv related to kevent, here's a writeup
> > by Ted Unangst.
> >
> > https://flak.tedunangst.com/post/sometimes-the-knote-comes-early
> >
> > Is that the same problem we're seeing with cmake hangs?
>
> I don't think it matches the hangs I analyzed.

chuq debugged a cmake hang on my machine and came up with a patch that
seems to reliably fix it for me (I could reproduce it in a couple of
hours before; now it has not happened for nearly two days).

Please update to libuv-1.44.1nb1.
https://mail-index.netbsd.org/pkgsrc-changes/2022/05/15/msg254562.html

I see no cmake hangs any longer (on a system with the uncommitted
pthread patch and this libuv version).

 Thomas
Re: cmake hang solution?
mlel...@serpens.de (Michael van Elst) writes:

>I'm currently testing:

>Index: lib/libpthread/pthread.c
>===
>RCS file: /cvsroot/src/lib/libpthread/pthread.c,v
>retrieving revision 1.153.2.1
>diff -p -u -r1.153.2.1 pthread.c
>--- lib/libpthread/pthread.c 26 Jan 2020 10:55:16 - 1.153.2.1
>+++ lib/libpthread/pthread.c 3 May 2022 09:22:58 -
>@@ -430,6 +430,8 @@ pthread_create(pthread_t *thread, const
>          * only be one thread before it becomes true.
>          */
>         if (pthread__started == 0) {
>+                _lwp_park(CLOCK_REALTIME, 0, NULL,
>+                    pthread__self()->pt_lid, NULL, NULL);
>                 pthread__start();
>                 pthread__started = 1;
>         }

With this patch there was so far no cmake/guile type hang in several
pbulk runs. The samba / python hangup seems to be caused by something
else.
Re: cmake hang solution?
On Mon, May 02, 2022 at 08:16:42PM +0200, Manuel Bouyer wrote:
> On Mon, May 02, 2022 at 11:13:45AM -0700, Chuck Silvers wrote:
> > it looks like the diff won't apply as-is, but I think the concept still
> > applies.
> >
> > note that there have been a LOT of changes in libpthread since netbsd-9,
> > and some of those changes also claim to fix problems where threads hang
> > waiting on locks and/or condvars. it would be more useful to test
> > with a HEAD libpthread (which I'll guess requires a HEAD libc too).

wiz@ tells me that he has reproduced the cmake hang with HEAD kernel
and userland plus my libpthread patch, so no one else needs to try that
now. I'll work with him to figure out what else is still causing the
cmake variation of this.

-Chuck
Re: cmake hang solution?
mlel...@serpens.de (Michael van Elst) writes:

>c...@chuq.com (Chuck Silvers) writes:

>>> would this apply to netbsd-9 too ? The hang I'm seeing is on a system
>>> with a HEAD kernel and a netbsd-9 userland

>>it looks like the diff won't apply as-is, but I think the concept still
>>applies.

>I'm currently testing:

Not really successful. I got a hang with all but one thread in park
and one waiting for kqueue.

Parked threads look like:

#0 0x7d0ce8ca220a in ___lwp_park60 () from /usr/lib/libc.so.12
#1 0x7d0ce9e0addf in pthread_cond_timedwait () from /usr/lib/libpthread.so.1
#2 0x7d0ce9aa1412 in std::condition_variable::wait(std::unique_lock&) () from /usr/lib/libstdc++.so.9
#3 0xa48f7685 in cmWorkerPoolInternal::Work(unsigned int) ()
#4 0x7d0ce9a9f5aa in ?? () from /usr/lib/libstdc++.so.9
#5 0x7d0ce9e0c757 in ?? () from /usr/lib/libpthread.so.1
#6 0x7d0ce8c87e10 in ?? () from /usr/lib/libc.so.12
#7 0x in ?? ()

except for one:

#0 0x7d0ce8ca220a in ___lwp_park60 () from /usr/lib/libc.so.12
#1 0x7d0ce9e0addf in pthread_cond_timedwait () from /usr/lib/libpthread.so.1
#2 0x7d0ce9aa1412 in std::condition_variable::wait(std::unique_lock&) () from /usr/lib/libstdc++.so.9
#3 0xa48f812e in cmWorkerPoolWorker::RunProcess(cmWorkerPool::ProcessResultT&, std::vector, std::allocator >, std::allocator, std::allocator > > > const&, std::__cxx11::basic_string, std::allocator > const&) ()
#4 0xa489731c in (anonymous namespace)::cmQtAutoMocUicT::JobT::RunProcess(cmQtAutoGen::GenT, cmWorkerPool::ProcessResultT&, std::vector, std::allocator >, std::allocator, std::allocator > > > const&, std::__cxx11::basic_string, std::allocator >*) ()
#5 0xa48aadc8 in (anonymous namespace)::cmQtAutoMocUicT::JobCompileMocT::Process() ()
#6 0xa48f75b6 in cmWorkerPoolInternal::Work(unsigned int) ()
#7 0x7d0ce9a9f5aa in ?? () from /usr/lib/libstdc++.so.9
#8 0x7d0ce9e0c757 in ?? () from /usr/lib/libpthread.so.1
#9 0x7d0ce8c87e10 in ?? () from /usr/lib/libc.so.12
#10 0x in ?? ()

The kqueue wait looks like:

#0 0x7d0ce8c42e6a in _sys___kevent50 () from /usr/lib/libc.so.12
#1 0x7d0ce9e07cb9 in __kevent50 () from /usr/lib/libpthread.so.1
#2 0x7d0ceac1c650 in uv.io_poll () from /usr/pkg/lib/libuv.so.1
#3 0x7d0ceac0de48 in uv_run () from /usr/pkg/lib/libuv.so.1
#4 0xa48f70a8 in cmWorkerPoolInternal::Process() ()
#5 0xa48f7102 in cmWorkerPool::Process(void*) ()
#6 0xa48ae2bf in (anonymous namespace)::cmQtAutoMocUicT::Process() ()
#7 0xa4b8b2ea in cmQtAutoGenerator::Run(std::basic_string_view >, std::basic_string_view >) ()
#8 0xa489b150 in cmQtAutoMocUic(std::basic_string_view >, std::basic_string_view >) ()
#9 0xa4838c3c in cmcmd::ExecuteCMakeCommand(std::vector, std::allocator >, std::allocator, std::allocator > > > const&, std::unique_ptr >) ()
#10 0xa4c3c6bc in main ()
Re: cmake hang solution?
On Mon, May 02, 2022 at 10:12:02PM +0200, Michael van Elst wrote:
> On Sun, May 01, 2022 at 01:24:01PM -0700, Chuck Silvers wrote:
> > On Tue, Apr 05, 2022 at 02:10:36PM -, Michael van Elst wrote:
> > > I see both in almost every pbulk run.
> >
> > please try this patch for the cmake variation of this hang:
> >
> > http://www.netbsd.org/~chs/diff.pthread-park-stuck.1
>
> The bulk builds use the latest release, i.e netbsd-9, but that
> patch is for -current. Do you think that netbsd-9 has the same
> issue and the patch could be reworked for the older code ?

netbsd-9 almost certainly has the issue that this latest patch is
trying to fix. the patch above is just a one-liner and can easily be
adapted for netbsd-9.

but this patch alone is probably not enough... there have been many
changes to libpthread in HEAD since netbsd-9, claiming to fix problems
with the same kind of hang symptoms, and some of those would almost
certainly be needed in netbsd-9 as well. I don't know for sure which of
the changes from HEAD are needed, so that's why I'm suggesting we just
test with all of those changes, ie. test the HEAD libpthread.

-Chuck
Re: cmake hang solution?
On Mon, May 02, 2022 at 08:16:42PM +0200, Manuel Bouyer wrote:
> On Mon, May 02, 2022 at 11:13:45AM -0700, Chuck Silvers wrote:
> > it looks like the diff won't apply as-is, but I think the concept still
> > applies.
> >
> > note that there have been a LOT of changes in libpthread since netbsd-9,
> > and some of those changes also claim to fix problems where threads hang
> > waiting on locks and/or condvars. it would be more useful to test
> > with a HEAD libpthread (which I'll guess requires a HEAD libc too).
>
> the goal is to build the official netbsd-9 packages, so that's not an option

it's possible to have individual binaries use shared libraries from a
non-standard location, such that you could have cmake itself use the
HEAD shared libraries to run, while everything that the build process
builds would link against the netbsd-9 shared libraries in the normal
location.

I've done this kind of thing before by just editing an executable
binary to change the RPATH in the ELF headers (or if the target
executable has no RPATH then editing the binary to use an alternate
ld.elf_so, and having the alternate ld.elf_so use an alternate default
RPATH.) if you don't like that then you could also change the cmake
build to link the cmake executable with an alternate rpath.

I'm only suggesting all of this as a way to test if all of these hang
bugs are fixed in HEAD, not that you should run your netbsd-9 userland
system with this hacky setup indefinitely.

-Chuck
Re: cmake hang solution?
c...@chuq.com (Chuck Silvers) writes:

>> would this apply to netbsd-9 too ? The hang I'm seeing is on a system
>> with a HEAD kernel and a netbsd-9 userland

>it looks like the diff won't apply as-is, but I think the concept still
>applies.

I'm currently testing:

Index: lib/libpthread/pthread.c
===
RCS file: /cvsroot/src/lib/libpthread/pthread.c,v
retrieving revision 1.153.2.1
diff -p -u -r1.153.2.1 pthread.c
--- lib/libpthread/pthread.c 26 Jan 2020 10:55:16 - 1.153.2.1
+++ lib/libpthread/pthread.c 3 May 2022 09:22:58 -
@@ -430,6 +430,8 @@ pthread_create(pthread_t *thread, const
          * only be one thread before it becomes true.
          */
         if (pthread__started == 0) {
+                _lwp_park(CLOCK_REALTIME, 0, NULL,
+                    pthread__self()->pt_lid, NULL, NULL);
                 pthread__start();
                 pthread__started = 1;
         }
Re: cmake hang solution?
On Sun, May 01, 2022 at 01:24:01PM -0700, Chuck Silvers wrote:
> On Tue, Apr 05, 2022 at 02:10:36PM -, Michael van Elst wrote:
> > I see both in almost every pbulk run.
>
> please try this patch for the cmake variation of this hang:
>
> http://www.netbsd.org/~chs/diff.pthread-park-stuck.1

The bulk builds use the latest release, i.e netbsd-9, but that
patch is for -current. Do you think that netbsd-9 has the same
issue and the patch could be reworked for the older code ?

Greetings,
--
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
Re: cmake hang solution?
On Mon, May 02, 2022 at 11:13:45AM -0700, Chuck Silvers wrote:
> it looks like the diff won't apply as-is, but I think the concept still
> applies.
>
> note that there have been a LOT of changes in libpthread since netbsd-9,
> and some of those changes also claim to fix problems where threads hang
> waiting on locks and/or condvars. it would be more useful to test
> with a HEAD libpthread (which I'll guess requires a HEAD libc too).

the goal is to build the official netbsd-9 packages, so that's not an option

--
Manuel Bouyer
NetBSD: 26 years of experience will always make the difference
--
Re: cmake hang solution?
On Mon, May 02, 2022 at 12:55:39PM +0200, Manuel Bouyer wrote:
> On Sun, May 01, 2022 at 01:24:01PM -0700, Chuck Silvers wrote:
> > On Tue, Apr 05, 2022 at 02:10:36PM -, Michael van Elst wrote:
> > > w...@netbsd.org (Thomas Klausner) writes:
> > > >I never saw the cmake hang myself. I still see hangs in guile.
> > >
> > > I see both in almost every pbulk run.
> >
> > please try this patch for the cmake variation of this hang:
> >
> > http://www.netbsd.org/~chs/diff.pthread-park-stuck.1
>
> would this apply to netbsd-9 too ? The hang I'm seeing is on a system
> with a HEAD kernel and a netbsd-9 userland

it looks like the diff won't apply as-is, but I think the concept still
applies.

note that there have been a LOT of changes in libpthread since netbsd-9,
and some of those changes also claim to fix problems where threads hang
waiting on locks and/or condvars. it would be more useful to test
with a HEAD libpthread (which I'll guess requires a HEAD libc too).

-Chuck
Re: cmake hang solution?
On Sun, May 01, 2022 at 01:24:01PM -0700, Chuck Silvers wrote:
> the above patch simply resolves the symbol for the libpthread call to
> _lwp_park
> while the process is still single-threaded, by calling the _lwp_park to both
> unpark and park itself, which just returns immediately. after that,
> the libpthread calls to _lwp_park will no longer call into rtld,
> so attempts to unpark libpthread can no longer be lost.

An easier fix might be to just link with -z now, with a comment about
ensuring that the system calls are not resolved lazily due to the
overlap with rtld.

Joerg
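
The difference between lazy and eager symbol binding discussed above can be sketched in plain C. This is only a model, not rtld code: the function pointer `park_slot` stands in for an unresolved PLT/GOT entry, `lazy_stub` for the first-call detour through the runtime linker, and `bind_now` for what either the `_lwp_park` warm-up call in the patch or linking with `-z now` would achieve. All names in the block are hypothetical.

```c
#include <assert.h>

typedef int (*park_fn)(void);

/* Counts how often the modelled "rtld" slow path runs. */
static int resolver_calls;

/* Stand-in for the real, already-resolved function. */
static int real_park(void) { return 0; }

static int lazy_stub(void);

/* Like an unresolved PLT/GOT slot: initially points at the resolver stub. */
static park_fn park_slot = lazy_stub;

/* First call through the slot lands here. A real runtime linker would
 * take its internal lock at this point, which is where the nested
 * park/unpark usage described in the thread can occur. */
static int lazy_stub(void)
{
    resolver_calls++;
    park_slot = real_park;   /* patch the slot so later calls go direct */
    return park_slot();
}

/* What a caller such as libpthread does: call through the slot. */
int call_park(void) { return park_slot(); }

/* Eager binding: resolve before any contended call can happen. */
void bind_now(void) { park_slot = real_park; }

/* Restore the initial lazy state of the model. */
void reset_model(void) { park_slot = lazy_stub; resolver_calls = 0; }

int resolver_call_count(void) { return resolver_calls; }

/* Usage: lazy binding enters the resolver once; eager binding never. */
void check_binding_model(void)
{
    reset_model();
    call_park();
    call_park();
    assert(resolver_call_count() == 1);

    reset_model();
    bind_now();
    call_park();
    assert(resolver_call_count() == 0);
}
```

With eager binding the resolver is never entered from a call site that might already have a wakeup pending, which is the property both proposals in this thread aim for.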
Re: cmake hang solution?
On Sun, May 01, 2022 at 01:24:01PM -0700, Chuck Silvers wrote:
> On Tue, Apr 05, 2022 at 02:10:36PM -, Michael van Elst wrote:
> > w...@netbsd.org (Thomas Klausner) writes:
> > >I never saw the cmake hang myself. I still see hangs in guile.
> >
> > I see both in almost every pbulk run.
>
> please try this patch for the cmake variation of this hang:
>
> http://www.netbsd.org/~chs/diff.pthread-park-stuck.1

would this apply to netbsd-9 too ? The hang I'm seeing is on a system
with a HEAD kernel and a netbsd-9 userland

--
Manuel Bouyer
NetBSD: 26 years of experience will always make the difference
--
Re: cmake hang solution?
On Tue, Apr 05, 2022 at 02:10:36PM -, Michael van Elst wrote:
> w...@netbsd.org (Thomas Klausner) writes:
> >I never saw the cmake hang myself. I still see hangs in guile.
>
> I see both in almost every pbulk run.

please try this patch for the cmake variation of this hang:

http://www.netbsd.org/~chs/diff.pthread-park-stuck.1

this fixes the problem as seen with taylor's strdup / jemalloc
reproducer, and paulg reports it fixes the hang in building guile too.

what's going on here is that the first time that libpthread calls
_lwp_park when it wants to sleep to wait for a mutex, instead of
calling the libc function directly it first has to call into rtld to
resolve the symbol. the rtld code will call _rtld_shared_enter(),
which might also need to sleep using _lwp_park to wait for the rtld
internal lock. the rtld internal usage of _lwp_park can accidentally
consume an unpark from another thread that was intended for the
libpthread code, and if that happens, then when rtld is done resolving
the symbol and libpthread actually calls the real _lwp_park function,
the unpark has been lost and the libpthread call to _lwp_park will
sleep forever.

the above patch simply resolves the symbol for the libpthread call to
_lwp_park while the process is still single-threaded, by calling the
_lwp_park to both unpark and park itself, which just returns
immediately. after that, the libpthread calls to _lwp_park will no
longer call into rtld, so attempts to unpark libpthread can no longer
be lost.

when I wrote that patch I thought it would be a complete fix, but upon
reading the previous email threads about this problem I saw a mention
of signals, and signal handlers can call a wide variety of functions,
so this patch turns out to only fix what is probably the most common
way that this problem manifests.

the nature of lwp_park/unpark (with just a single "already unparked"
flag per thread in the kernel) is such that they cannot safely be used
in a nested fashion like they are in libpthread and rtld, so we need to
change one or both of these callers to use some other primitive to
implement sleeping to wait for a lock, such as futexes, which do not
have this kind of per-thread flag that prevents safe nested usage.

-Chuck
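
The lost-unpark sequence described above can be illustrated with a small single-threaded model in C. It does not call the real _lwp_park; `struct lwp_model`, `model_park`, `model_unpark` and `outer_park_would_hang` are hypothetical names, and the kernel state is reduced to the single "already unparked" flag mentioned in the message. The model also assumes the wakeup intended for libpthread arrives before libpthread reaches its own park call.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one thread's park state: a single flag. */
struct lwp_model {
    bool unpark_pending;
};

/* Model of unpark: record that a wakeup was delivered (not counted). */
static void model_unpark(struct lwp_model *t)
{
    t->unpark_pending = true;
}

/* Model of park: returns true if it would return immediately because
 * a pending unpark was consumed, false if the thread would block. */
static bool model_park(struct lwp_model *t)
{
    if (t->unpark_pending) {
        t->unpark_pending = false;
        return true;
    }
    return false;
}

/* One thread wants to sleep on a mutex; the wakeup for it arrives
 * early. If the park symbol is not yet resolved and the inner lock is
 * contended, the inner (rtld-like) park consumes the flag first.
 * Returns true if the outer (libpthread-like) park would block with
 * no wakeup left, i.e. hang. */
bool outer_park_would_hang(bool symbol_already_resolved,
                           bool inner_lock_contended)
{
    struct lwp_model self = { false };

    model_unpark(&self);            /* wakeup meant for the outer park */

    if (!symbol_already_resolved && inner_lock_contended)
        (void)model_park(&self);    /* inner park steals the wakeup */

    return !model_park(&self);      /* outer park: blocks if flag gone */
}

/* Usage: the three cases discussed in the thread. */
void check_scenarios(void)
{
    assert(outer_park_would_hang(false, true));    /* lazy + contended */
    assert(!outer_park_would_hang(true, true));    /* pre-resolved */
    assert(!outer_park_would_hang(false, false));  /* uncontended */
}
```

Because the flag is a single bit rather than a counter or a per-lock wait word, the model has no way to tell which of the two nested waiters a wakeup was meant for, which is the limitation the futex suggestion above is aimed at.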
Re: cmake hang solution?
w...@netbsd.org (Thomas Klausner) writes:

>On Tue, Apr 05, 2022 at 01:26:54PM -, Michael van Elst wrote:
>> w...@netbsd.org (Thomas Klausner) writes:
>>
>> >OpenBSD fixed a problem in libuv related to kevent, here's a writeup
>> >by Ted Unangst.
>>
>> >https://flak.tedunangst.com/post/sometimes-the-knote-comes-early
>>
>> The "closing the race" part seems to be included in the latest kevent
>> changes in -current.

>Good to know.

>pkgsrc already has libuv 1.44.1 which contains the libuv fix.

>I never saw the cmake hang myself. I still see hangs in guile.

I see both in almost every pbulk run.
Re: cmake hang solution?
On Tue, Apr 05, 2022 at 01:26:54PM -, Michael van Elst wrote:
> w...@netbsd.org (Thomas Klausner) writes:
>
> >OpenBSD fixed a problem in libuv related to kevent, here's a writeup
> >by Ted Unangst.
>
> >https://flak.tedunangst.com/post/sometimes-the-knote-comes-early
>
> The "closing the race" part seems to be included in the latest kevent
> changes in -current.

Good to know.

pkgsrc already has libuv 1.44.1 which contains the libuv fix.

I never saw the cmake hang myself. I still see hangs in guile.

 Thomas
Re: cmake hang solution?
On Tue, Apr 05, 2022 at 02:43:08PM +0200, Thomas Klausner wrote:
> Hi!
>
> OpenBSD fixed a problem in libuv related to kevent, here's a writeup
> by Ted Unangst.
>
> https://flak.tedunangst.com/post/sometimes-the-knote-comes-early
>
> Is that the same problem we're seeing with cmake hangs?

I don't think it matches the hangs I analyzed.

Joerg
Re: cmake hang solution?
w...@netbsd.org (Thomas Klausner) writes:

>OpenBSD fixed a problem in libuv related to kevent, here's a writeup
>by Ted Unangst.

>https://flak.tedunangst.com/post/sometimes-the-knote-comes-early

The "closing the race" part seems to be included in the latest kevent
changes in -current.