Re: cmake hang solution?

2022-05-15 Thread Thomas Klausner
On Tue, Apr 05, 2022 at 03:31:33PM +0200, Joerg Sonnenberger wrote:
> On Tue, Apr 05, 2022 at 02:43:08PM +0200, Thomas Klausner wrote:
> > Hi!
> > 
> > OpenBSD fixed a problem in libuv related to kevent, here's a writeup
> > by Ted Unangst.
> > 
> > https://flak.tedunangst.com/post/sometimes-the-knote-comes-early
> > 
> > Is that the same problem we're seeing with cmake hangs?
> 
> I don't think it matches the hangs I analyzed.

chuq debugged a cmake hang on my machine and came up with a patch that
seems to reliably fix it for me (I could reproduce it within a couple
of hours; now it hasn't happened for nearly two days).

Please update to libuv-1.44.1nb1.

https://mail-index.netbsd.org/pkgsrc-changes/2022/05/15/msg254562.html

I see no cmake hangs any longer (on a system with the uncommitted
pthread patch and this libuv version).
 Thomas


Re: cmake hang solution?

2022-05-09 Thread Michael van Elst
mlel...@serpens.de (Michael van Elst) writes:

>I'm currently testing:
>Index: lib/libpthread/pthread.c
>===================================================================
>RCS file: /cvsroot/src/lib/libpthread/pthread.c,v
>retrieving revision 1.153.2.1
>diff -p -u -r1.153.2.1 pthread.c
>--- lib/libpthread/pthread.c	26 Jan 2020 10:55:16 -0000	1.153.2.1
>+++ lib/libpthread/pthread.c	3 May 2022 09:22:58 -0000
>@@ -430,6 +430,8 @@ pthread_create(pthread_t *thread, const 
> 	 * only be one thread before it becomes true.
> 	 */
> 	if (pthread__started == 0) {
>+		_lwp_park(CLOCK_REALTIME, 0, NULL,
>+		    pthread__self()->pt_lid, NULL, NULL);
> 		pthread__start();
> 		pthread__started = 1;
> 	}


With this patch there has so far been no cmake/guile type hang in
several pbulk runs.

The samba / python hangup seems to be caused by something else.



Re: cmake hang solution?

2022-05-04 Thread Chuck Silvers
On Mon, May 02, 2022 at 08:16:42PM +0200, Manuel Bouyer wrote:
> On Mon, May 02, 2022 at 11:13:45AM -0700, Chuck Silvers wrote:
> > it looks like the diff won't apply as-is, but I think the concept still 
> > applies.
> > 
> > note that there have been a LOT of changes in libpthread since netbsd-9,
> > and some of those changes also claim to fix problems where threads hang
> > waiting on locks and/or condvars.  it would be more useful to test
> > with a HEAD libpthread (which I'll guess requires a HEAD libc too).

wiz@ tells me that he has reproduced the cmake hang with HEAD kernel and
userland plus my libpthread patch, so no one else needs to try that now.
I'll work with him to figure out what else is still causing the cmake
variation of this.

-Chuck


Re: cmake hang solution?

2022-05-03 Thread Michael van Elst
mlel...@serpens.de (Michael van Elst) writes:

>c...@chuq.com (Chuck Silvers) writes:

>>> would this apply to netbsd-9 too ? The hang I'm seeing is on a system
>>> with a HEAD kernel and a netbsd-9 userland 

>>it looks like the diff won't apply as-is, but I think the concept still 
>>applies.

>I'm currently testing:


Not really successful. I got a hang with all threads but one
in park and the remaining one waiting for kqueue.

Parked threads look like:

#0  0x7d0ce8ca220a in ___lwp_park60 () from /usr/lib/libc.so.12
#1  0x7d0ce9e0addf in pthread_cond_timedwait () from /usr/lib/libpthread.so.1
#2  0x7d0ce9aa1412 in std::condition_variable::wait(std::unique_lock&) () from /usr/lib/libstdc++.so.9
#3  0xa48f7685 in cmWorkerPoolInternal::Work(unsigned int) ()
#4  0x7d0ce9a9f5aa in ?? () from /usr/lib/libstdc++.so.9
#5  0x7d0ce9e0c757 in ?? () from /usr/lib/libpthread.so.1
#6  0x7d0ce8c87e10 in ?? () from /usr/lib/libc.so.12
#7  0x0000000000000000 in ?? ()

except for one:

#0  0x7d0ce8ca220a in ___lwp_park60 () from /usr/lib/libc.so.12
#1  0x7d0ce9e0addf in pthread_cond_timedwait () from /usr/lib/libpthread.so.1
#2  0x7d0ce9aa1412 in std::condition_variable::wait(std::unique_lock&) () from /usr/lib/libstdc++.so.9
#3  0xa48f812e in cmWorkerPoolWorker::RunProcess(cmWorkerPool::ProcessResultT&, std::vector, std::allocator >, std::allocator, std::allocator > > > const&, std::__cxx11::basic_string, std::allocator > const&) ()
#4  0xa489731c in (anonymous namespace)::cmQtAutoMocUicT::JobT::RunProcess(cmQtAutoGen::GenT, cmWorkerPool::ProcessResultT&, std::vector, std::allocator >, std::allocator, std::allocator > > > const&, std::__cxx11::basic_string, std::allocator >*) ()
#5  0xa48aadc8 in (anonymous namespace)::cmQtAutoMocUicT::JobCompileMocT::Process() ()
#6  0xa48f75b6 in cmWorkerPoolInternal::Work(unsigned int) ()
#7  0x7d0ce9a9f5aa in ?? () from /usr/lib/libstdc++.so.9
#8  0x7d0ce9e0c757 in ?? () from /usr/lib/libpthread.so.1
#9  0x7d0ce8c87e10 in ?? () from /usr/lib/libc.so.12
#10 0x0000000000000000 in ?? ()

The kqueue wait looks like:

#0  0x7d0ce8c42e6a in _sys___kevent50 () from /usr/lib/libc.so.12
#1  0x7d0ce9e07cb9 in __kevent50 () from /usr/lib/libpthread.so.1
#2  0x7d0ceac1c650 in uv__io_poll () from /usr/pkg/lib/libuv.so.1
#3  0x7d0ceac0de48 in uv_run () from /usr/pkg/lib/libuv.so.1
#4  0xa48f70a8 in cmWorkerPoolInternal::Process() ()
#5  0xa48f7102 in cmWorkerPool::Process(void*) ()
#6  0xa48ae2bf in (anonymous namespace)::cmQtAutoMocUicT::Process() ()
#7  0xa4b8b2ea in cmQtAutoGenerator::Run(std::basic_string_view >, std::basic_string_view >) ()
#8  0xa489b150 in cmQtAutoMocUic(std::basic_string_view >, std::basic_string_view >) ()
#9  0xa4838c3c in cmcmd::ExecuteCMakeCommand(std::vector, std::allocator >, std::allocator, std::allocator > > > const&, std::unique_ptr >) ()
#10 0xa4c3c6bc in main ()




Re: cmake hang solution?

2022-05-03 Thread Chuck Silvers
On Mon, May 02, 2022 at 10:12:02PM +0200, Michael van Elst wrote:
> On Sun, May 01, 2022 at 01:24:01PM -0700, Chuck Silvers wrote:
> > On Tue, Apr 05, 2022 at 02:10:36PM -0000, Michael van Elst wrote:
> > > I see both in almost every pbulk run.
> > 
> > please try this patch for the cmake variation of this hang:
> > 
> > http://www.netbsd.org/~chs/diff.pthread-park-stuck.1
> 
> The bulk builds use the latest release, i.e netbsd-9, but that
> patch is for -current. Do you think that netbsd-9 has the same
> issue and the patch could be reworked for the older code ?

netbsd-9 almost certainly has the issue that this latest patch
is trying to fix.  the patch above is just a one-liner and can easily
be adapted for netbsd-9.  but this patch alone is probably not enough...
there have been many changes to libpthread in HEAD since netbsd-9,
claiming to fix problems with the same kind of hang symptoms,
and some of those would almost certainly be needed in netbsd-9 as well.
I don't know for sure which of the changes from HEAD are needed,
so that's why I'm suggesting we just test with all of those changes,
ie. test the HEAD libpthread.

-Chuck


Re: cmake hang solution?

2022-05-03 Thread Chuck Silvers
On Mon, May 02, 2022 at 08:16:42PM +0200, Manuel Bouyer wrote:
> On Mon, May 02, 2022 at 11:13:45AM -0700, Chuck Silvers wrote:
> > it looks like the diff won't apply as-is, but I think the concept still 
> > applies.
> > 
> > note that there have been a LOT of changes in libpthread since netbsd-9,
> > and some of those changes also claim to fix problems where threads hang
> > waiting on locks and/or condvars.  it would be more useful to test
> > with a HEAD libpthread (which I'll guess requires a HEAD libc too).
> 
> the goal is to build the official netbsd-9 packages, so that's not an option

it's possible to have individual binaries use shared libraries from a
non-standard location, such that you could have cmake itself use the
HEAD shared libraries to run, but everything that the build process
builds would link against the netbsd-9 shared libraries in the
normal location.  I've done this kind of thing before by just editing
an executable binary to change the RPATH in the ELF headers
(or if the target executable has no RPATH then editing the binary
to use an alternate ld.elf_so, and having the alternate ld.elf_so use
an alternate default RPATH.)  if you don't like that then you could also
change the cmake build to link the cmake executable with an alternate rpath.

I'm only suggesting all of this as a way to test if all of these hang bugs
are fixed in HEAD, not that you should run your netbsd-9 userland system
with this hacky setup indefinitely.

-Chuck


Re: cmake hang solution?

2022-05-03 Thread Michael van Elst
c...@chuq.com (Chuck Silvers) writes:

>> would this apply to netbsd-9 too ? The hang I'm seeing is on a system
>> with a HEAD kernel and a netbsd-9 userland 

>it looks like the diff won't apply as-is, but I think the concept still 
>applies.

I'm currently testing:

Index: lib/libpthread/pthread.c
===================================================================
RCS file: /cvsroot/src/lib/libpthread/pthread.c,v
retrieving revision 1.153.2.1
diff -p -u -r1.153.2.1 pthread.c
--- lib/libpthread/pthread.c	26 Jan 2020 10:55:16 -0000	1.153.2.1
+++ lib/libpthread/pthread.c	3 May 2022 09:22:58 -0000
@@ -430,6 +430,8 @@ pthread_create(pthread_t *thread, const 
 	 * only be one thread before it becomes true.
 	 */
 	if (pthread__started == 0) {
+		_lwp_park(CLOCK_REALTIME, 0, NULL,
+		    pthread__self()->pt_lid, NULL, NULL);
 		pthread__start();
 		pthread__started = 1;
 	}




Re: cmake hang solution?

2022-05-02 Thread Michael van Elst
On Sun, May 01, 2022 at 01:24:01PM -0700, Chuck Silvers wrote:
> On Tue, Apr 05, 2022 at 02:10:36PM -0000, Michael van Elst wrote:
> > I see both in almost every pbulk run.
> 
> please try this patch for the cmake variation of this hang:
> 
> http://www.netbsd.org/~chs/diff.pthread-park-stuck.1

The bulk builds use the latest release, i.e. netbsd-9, but that
patch is for -current. Do you think that netbsd-9 has the same
issue and the patch could be reworked for the older code?


Greetings,
-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: cmake hang solution?

2022-05-02 Thread Manuel Bouyer
On Mon, May 02, 2022 at 11:13:45AM -0700, Chuck Silvers wrote:
> it looks like the diff won't apply as-is, but I think the concept still 
> applies.
> 
> note that there have been a LOT of changes in libpthread since netbsd-9,
> and some of those changes also claim to fix problems where threads hang
> waiting on locks and/or condvars.  it would be more useful to test
> with a HEAD libpthread (which I'll guess requires a HEAD libc too).

the goal is to build the official netbsd-9 packages, so that's not an option

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: cmake hang solution?

2022-05-02 Thread Chuck Silvers
On Mon, May 02, 2022 at 12:55:39PM +0200, Manuel Bouyer wrote:
> On Sun, May 01, 2022 at 01:24:01PM -0700, Chuck Silvers wrote:
> > On Tue, Apr 05, 2022 at 02:10:36PM -0000, Michael van Elst wrote:
> > > w...@netbsd.org (Thomas Klausner) writes:
> > > >I never saw the cmake hang myself. I still see hangs in guile.
> > > 
> > > 
> > > I see both in almost every pbulk run.
> > 
> > 
> > please try this patch for the cmake variation of this hang:
> > 
> > http://www.netbsd.org/~chs/diff.pthread-park-stuck.1
> 
> would this apply to netbsd-9 too ? The hang I'm seeing is on a system
> with a HEAD kernel and a netbsd-9 userland 

it looks like the diff won't apply as-is, but I think the concept still applies.

note that there have been a LOT of changes in libpthread since netbsd-9,
and some of those changes also claim to fix problems where threads hang
waiting on locks and/or condvars.  it would be more useful to test
with a HEAD libpthread (which I'll guess requires a HEAD libc too).

-Chuck


Re: cmake hang solution?

2022-05-02 Thread Joerg Sonnenberger
On Sun, May 01, 2022 at 01:24:01PM -0700, Chuck Silvers wrote:
> the above patch simply resolves the symbol for the libpthread call to 
> _lwp_park
> while the process is still single-threaded, by calling the _lwp_park to both
> unpark and park itself, which just returns immediately.  after that,
> the libpthread calls to _lwp_park will no longer call into rtld,
> so attempts to unpark libpthread can no longer be lost.

An easier fix might be to just use -z now with a comment about ensuring
that the system calls are not resolved lazily due to the overlap with
rtld.

Joerg


Re: cmake hang solution?

2022-05-02 Thread Manuel Bouyer
On Sun, May 01, 2022 at 01:24:01PM -0700, Chuck Silvers wrote:
> On Tue, Apr 05, 2022 at 02:10:36PM -0000, Michael van Elst wrote:
> > w...@netbsd.org (Thomas Klausner) writes:
> > >I never saw the cmake hang myself. I still see hangs in guile.
> > 
> > 
> > I see both in almost every pbulk run.
> 
> 
> please try this patch for the cmake variation of this hang:
> 
> http://www.netbsd.org/~chs/diff.pthread-park-stuck.1

would this apply to netbsd-9 too ? The hang I'm seeing is on a system
with a HEAD kernel and a netbsd-9 userland 

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: cmake hang solution?

2022-05-01 Thread Chuck Silvers
On Tue, Apr 05, 2022 at 02:10:36PM -0000, Michael van Elst wrote:
> w...@netbsd.org (Thomas Klausner) writes:
> >I never saw the cmake hang myself. I still see hangs in guile.
> 
> 
> I see both in almost every pbulk run.


please try this patch for the cmake variation of this hang:

http://www.netbsd.org/~chs/diff.pthread-park-stuck.1

this fixes the problem as seen with taylor's strdup / jemalloc reproducer,
and paulg reports it fixes the hang in building guile too.

what's going on here is that the first time that libpthread calls _lwp_park
when it wants to sleep to wait for a mutex, instead of calling the libc
function directly it first has to call into rtld to resolve the symbol.
the rtld code will call _rtld_shared_enter(), which might also need to sleep
using _lwp_park to wait for the rtld internal lock.  the rtld internal usage
of _lwp_park can accidentally consume an unpark from another thread that was
intended for the libpthread code, and if that happens, then when rtld is done
resolving the symbol and libpthread actually calls the real _lwp_park function,
the unpark has been lost and the libpthread call to _lwp_park will
sleep forever.

the above patch simply resolves the symbol for the libpthread call to _lwp_park
while the process is still single-threaded, by calling the _lwp_park to both
unpark and park itself, which just returns immediately.  after that,
the libpthread calls to _lwp_park will no longer call into rtld,
so attempts to unpark libpthread can no longer be lost.

when I wrote that patch I thought it would be a complete fix, but upon reading
the previous email threads about this problem I saw a mention of signals,
and signal handlers can call a wide variety of functions, so this patch
turns out to only fix what is probably the most common way that this problem
manifests.

the nature of lwp_park/unpark (with just a single "already unparked" flag
per thread in the kernel) is such that they cannot safely be used in a nested
fashion like they are in libpthread and rtld, so we need to change one or both
of these callers to use some other primitive to implement sleeping to wait
for a lock, such as futexes, which do not have this kind of per-thread flag
that prevents safe nested usage.

-Chuck


Re: cmake hang solution?

2022-04-05 Thread Michael van Elst
w...@netbsd.org (Thomas Klausner) writes:

>On Tue, Apr 05, 2022 at 01:26:54PM -0000, Michael van Elst wrote:
>> w...@netbsd.org (Thomas Klausner) writes:
>> 
>> >OpenBSD fixed a problem in libuv related to kevent, here's a writeup
>> >by Ted Unangst.
>> 
>> >https://flak.tedunangst.com/post/sometimes-the-knote-comes-early
>> 
>> The "closing the race" part seems to be included in the latest kevent
>> changes in -current.

>Good to know.

>pkgsrc already has libuv 1.44.1 which contains the libuv fix.

>I never saw the cmake hang myself. I still see hangs in guile.


I see both in almost every pbulk run.



Re: cmake hang solution?

2022-04-05 Thread Thomas Klausner
On Tue, Apr 05, 2022 at 01:26:54PM -0000, Michael van Elst wrote:
> w...@netbsd.org (Thomas Klausner) writes:
> 
> >OpenBSD fixed a problem in libuv related to kevent, here's a writeup
> >by Ted Unangst.
> 
> >https://flak.tedunangst.com/post/sometimes-the-knote-comes-early
> 
> The "closing the race" part seems to be included in the latest kevent
> changes in -current.

Good to know.

pkgsrc already has libuv 1.44.1 which contains the libuv fix.

I never saw the cmake hang myself. I still see hangs in guile.
 Thomas


Re: cmake hang solution?

2022-04-05 Thread Joerg Sonnenberger
On Tue, Apr 05, 2022 at 02:43:08PM +0200, Thomas Klausner wrote:
> Hi!
> 
> OpenBSD fixed a problem in libuv related to kevent, here's a writeup
> by Ted Unangst.
> 
> https://flak.tedunangst.com/post/sometimes-the-knote-comes-early
> 
> Is that the same problem we're seeing with cmake hangs?

I don't think it matches the hangs I analyzed.

Joerg


Re: cmake hang solution?

2022-04-05 Thread Michael van Elst
w...@netbsd.org (Thomas Klausner) writes:

>OpenBSD fixed a problem in libuv related to kevent, here's a writeup
>by Ted Unangst.

>https://flak.tedunangst.com/post/sometimes-the-knote-comes-early

The "closing the race" part seems to be included in the latest kevent
changes in -current.