Hey,

It seems a pretty old problem is showing itself again. I'm talking about the
SIGUSR2 signal :-(... Classlib's asynchronous signal reporter uses system
semaphores for synchronization, and hysem_wait is interrupted by the signal:

(gdb) p perror("sym_wait error:")
sym_wait error:: Interrupted system call

Do we have a good (universal) solution for such cases?

Thanks
Evgueni

On 11/15/06, Geir Magnusson Jr. <[EMAIL PROTECTED]> wrote:
Gregory Shimansky wrote:
> Evgueni Brevnov wrote:
>> hmmm.... strange. The patch was tested on multi-processor system
>> running SUSE9. I will check if the patch misses something. Anyway, we
>> need to wait with the patch submission until we are 100% sure how
>> hythread_monitor_init should behave.
>>
>> Thanks
>> Evgueni
>>
>> On 11/11/06, Gregory Shimansky <[EMAIL PROTECTED]> wrote:
>>> On Friday 10 November 2006 17:45 Evgueni Brevnov wrote:
>>> > Hi,
>>> >
>>> > While investigating deadlock scenario which is described in
>>> > HARMONY-2006 I found out one interesting thing. It turned out that DRL
>>> > implementation of hythread_monitor_init /
>>> > hythread_monitor_init_with_name initializes and acquires a monitor.
>>> > Original spec reads: "Acquire and initialize a new monitor from the
>>> > threading library...." AFAIU that doesn't mean to lock the monitor but
>>> > get it from the threading library. So the hythread_monitor_init should
>>> > not lock the monitor.
>>> >
>>> > Could somebody comment on that?
>>>
>>> It might be that semantic is different on different platforms which is
>>> probably even worse. Your patch in HARMONY-2149 breaks nearly all of
>>> acceptance tests on Linux while everything on Windows works (ok I
>>> tested on laptop with 1 processor while Linux was a HT server,
>>> sometimes it is important for threading).
>
> I've tried to investigate the problem but didn't find the end of it yet.
> The bug seems to be ubuntu specific (<joke>shall we maybe call this
> distribution buggy and move on?</joke>).

There is something odd about it, I'll admit... Remember the EOMEM bugs
I found in forking?

> I didn't reproduce it on gentoo, all tests work just fine.
>
> The bug looks like this, on tests gc.Force, gc.LOS, gc.List, gc.NPE,
> gc.PhantomReferenceTest, gc.WeakReferenceTest, stress.WeakHashMapTest VM
> segfaults.
> The stack looks like an infinite recursion of 4 stack frames:
>
> #0  0xb6dcb814 in null_java_reference_handler (signum=11,
>     info=0xb71a503c, context=0xb71a50bc) at
>     /nfs/ims/proj/drl/mrt1/users/gregory/Harmony/enhanced/drlvm/trunk/vm/vmcore/src/util/linux/signals_ia32.cpp:443
> #1  <signal handler called>
> #2  0xb6dcc20a in get_stack_addr () at
>     /nfs/ims/proj/drl/mrt1/users/gregory/Harmony/enhanced/drlvm/trunk/vm/vmcore/src/util/linux/signals_ia32.cpp:293
> #3  0xb6dcb6cd in check_stack_overflow (info=0xb71a546c, uc=0xb71a54ec)
>     at
>     /nfs/ims/proj/drl/mrt1/users/gregory/Harmony/enhanced/drlvm/trunk/vm/vmcore/src/util/linux/signals_ia32.cpp:399
> #4  0xb6dcb900 in null_java_reference_handler (signum=11,
>     info=0xb71a546c, context=0xb71a54ec) at
>     /nfs/ims/proj/drl/mrt1/users/gregory/Harmony/enhanced/drlvm/trunk/vm/vmcore/src/util/linux/signals_ia32.cpp:451
>
> and so on. The stack is very long. When I run VM with -Xtrace:signals I
> get a very long log of messages that "NPE or SOE detected at ...". The
> first time address always varies, but it appears to be memcpy. The next
> addresses are always the same, they point to get_stack_addr function.
>
> So I tried to find out why memcpy crashes in the first place. It appears
> to be a struct copy called from jsig_handler hysig.
> The stack looks like this (if I can trust gdb on ubuntu):
>
> #0  0xb7a9b9dc in memcpy () from /lib/tls/i686/cmov/libc.so.6
> #1  0xb7ba0fa0 in jsig_handler (sig=-1215196204, siginfo=0x0, uc=0x0)
>     at hysigunix.c:169
> #2  0xb7f9ec8b in asynchSignalReporter (userData=0x0) at hysignal.c:971
> #3  0xb7baa8ef in thread_start_proc (thd=0x807a8e8, p_args=0x807a8d8)
>     at
>     /nfs/ims/proj/drl/mrt1/users/gregory/Harmony/enhanced/drlvm/trunk/vm/thread/src/thread_native_basic.c:712
> #4  0xb7bb0ed4 in dummy_worker (opaque=0x0) at threadproc/unix/thread.c:138
> #5  0xb7b65341 in start_thread () from lib/tls/i686/cmov/libpthread.so.0
> #6  0xb7af94ee in clone () from /lib/tls/i686/cmov/libc.so.6
>
> In jsig_handler a struct of type sigaction is copied
>
>     act = saved_sigaction[sig];
>
> and gcc replaces this statement with a call to memcpy it seems. But the
> parameter sig is quite weird if you look at it. It is sig=-1215196204...
> Now if I could only find where and this sig happened there... I cannot
> find it in the depth of classlib native code this late at night.
>
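
On the interrupted hysem_wait reported at the top of this message: a common
POSIX idiom is to restart the wait whenever it fails with EINTR, so that a
signal such as SIGUSR2 is not mistaken for a real wakeup. The sketch below
assumes the semaphore is a POSIX sem_t. sem_wait_retry is a hypothetical
helper name, not an existing classlib function, and how hysem_wait is
actually implemented is not shown in this thread.

```c
#include <errno.h>
#include <semaphore.h>

/* Hypothetical helper (not part of classlib): wait on a POSIX semaphore,
 * restarting the wait each time a signal handler interrupts it. */
static int sem_wait_retry(sem_t *sem)
{
    int rc;

    do {
        rc = sem_wait(sem);              /* returns -1 and sets errno on failure */
    } while (rc == -1 && errno == EINTR); /* interrupted by a signal: try again */

    return rc; /* 0 on success, -1 with errno set for any other error */
}
```

If the semaphore turns out to be a System V one, the same loop can be put
around semop(). Retrying is only appropriate when the signal is not meant to
cancel the wait; a caller that uses a signal for shutdown would need to check
its own flag before looping.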