Just realized I forgot one possibly important point. All I/O is now being done in /dev/shm (tmpfs) so any clustered file system load/performance issues should not affect the code.
Cheers. From: <mono-devel-list-boun...@lists.ximian.com<mailto:mono-devel-list-boun...@lists.ximian.com>> on behalf of Glover George <glover.e.geo...@erdc.dren.mil<mailto:glover.e.geo...@erdc.dren.mil>> Date: Tuesday, June 21, 2016 at 11:30 AM To: Rodrigo Kumpera <kump...@gmail.com<mailto:kump...@gmail.com>> Cc: "mono-devel-list@lists.ximian.com<mailto:mono-devel-list@lists.ximian.com>" <mono-devel-list@lists.ximian.com<mailto:mono-devel-list@lists.ximian.com>> Subject: Re: [Mono-dev] FW: Random hangs while running mono app Hi Rodrigo, I finally can semi-reliably (most of the time) reproduce these hung processes. The strace, gdb backtrace, and CUE cards for a single hung mono instance is shown below. 1. I am now running on mono-4.4.0. The hanging issue still exists. 2. The code is compiled in VS 2015 and run under mono. 3. I changed the code in mono-threads.c (line 72) to use 10 seconds instead of the 200ms just to test if there was a timing issue in mono_threads_wait_pending_operations. You can see below that this timer expires as well. My guess is something is deadlocked waiting on a condition to occur that isn’t going to occur. Finally, can you give me an overall picture of what is occuring so I might be able to navigate the source a bit better. There are three threads in the backtrace. My best guess was that one is the GC, one is the Main Thread, and one is possibly a Finalizer, but I’m not sure. I’m digging through mono/utils/mono-threads.c and related code. Also, the CUE CARD state is 105, which according to the CUE CARD itself says : 0x?05 - async suspend requested (BAD) Under what cirumstances would an async suspend be requested in this context? Thanks again for your help with this issue. Cheers. strace of hung process/threads: ----------------------------------------------------------------------------------------------- ggeorge 64515 64382 64515 31 3 17:06 ? 00:08:37 mono --runtime=v4.0.30319 mycode.exe ggeorge 64515 64382 64574 0 3 17:06 ? 00:00:00 mono --runtime=v4.0.30319 mycode.exe ggeorge 64515 64382 64596 0 3 17:06 ? 00:00:06 mono --runtime=v4.0.30319 mycode.exe ggeorge 67340 67160 67340 0 1 17:33 pts/0 00:00:00 grep mono ggeorge@r8i2n17:~> strace -p 64515 Process 64515 attached - interrupt to quit futex(0x7fffedb05968, FUTEX_WAIT_PRIVATE, 2, NULL^C <unfinished ...> Process 64515 detached ggeorge@r8i2n17:~> strace -p 64382 Process 64382 attached - interrupt to quit wait4(-1, ^C <unfinished ...> Process 64382 detached ggeorge@r8i2n17:~> strace -p 64574 Process 64574 attached - interrupt to quit futex(0x97f604, FUTEX_WAIT_PRIVATE, 523, NULL^C <unfinished ...> Process 64574 detached ggeorge@r8i2n17:~> strace -p 64596 Process 64596 attached - interrupt to quit rt_sigsuspend(~[PWR RTMIN RT_1 RT_5]^C <unfinished ...> Process 64596 detached ggeorge@r8i2n17:~> gdb backtrace of hung process/threads: ----------------------------------------------------------------------------------------------- Thread 3 (Thread 0x7fffebfff700 (LWP 64574)): #0 0x00007fffecd3d66c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000060db93 in mono_os_cond_wait (mutex=0x97f640 <lock>, cond=0x97f600 <work_cond>) at ../../mono/utils/mono-os-mutex.h:105 #2 thread_func (thread_data=0x0) at sgen-thread-pool.c:118 #3 0x00007fffecd39806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007fffec87d9bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () Thread 2 (Thread 0x7fffd8093700 (LWP 64596)): #0 0x00007fffec7d1c8b in sigsuspend () from /lib64/libc.so.6 #1 0x000000000063e3e6 in suspend_signal_handler (_dummy=<optimized out>, info=<optimized out>, context=0x7fffd808ff80) at mono-threads-posix-signals.c:209 #2 <signal handler called> #3 0x00007fffed8faf97 in open64 () from /lib64/ld-linux-x86-64.so.2 #4 0x00007fffed8ea82d in open_verify () from /lib64/ld-linux-x86-64.so.2 #5 0x00007fffed8ecca1 in _dl_map_object () from /lib64/ld-linux-x86-64.so.2 #6 0x00007fffed8f7400 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2 #7 0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2 #8 0x00007fffed8f6e3b in _dl_open () from /lib64/ld-linux-x86-64.so.2 #9 0x00007fffecf4ff9b in dlopen_doit () from /lib64/libdl.so.2 #10 0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2 #11 0x00007fffecf5033c in _dlerror_run () from /lib64/libdl.so.2 #12 0x00007fffecf4ff01 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2 #13 0x00000000006327f5 in mono_dl_open_file (file=<optimized out>, flags=<optimized out>) at mono-dl-posix.c:67 #14 0x0000000000632029 in mono_dl_open (name=name@entry=0x1c604d0 "/p/home/apps/unsupported/NAVAIR/build/mono-4.4.0/lib/mono/gac/System.Data/4.0.0.0__b77a5c561934e089/libSystem.Data.dll", flags=flags@entry=1, error_msg=error_msg@entry=0x7fffd8090e80) at mono-dl.c:150 #15 0x000000000054cf90 in cached_module_load (name=name@entry=0x1c604d0 "/p/home/apps/unsupported/NAVAIR/build/mono-4.4.0/lib/mono/gac/System.Data/4.0.0.0__b77a5c561934e089/libSystem.Data.dll", err=err@entry=0x7fffd8090e80, flags=1) at loader.c:1402 Python Exception <type 'exceptions.ValueError'> zero length field name in format: #16 0x000000000054e218 in mono_lookup_pinvoke_call (method=method@entry=, exc_class=exc_class@entry=0x7fffd8091f00, exc_arg=exc_arg@entry=0x7fffd8091f08) at loader.c:1645 Python Exception <type 'exceptions.ValueError'> zero length field name in format: #17 0x00000000005641f6 in mono_marshal_get_native_wrapper (method=method@entry=, check_exceptions=check_exceptions@entry=1, aot=0) at marshal.c:7396 Python Exception <type 'exceptions.ValueError'> zero length field name in format: #18 0x00000000004602cf in mono_method_to_ir (cfg=cfg@entry=0x1c5f920, method=method@entry=, start_bblock=<optimized out>, start_bblock@entry=0x0, end_bblock=<optimized out>, end_bblock@entry=0x0, return_var=return_var@entry=0x0, inline_args=inline_args@entry=0x0, inline_offset=0, is_virtual_call=0) at method-to-ir.c:9280 Python Exception <type 'exceptions.ValueError'> zero length field name in format: #19 0x000000000050acf9 in mini_method_compile (method=method@entry=, opts=opts@entry=370239999, domain=domain@entry=0x9c91f0, flags=flags@entry=JIT_FLAG_RUN_CCTORS, parts=parts@entry=0, aot_method_index=aot_method_index@entry=-1) at mini.c:3608 Python Exception <type 'exceptions.ValueError'> zero length field name in format: #20 0x000000000050c4d5 in mono_jit_compile_method_inner (method=method@entry=, target_domain=target_domain@entry=0x9c91f0, opt=opt@entry=370239999, jit_ex=jit_ex@entry=0x7fffd8092678) at mini.c:4263 Python Exception <type 'exceptions.ValueError'> zero length field name in format: #21 0x00000000004287f8 in mono_jit_compile_method_with_opt (method=method@entry=, opt=370239999, ex=ex@entry=0x7fffd8092678) at mini-runtime.c:1952 Python Exception <type 'exceptions.ValueError'> zero length field name in format: #22 0x0000000000428fbb in mono_jit_compile_method (method=) at mini-runtime.c:2008 Python Exception <type 'exceptions.ValueError'> zero length field name in format: #23 0x00000000004aeb43 in common_call_trampoline_inner (regs=regs@entry=0x7fffd8092890, code=code@entry=0x401c3ee4 "\270\001", m=m@entry=, vt=vt@entry=0x0, vtable_slot=<optimized out>, vtable_slot@entry=0x0) at mini-trampolines.c:694 Python Exception <type 'exceptions.ValueError'> zero length field name in format: #24 0x00000000004af2a0 in common_call_trampoline (regs=0x7fffd8092890, code=0x401c3ee4 "\270\001", m=, vt=0x0, vtable_slot=0x0) at mini-trampolines.c:808 #25 0x0000000040000289 in ?? () #26 0x0000000000a25145 in ?? () #27 0x00000000401c3ee4 in ?? () #28 0x0000000000a25145 in ?? () #29 0x00007fffd80929d0 in ?? () #30 0x00007fffd8092890 in ?? () #31 0x00007fffd8093698 in ?? () #32 0x00007fffd80d6188 in ?? () #33 0x00007fffd80d61a0 in ?? () #34 0x00007fffd8092af0 in ?? () #35 0x0000000000000003 in ?? () #36 0x00007fffd80929d0 in ?? () #37 0x00007fffd8092a60 in ?? () #38 0x0000000000000001 in ?? () #39 0x00007fffd80d6188 in ?? () #40 0x0000000000000000 in ?? () Thread 1 (Thread 0x7fffedae7780 (LWP 64515)): #0 0x00007fffecd40324 in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007fffecd3b684 in _L_lock_1091 () from /lib64/libpthread.so.0 #2 0x00007fffecd3b4f6 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x00007fffed8f6dcc in _dl_open () from /lib64/ld-linux-x86-64.so.2 #4 0x00007fffec8b5530 in do_dlopen () from /lib64/libc.so.6 #5 0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2 #6 0x00007fffec8b55e5 in dlerror_run () from /lib64/libc.so.6 #7 0x00007fffec8b56d7 in __libc_dlopen_mode () from /lib64/libc.so.6 #8 0x00007fffec8902e5 in init () from /lib64/libc.so.6 #9 0x00007fffecd3ed03 in pthread_once () from /lib64/libpthread.so.0 #10 0x00007fffec89043c in backtrace () from /lib64/libc.so.6 #11 0x00000000004ad425 in mono_handle_native_sigsegv (signal=<optimized out>, ctx=<optimized out>, info=<optimized out>) at mini-exceptions.c:2309 #12 <signal handler called> #13 0x00007fffec7d1875 in raise () from /lib64/libc.so.6 #14 0x00007fffec7d2e51 in abort () from /lib64/libc.so.6 #15 0x00000000006468be in monoeg_log_default_handler (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, message=0x1c70ac0 "suspend_thread suspend took 10000 ms, which is more than the allowed 10000 ms", unused_data=0x0) at goutput.c:233 #16 0x00000000006466ab in monoeg_g_logv (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, format=0x703240 "suspend_thread suspend took %d ms, which is more than the allowed %d ms", args=0x7fffffffc588) at goutput.c:113 #17 0x0000000000646761 in monoeg_g_log (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, format=0x703240 "suspend_thread suspend took %d ms, which is more than the allowed %d ms") at goutput.c:123 #18 0x000000000063b77f in mono_threads_wait_pending_operations () at mono-threads.c:236 #19 0x000000000063bf0d in suspend_sync (interrupt_kernel=1, tid=140736817870592) at mono-threads.c:875 #20 suspend_sync_nolock (interrupt_kernel=1, id=140736817870592) at mono-threads.c:890 #21 mono_thread_info_safe_suspend_and_run (id=140736817870592, interrupt_kernel=interrupt_kernel@entry=1, callback=callback@entry=0x58ead0 <abort_thread_critical>, user_data=user_data@entry=0x7fffffffcb10) at mono-threads.c:933 #22 0x0000000000592f96 in abort_thread_internal (thread=thread@entry=0x7fffd813c230, install_async_abort=install_async_abort@entry=1, can_raise_exception=1) at threads.c:4728 #23 0x0000000000593039 in mono_thread_internal_stop (thread=0x7fffd813c230) at threads.c:2385 #24 0x00000000005b266e in mono_gc_cleanup () at gc.c:842 #25 0x00000000005abfbe in mono_runtime_cleanup (domain=domain@entry=0x9c91f0) at appdomain.c:356 #26 0x000000000042702b in mini_cleanup (domain=0x9c91f0) at mini-runtime.c:4017 #27 0x0000000000480dc6 in mono_main (argc=11, argv=<optimized out>) at driver.c:2119 #28 0x0000000000424cdf in mono_main_with_options (argv=0x7fffffffce78, argc=11) at main.c:28 #29 main (argc=11, argv=0x7fffffffce78) at main.c:177 (gdb) STATE CUE CARD of above process: ----------------------------------------------------------------------------------------------- STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any number) 0x0 - starting (GOOD, unless the thread is running managed code) 0x1 - running (BAD, unless it's the gc thread) 0x2 - detached (GOOD, unless the thread is running managed code) 0x?03 - async suspended (GOOD) 0x?04 - self suspended (GOOD) 0x?05 - async suspend requested (BAD) 0x?06 - self suspend requested (BAD) 0x*07 - blocking (GOOD) 0x?08 - blocking with pending suspend (GOOD) --thread 0xa27f30 id 0x7fffd8093700 [(nil)] state 105 --thread 0x9c7a30 id 0x7fffedae7780 [(nil)] state 1 GC INITIATOR WAITING for 1 threads, got 0 suspended suspend_thread suspend took 10000 ms, which is more than the allowed 10000 ms Shutting down finalizer thread timed out. * Assertion at gc.c:867, condition `finalizer_thread_exited' not met From: Rodrigo Kumpera <kump...@gmail.com<mailto:kump...@gmail.com>> Date: Tuesday, June 7, 2016 at 2:07 PM To: Glover George <glover.e.geo...@erdc.dren.mil<mailto:glover.e.geo...@erdc.dren.mil>> Subject: Re: [Mono-dev] FW: Random hangs while running mono app There should be only one CUE card dump per crash. You posted a bunch of those. I need to know which CUR Card goes with with crash dump. -- Rodrigo On Tue, Jun 7, 2016 at 9:09 AM, George, Glover E ERDC-RDE-ITL-MS CIV <glover.e.geo...@erdc.dren.mil<mailto:glover.e.geo...@erdc.dren.mil>> wrote: What do you mean by proper logs? Was the email I sent with the CUE cards and the stack trace not correct? — — — Glover E. George Computer Scientist Information Technology Laboratory US Army Engineer Research and Development Center Vicksburg, MS 39180 601-634-4730<tel:601-634-4730> From: Rodrigo Kumpera <kump...@gmail.com<mailto:kump...@gmail.com>> Date: Tuesday, June 7, 2016 at 11:07 AM To: Glover George <glover.e.geo...@erdc.dren.mil<mailto:glover.e.geo...@erdc.dren.mil>> Subject: Re: [Mono-dev] FW: Random hangs while running mono app Not really, without a test case and proper logs, nothing much we can do. On Tue, Jun 7, 2016 at 9:02 AM, George, Glover E ERDC-RDE-ITL-MS CIV <glover.e.geo...@erdc.dren.mil<mailto:glover.e.geo...@erdc.dren.mil>> wrote: Hi Rodrigo, To avoid spamming the list, I’m emailing you off list. Is there anything else I could be doing to help narrow down the problem? Cheers. — — — Glover E. George Computer Scientist Information Technology Laboratory US Army Engineer Research and Development Center Vicksburg, MS 39180 601-634-4730<tel:601-634-4730> From: Rodrigo Kumpera <kump...@gmail.com<mailto:kump...@gmail.com>> Date: Friday, June 3, 2016 at 3:02 PM To: Glover George <glover.e.geo...@erdc.dren.mil<mailto:glover.e.geo...@erdc.dren.mil>> Cc: "mono-devel-list@lists.ximian.com<mailto:mono-devel-list@lists.ximian.com>" <mono-devel-list@lists.ximian.com<mailto:mono-devel-list@lists.ximian.com>> Subject: Re: [Mono-dev] FW: Random hangs while running mono app Hi, The cue card output on itself is not useful. I need the cue card output with the matched thread dump as that allows me to see what's going on on the problematic thread. -- Rodrigo On Fri, Jun 3, 2016 at 12:42 PM, George, Glover E ERDC-RDE-ITL-MS CIV <glover.e.geo...@erdc.dren.mil<mailto:glover.e.geo...@erdc.dren.mil>> wrote: Hi Rodrigo, I didn¹t realize our previous emails were off thread. I was able to get the STAT CUE CARD output. See below: STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any number) 0x0 - starting (GOOD, unless the thread is running managed code) 0x1 - running (BAD, unless it's the gc thread) 0x2 - detached (GOOD, unless the thread is running managed code) 0x?03 - async suspended (GOOD) 0x?04 - self suspended (GOOD) 0x?05 - async suspend requested (BAD) 0x?06 - self suspend requested (BAD) 0x*07 - blocking (GOOD) 0x?08 - blocking with pending suspend (GOOD) --thread 0xa38b30 id 0x7fffec637700 [(nil)] state 105 --thread 0x9d8640 id 0x7fffedae7780 [(nil)] state 1 GC INITIATOR WAITING for 1 threads, got 0 suspended suspend_thread suspend took 200 ms, which is more than the allowed 200 ms STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any number) 0x0 - starting (GOOD, unless the thread is running managed code) 0x1 - running (BAD, unless it's the gc thread) 0x2 - detached (GOOD, unless the thread is running managed code) 0x?03 - async suspended (GOOD) 0x?04 - self suspended (GOOD) 0x?05 - async suspend requested (BAD) 0x?06 - self suspend requested (BAD) 0x*07 - blocking (GOOD) 0x?08 - blocking with pending suspend (GOOD) --thread 0xa38b30 id 0x7fffec637700 [(nil)] state 105 --thread 0x9d8640 id 0x7fffedae7780 [(nil)] state 1 GC INITIATOR WAITING for 1 threads, got 0 suspended suspend_thread suspend took 200 ms, which is more than the allowed 200 ms STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any number) 0x0 - starting (GOOD, unless the thread is running managed code) 0x1 - running (BAD, unless it's the gc thread) 0x2 - detached (GOOD, unless the thread is running managed code) 0x?03 - async suspended (GOOD) 0x?04 - self suspended (GOOD) 0x?05 - async suspend requested (BAD) 0x?06 - self suspend requested (BAD) 0x*07 - blocking (GOOD) 0x?08 - blocking with pending suspend (GOOD) --thread 0xa38b30 id 0x7fffec637700 [(nil)] state 105 --thread 0x9d8640 id 0x7fffedae7780 [(nil)] state 1 GC INITIATOR WAITING for 1 threads, got 0 suspended suspend_thread suspend took 200 ms, which is more than the allowed 200 ms STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any number) 0x0 - starting (GOOD, unless the thread is running managed code) 0x1 - running (BAD, unless it's the gc thread) 0x2 - detached (GOOD, unless the thread is running managed code) 0x?03 - async suspended (GOOD) 0x?04 - self suspended (GOOD) 0x?05 - async suspend requested (BAD) 0x?06 - self suspend requested (BAD) 0x*07 - blocking (GOOD) 0x?08 - blocking with pending suspend (GOOD) --thread 0xa38b30 id 0x7fffec637700 [(nil)] state 105 --thread 0x9d8640 id 0x7fffedae7780 [(nil)] state 1 GC INITIATOR WAITING for 1 threads, got 0 suspended suspend_thread suspend took 200 ms, which is more than the allowed 200 ms STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any number) 0x0 - starting (GOOD, unless the thread is running managed code) 0x1 - running (BAD, unless it's the gc thread) 0x2 - detached (GOOD, unless the thread is running managed code) 0x?03 - async suspended (GOOD) 0x?04 - self suspended (GOOD) 0x?05 - async suspend requested (BAD) 0x?06 - self suspend requested (BAD) 0x*07 - blocking (GOOD) 0x?08 - blocking with pending suspend (GOOD) --thread 0xa38b30 id 0x7fffec637700 [(nil)] state 105 --thread 0x9d8640 id 0x7fffedae7780 [(nil)] state 1 GC INITIATOR WAITING for 1 threads, got 0 suspended suspend_thread suspend took 200 ms, which is more than the allowed 200 ms STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any number) 0x0 - starting (GOOD, unless the thread is running managed code) 0x1 - running (BAD, unless it's the gc thread) 0x2 - detached (GOOD, unless the thread is running managed code) 0x?03 - async suspended (GOOD) 0x?04 - self suspended (GOOD) 0x?05 - async suspend requested (BAD) 0x?06 - self suspend requested (BAD) 0x*07 - blocking (GOOD) 0x?08 - blocking with pending suspend (GOOD) --thread 0xa38b30 id 0x7fffec637700 [(nil)] state 105 --thread 0x9d8640 id 0x7fffedae7780 [(nil)] state 1 GC INITIATOR WAITING for 1 threads, got 0 suspended suspend_thread suspend took 200 ms, which is more than the allowed 200 ms STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any number) 0x0 - starting (GOOD, unless the thread is running managed code) 0x1 - running (BAD, unless it's the gc thread) 0x2 - detached (GOOD, unless the thread is running managed code) 0x?03 - async suspended (GOOD) 0x?04 - self suspended (GOOD) 0x?05 - async suspend requested (BAD) 0x?06 - self suspend requested (BAD) 0x*07 - blocking (GOOD) 0x?08 - blocking with pending suspend (GOOD) --thread 0xa38b30 id 0x7fffec637700 [(nil)] state 105 --thread 0x9d8640 id 0x7fffedae7780 [(nil)] state 1 GC INITIATOR ‹ ‹ ‹ Glover E. George Computer Scientist Information Technology Laboratory US Army Engineer Research and Development Center Vicksburg, MS 39180 601-634-4730<tel:601-634-4730> From: Rodrigo Kumpera <kump...@gmail.com<mailto:kump...@gmail.com>> Date: Friday, April 29, 2016 at 9:12 AM To: Glover George <glover.e.geo...@erdc.dren.mil<mailto:glover.e.geo...@erdc.dren.mil>> Cc: "mono-devel-list@lists.ximian.com<mailto:mono-devel-list@lists.ximian.com>" <mono-devel-list@lists.ximian.com<mailto:mono-devel-list@lists.ximian.com>> Subject: Re: [Mono-dev] FW: Random hangs while running mono app This looks like a shutdown bug in mono. Do you have a reliable way to reproduce it? How loaded are the machines running your workload? On Thu, Apr 28, 2016 at 6:41 PM, George, Glover E ERDC-RDE-ITL-MS CIV <glover.e.geo...@erdc.dren.mil<mailto:glover.e.geo...@erdc.dren.mil>> wrote: One more important point I forgot to mention, when this occurs, it is generally only a few mono instances out of the 500+ simultaneous mono instances. Most of them terminate. And more importantly, the mono instances that do hang actually complete their work! It seems to be when mono is trying to close down. This has occurred with mono 4.2.x, 4.3.x, 4.4.x, and current git. Happened today with this configuration: Mono JIT compiler version 4.3.2 (Stable 4.3.2.467/ba2e5e4 Tue Mar 15 11:44:06 CDT 2016) Copyright (C) 2002-2014 Novell, Inc, Xamarin Inc and Contributors. BlockedBlockedBlockedBlockedBlockedwww.mono-project.comBlockedBlockedBlockedBlockedBlocked <BlockedBlockedBlockedBlockedBlockedhttp://www.mono-project.comBlockedBlockedBlockedBlockedBlocked> TLS: __thread SIGSEGV: altstack Notifications: epoll Architecture: amd64 Disabled: none Misc: bigarrays softdebug LLVM: supported, not enabled. GC: sgen From: <mono-devel-list-boun...@lists.ximian.com<mailto:mono-devel-list-boun...@lists.ximian.com>> on behalf of Glover George <glover.e.geo...@erdc.dren.mil<mailto:glover.e.geo...@erdc.dren.mil>> Date: Thursday, April 28, 2016 at 5:25 PM To: "mono-devel-list@lists.ximian.com<mailto:mono-devel-list@lists.ximian.com>" <mono-devel-list@lists.ximian.com<mailto:mono-devel-list@lists.ximian.com>> Subject: [Mono-dev] Random hangs while running mono app Hi all, Quick background: We¹re running mono on an HPC platform (SGI ICE-X / SUSE Enterprise Linux 11) with a Lustre filesystem. For several months, I¹ve noticed random hangs with our application. In a given batch job, we run 500+ instances of mono in parallel (no interprocess communication). I haven¹t been able to reliably reproduce the issue which leads me to believe it¹s a timing issue, and more specifically, I have reason to believe it may be an issue with the interaction of mono and the filesystem. Being a clustered file-system, the load on the backing stores can vary due to other users. When I notice that my job hasn¹t finished in a reasonable amount of time, I log into the compute node and do a ³ ps -efL | grep mono ³ which reveals: george 16728 16575 16728 6 3 16:13 ? 00:03:58 mono --runtime=v4.0.30319 /app/MyConsole.exe ggeorge 16728 16575 16768 0 3 16:13 ? 00:00:02 mono --runtime=v4.0.30319 /app/MyConsole.exe ggeorge 16728 16575 16815 0 3 16:13 ? 00:00:04 mono --runtime=v4.0.30319 /app/MyConsole.exe 16728 is the top proc which spawns 16768 and 16815 (main, gc and maybe finalizer thread?). Attaching to each of these 3 pid¹s with gdb and doing a backtrace reveals the traces below. 1. Is it possible this is a GC hang? 2. Is it possible this is a race condition that has anything to do with the filesystem? 3. Any idea how this sequence could occur? If there¹s a way to gather more information, please let me know. The reason for the file system suspicion is my app writes several small files, times the number of mono instances (500+). It only happens randomly, but seems to be more frequent the more mono instances I run. Any help is HIGHLY appreciated. PID 16728 ‹‹‹ (gdb) bt #0 0x00007fffecccd324 in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007fffeccc8684 in _L_lock_1091 () from /lib64/libpthread.so.0 #2 0x00007fffeccc84f6 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x00007fffed8f6dcc in _dl_open () from /lib64/ld-linux-x86-64.so.2 #4 0x00007fffec842530 in do_dlopen () from /lib64/libc.so.6 #5 0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2 #6 0x00007fffec8425e5 in dlerror_run () from /lib64/libc.so.6 #7 0x00007fffec8426d7 in __libc_dlopen_mode () from /lib64/libc.so.6 #8 0x00007fffec81d2e5 in init () from /lib64/libc.so.6 #9 0x00007fffecccbd03 in pthread_once () from /lib64/libpthread.so.0 #10 0x00007fffec81d43c in backtrace () from /lib64/libc.so.6 #11 0x00000000004ac025 in mono_handle_native_sigsegv (signal=<optimized out>, ctx=<optimized out>, info=<optimized out>) at mini-exceptions.c:2309 #12 <signal handler called> #13 0x00007fffec75e875 in raise () from /lib64/libc.so.6 #14 0x00007fffec75fe51 in abort () from /lib64/libc.so.6 #15 0x000000000064528a in monoeg_log_default_handler (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, message= 0x1ac7660 "suspend_thread suspend took 200 ms, which is more than the allowed 200 ms", unused_data=0x0) at goutput.c:233 #16 0x0000000000645077 in monoeg_g_logv (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, format= 0x7015d8 "suspend_thread suspend took %d ms, which is more than the allowed %d ms", args=0x7fffffffce58) at goutput.c:113 #17 0x000000000064512d in monoeg_g_log (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, format= 0x7015d8 "suspend_thread suspend took %d ms, which is more than the allowed %d ms") at goutput.c:123 #18 0x000000000063a13f in mono_threads_wait_pending_operations () at mono-threads.c:238 #19 0x000000000063a8cd in suspend_sync (interrupt_kernel=1, tid=140737159329536) at mono-threads.c:877 #20 suspend_sync_nolock (interrupt_kernel=1, id=140737159329536) at mono-threads.c:892 #21 mono_thread_info_safe_suspend_and_run (id=140737159329536, interrupt_kernel=interrupt_kernel@entry=1, callback=callback@entry= 0x58d5c0 <abort_thread_critical>, user_data=user_data@entry=0x7fffffffd3e0) at mono-threads.c:935 #22 0x0000000000591a86 in abort_thread_internal (thread=thread@entry=0x7fffec6e0230, install_async_abort=install_async_abort@entry=1, can_raise_exception=1) at threads.c:4728 #23 0x0000000000591b29 in mono_thread_internal_stop (thread=0x7fffec6e0230) at threads.c:2385 ---Type <return> to continue, or q <return> to quit--- #24 0x00000000005b123e in mono_gc_cleanup () at gc.c:842 #25 0x00000000005aab8e in mono_runtime_cleanup (domain=domain@entry=0x9d9e00) at appdomain.c:356 #26 0x0000000000426c8b in mini_cleanup (domain=0x9d9e00) at mini-runtime.c:4017 #27 0x000000000047fac6 in mono_main (argc=11, argv=<optimized out>) at driver.c:2115 #28 0x0000000000424c68 in mono_main_with_options (argv=0x7fffffffd698, argc=11) at main.c:20 #29 main (argc=<optimized out>, argv=<optimized out>) at main.c:53 Thread 16768 ‹‹‹ (gdb) bt #0 0x00007fffeccca66c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000060c873 in mono_os_cond_wait (mutex=0x97e640 <lock>, cond=0x97e600 <work_cond>) at ../../mono/utils/mono-os-mutex.h:105 #2 thread_func (thread_data=0x0) at sgen-thread-pool.c:118 #3 0x00007fffeccc6806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007fffec80a9bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () (gdb) Thread 16815 ‹‹‹‹ Thread #0 0x00007fffec75ec8b in sigsuspend () from /lib64/libc.so.6 #1 0x000000000063cda6 in suspend_signal_handler (_dummy=<optimized out>, info=<optimized out>, context=0x7fffec633dc0) at mono-threads-posix-signals.c:209 #2 <signal handler called> #3 0x00007fffed8faf97 in open64 () from /lib64/ld-linux-x86-64.so.2 #4 0x00007fffed8ea82d in open_verify () from /lib64/ld-linux-x86-64.so.2 #5 0x00007fffed8eade0 in open_path () from /lib64/ld-linux-x86-64.so.2 #6 0x00007fffed8ece7c in _dl_map_object () from /lib64/ld-linux-x86-64.so.2 #7 0x00007fffed8f7400 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2 #8 0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2 #9 0x00007fffed8f6e3b in _dl_open () from /lib64/ld-linux-x86-64.so.2 #10 0x00007fffecedcf9b in dlopen_doit () from /lib64/libdl.so.2 #11 0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2 #12 0x00007fffecedd33c in _dlerror_run () from /lib64/libdl.so.2 #13 0x00007fffecedcf01 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2 #14 0x0000000000630b79 in mono_dl_open (name=name@entry=0x1c02790 "libSystem.Data", flags=flags@entry=1, error_msg=error_msg@entry= 0x7fffec634e80) at mono-dl.c:150 #15 0x000000000054b9f0 in cached_module_load (name=name@entry=0x1c02790 "libSystem.Data", err=err@entry=0x7fffec634e80, flags=1) at loader.c:1398 _______________________________________________ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com<mailto:Mono-devel-list@lists.ximian.com> Blockedhttp://lists.ximian.com/mailman/listinfo/mono-devel-listBlocked
_______________________________________________ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list