Hi all, Quick background: We’re running mono on an HPC platform (SGI ICE-X / SUSE Enterprise Linux 11) with a Lustre filesystem. For several months, I’ve noticed random hangs with our application. In a given batch job, we run 500+ instances of mono in parallel (no interprocess communication). I haven’t been able to reliably reproduce the issue which leads me to believe it’s a timing issue, and more specifically, I have reason to believe it may be an issue with the interaction of mono and the filesystem. Being a clustered file-system, the load on the backing stores can vary due to other users.
When I notice that my job hasn’t finished in a reasonable amount of time, I log into the compute node and do a “ ps -efL | grep mono “ which reveals: george 16728 16575 16728 6 3 16:13 ? 00:03:58 mono --runtime=v4.0.30319 /app/MyConsole.exe ggeorge 16728 16575 16768 0 3 16:13 ? 00:00:02 mono --runtime=v4.0.30319 /app/MyConsole.exe ggeorge 16728 16575 16815 0 3 16:13 ? 00:00:04 mono --runtime=v4.0.30319 /app/MyConsole.exe 16728 is the top proc which spawns 16768 and 16815 (main, gc and maybe finalizer thread?). Attaching to each of these 3 pid’s with gdb and doing a backtrace reveals the traces below. 1. Is it possible this is a GC hang? 2. Is it possible this is a race condition that has anything to do with the filesystem? 3. Any idea how this sequence could occur? If there’s a way to gather more information, please let me know. The reason for the file system suspicion is my app writes several small files, times the number of mono instances (500+). It only happens randomly, but seems to be more frequent the more mono instances I run. Any help is HIGHLY appreciated. PID 16728 ——— (gdb) bt #0 0x00007fffecccd324 in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007fffeccc8684 in _L_lock_1091 () from /lib64/libpthread.so.0 #2 0x00007fffeccc84f6 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x00007fffed8f6dcc in _dl_open () from /lib64/ld-linux-x86-64.so.2 #4 0x00007fffec842530 in do_dlopen () from /lib64/libc.so.6 #5 0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2 #6 0x00007fffec8425e5 in dlerror_run () from /lib64/libc.so.6 #7 0x00007fffec8426d7 in __libc_dlopen_mode () from /lib64/libc.so.6 #8 0x00007fffec81d2e5 in init () from /lib64/libc.so.6 #9 0x00007fffecccbd03 in pthread_once () from /lib64/libpthread.so.0 #10 0x00007fffec81d43c in backtrace () from /lib64/libc.so.6 #11 0x00000000004ac025 in mono_handle_native_sigsegv (signal=<optimized out>, ctx=<optimized out>, info=<optimized out>) at mini-exceptions.c:2309 #12 <signal handler called> #13 0x00007fffec75e875 in raise () from /lib64/libc.so.6 #14 0x00007fffec75fe51 in abort () from /lib64/libc.so.6 #15 0x000000000064528a in monoeg_log_default_handler (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, message= 0x1ac7660 "suspend_thread suspend took 200 ms, which is more than the allowed 200 ms", unused_data=0x0) at goutput.c:233 #16 0x0000000000645077 in monoeg_g_logv (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, format= 0x7015d8 "suspend_thread suspend took %d ms, which is more than the allowed %d ms", args=0x7fffffffce58) at goutput.c:113 #17 0x000000000064512d in monoeg_g_log (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, format= 0x7015d8 "suspend_thread suspend took %d ms, which is more than the allowed %d ms") at goutput.c:123 #18 0x000000000063a13f in mono_threads_wait_pending_operations () at mono-threads.c:238 #19 0x000000000063a8cd in suspend_sync (interrupt_kernel=1, tid=140737159329536) at mono-threads.c:877 #20 suspend_sync_nolock (interrupt_kernel=1, id=140737159329536) at mono-threads.c:892 #21 mono_thread_info_safe_suspend_and_run (id=140737159329536, interrupt_kernel=interrupt_kernel@entry=1, callback=callback@entry= 0x58d5c0 <abort_thread_critical>, user_data=user_data@entry=0x7fffffffd3e0) at mono-threads.c:935 #22 0x0000000000591a86 in abort_thread_internal (thread=thread@entry=0x7fffec6e0230, install_async_abort=install_async_abort@entry=1, can_raise_exception=1) at threads.c:4728 #23 0x0000000000591b29 in mono_thread_internal_stop (thread=0x7fffec6e0230) at threads.c:2385 ---Type <return> to continue, or q <return> to quit--- #24 0x00000000005b123e in mono_gc_cleanup () at gc.c:842 #25 0x00000000005aab8e in mono_runtime_cleanup (domain=domain@entry=0x9d9e00) at appdomain.c:356 #26 0x0000000000426c8b in mini_cleanup (domain=0x9d9e00) at mini-runtime.c:4017 #27 0x000000000047fac6 in mono_main (argc=11, argv=<optimized out>) at driver.c:2115 #28 0x0000000000424c68 in mono_main_with_options (argv=0x7fffffffd698, argc=11) at main.c:20 #29 main (argc=<optimized out>, argv=<optimized out>) at main.c:53 Thread 16768 ——— (gdb) bt #0 0x00007fffeccca66c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000060c873 in mono_os_cond_wait (mutex=0x97e640 <lock>, cond=0x97e600 <work_cond>) at ../../mono/utils/mono-os-mutex.h:105 #2 thread_func (thread_data=0x0) at sgen-thread-pool.c:118 #3 0x00007fffeccc6806 in start_thread () from /lib64/libpthread.so.0 #4 0x00007fffec80a9bd in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () (gdb) Thread 16815 ———— Thread #0 0x00007fffec75ec8b in sigsuspend () from /lib64/libc.so.6 #1 0x000000000063cda6 in suspend_signal_handler (_dummy=<optimized out>, info=<optimized out>, context=0x7fffec633dc0) at mono-threads-posix-signals.c:209 #2 <signal handler called> #3 0x00007fffed8faf97 in open64 () from /lib64/ld-linux-x86-64.so.2 #4 0x00007fffed8ea82d in open_verify () from /lib64/ld-linux-x86-64.so.2 #5 0x00007fffed8eade0 in open_path () from /lib64/ld-linux-x86-64.so.2 #6 0x00007fffed8ece7c in _dl_map_object () from /lib64/ld-linux-x86-64.so.2 #7 0x00007fffed8f7400 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2 #8 0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2 #9 0x00007fffed8f6e3b in _dl_open () from /lib64/ld-linux-x86-64.so.2 #10 0x00007fffecedcf9b in dlopen_doit () from /lib64/libdl.so.2 #11 0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2 #12 0x00007fffecedd33c in _dlerror_run () from /lib64/libdl.so.2 #13 0x00007fffecedcf01 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2 #14 0x0000000000630b79 in mono_dl_open (name=name@entry=0x1c02790 "libSystem.Data", flags=flags@entry=1, error_msg=error_msg@entry= 0x7fffec634e80) at mono-dl.c:150 #15 0x000000000054b9f0 in cached_module_load (name=name@entry=0x1c02790 "libSystem.Data", err=err@entry=0x7fffec634e80, flags=1) at loader.c:1398
_______________________________________________ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list