I've been trying to reproduce the case where my jobs dump core when the finalizer doesn't return in time, but I've only been able to reproduce the previously posted stack trace, which leaves the job hanging in the "Sl" state (as reported by ps). I'm posting the stack trace again because I somehow didn't paste the entire stack trace for one of the threads. The full trace is included below.

Rodrigo, can you please follow up on the email I posted about not seeing anything in STDOUT/STDERR? My code doesn't seem to make it to where you are referring to.

Burkhard, what file system are you guys using on your cluster? NFS, Gluster, Lustre?

(gdb) thread apply all bt

Thread 3 (Thread 0x7fffebfff700 (LWP 2269)):
#0  0x00007fffeccca66c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000060c873 in mono_os_cond_wait (mutex=0x97e640 <lock>, cond=0x97e600 <work_cond>) at ../../mono/utils/mono-os-mutex.h:105
#2  thread_func (thread_data=0x0) at sgen-thread-pool.c:118
#3  0x00007fffeccc6806 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fffec80a9bd in clone () from /lib64/libc.so.6
#5  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7fffec637700 (LWP 2272)):
#0  0x00007fffec75ec8b in sigsuspend () from /lib64/libc.so.6
#1  0x000000000063cda6 in suspend_signal_handler (_dummy=<optimized out>, info=<optimized out>, context=0x7fffec633f80) at mono-threads-posix-signals.c:209
#2  <signal handler called>
#3  0x00007fffed8faf97 in open64 () from /lib64/ld-linux-x86-64.so.2
#4  0x00007fffed8ea82d in open_verify () from /lib64/ld-linux-x86-64.so.2
#5  0x00007fffed8ecca1 in _dl_map_object () from /lib64/ld-linux-x86-64.so.2
#6  0x00007fffed8f7400 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#7  0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#8  0x00007fffed8f6e3b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#9  0x00007fffecedcf9b in dlopen_doit () from /lib64/libdl.so.2
#10 0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#11 0x00007fffecedd33c in _dlerror_run () from /lib64/libdl.so.2
#12 0x00007fffecedcf01 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#13 0x0000000000631345 in mono_dl_open_file (file=<optimized out>, flags=<optimized out>) at mono-dl-posix.c:67
#14 0x0000000000630b79 in mono_dl_open (name=name@entry=0x19839c0 "/p/home/apps/unsupported/NAVAIR/build/mono-4.3.2/lib/libSystem.Data.dll.so", flags=flags@entry=1, error_msg=error_msg@entry=0x7fffec634e80) at mono-dl.c:150
#15 0x000000000054b9f0 in cached_module_load (name=name@entry=0x19839c0 "/p/home/apps/unsupported/NAVAIR/build/mono-4.3.2/lib/libSystem.Data.dll.so", err=err@entry=0x7fffec634e80, flags=1) at loader.c:1398
Python Exception <type 'exceptions.ValueError'> zero length field name in format:
#16 0x000000000054cc78 in mono_lookup_pinvoke_call (method=method@entry=, exc_class=exc_class@entry=0x7fffec635f00, exc_arg=exc_arg@entry=0x7fffec635f08) at loader.c:1641
Python Exception <type 'exceptions.ValueError'> zero length field name in format:
#17 0x0000000000562ce6 in mono_marshal_get_native_wrapper (method=method@entry=, check_exceptions=check_exceptions@entry=1, aot=0) at marshal.c:7396
Python Exception <type 'exceptions.ValueError'> zero length field name in format:
#18 0x0000000000452912 in mono_method_to_ir (cfg=cfg@entry=0x1984120, method=method@entry=, start_bblock=<optimized out>, start_bblock@entry=0x0, end_bblock=<optimized out>, end_bblock@entry=0x0, return_var=return_var@entry=0x0, inline_args=inline_args@entry=0x0, inline_offset=0, is_virtual_call=0) at method-to-ir.c:9280
Python Exception <type 'exceptions.ValueError'> zero length field name in format:
#19 0x00000000005097d9 in mini_method_compile (method=method@entry=, opts=opts@entry=370239999, domain=domain@entry=0x9d9e00, flags=flags@entry=JIT_FLAG_RUN_CCTORS, parts=parts@entry=0, aot_method_index=aot_method_index@entry=-1) at mini.c:3608
Python Exception <type 'exceptions.ValueError'> zero length field name in format:
#20 0x000000000050afb5 in mono_jit_compile_method_inner (method=method@entry=, target_domain=target_domain@entry=0x9d9e00, opt=opt@entry=370239999, jit_ex=jit_ex@entry=0x7fffec636678) at mini.c:4263
Python Exception <type 'exceptions.ValueError'> zero length field name in format:
#21 0x0000000000428458 in mono_jit_compile_method_with_opt (method=method@entry=, opt=370239999, ex=ex@entry=0x7fffec636678) at mini-runtime.c:1952
Python Exception <type 'exceptions.ValueError'> zero length field name in format:
#22 0x0000000000428c1b in mono_jit_compile_method (method=) at mini-runtime.c:2008
Python Exception <type 'exceptions.ValueError'> zero length field name in format:
#23 0x00000000004ad743 in common_call_trampoline_inner (regs=regs@entry=0x7fffec636890, code=code@entry=0x40244e34 "\270\001", m=m@entry=, vt=vt@entry=0x0, vtable_slot=<optimized out>, vtable_slot@entry=0x0) at mini-trampolines.c:694
Python Exception <type 'exceptions.ValueError'> zero length field name in format:
#24 0x00000000004adea0 in common_call_trampoline (regs=0x7fffec636890, code=0x40244e34 "\270\001", m=, vt=0x0, vtable_slot=0x0) at mini-trampolines.c:808
#25 0x0000000040000289 in ?? ()
#26 0x0000000000a35cc5 in ?? ()
#27 0x0000000040244e34 in ?? ()
#28 0x0000000000a35cc5 in ?? ()
#29 0x00007fffec6369d0 in ?? ()
#30 0x00007fffec636890 in ?? ()
#31 0x00007fffec637698 in ?? ()
#32 0x00007fffec67a188 in ?? ()
#33 0x00007fffec67a1a0 in ?? ()
#34 0x00007fffec636af0 in ?? ()
#35 0x0000000000000003 in ?? ()
#36 0x00007fffec6369d0 in ?? ()
#37 0x00007fffec636a60 in ?? ()
#38 0x0000000000000001 in ?? ()
#39 0x00007fffec67a188 in ?? ()
#40 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7fffedae7780 (LWP 2226)):
#0  0x00007fffecccd324 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fffeccc8684 in _L_lock_1091 () from /lib64/libpthread.so.0
#2  0x00007fffeccc84f6 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fffed8f6dcc in _dl_open () from /lib64/ld-linux-x86-64.so.2
#4  0x00007fffec842530 in do_dlopen () from /lib64/libc.so.6
#5  0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#6  0x00007fffec8425e5 in dlerror_run () from /lib64/libc.so.6
#7  0x00007fffec8426d7 in __libc_dlopen_mode () from /lib64/libc.so.6
#8  0x00007fffec81d2e5 in init () from /lib64/libc.so.6
#9  0x00007fffecccbd03 in pthread_once () from /lib64/libpthread.so.0
#10 0x00007fffec81d43c in backtrace () from /lib64/libc.so.6
#11 0x00000000004ac025 in mono_handle_native_sigsegv (signal=<optimized out>, ctx=<optimized out>, info=<optimized out>) at mini-exceptions.c:2309
#12 <signal handler called>
#13 0x00007fffec75e875 in raise () from /lib64/libc.so.6
#14 0x00007fffec75fe51 in abort () from /lib64/libc.so.6
#15 0x000000000064528a in monoeg_log_default_handler (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, message=0x17b4f20 "suspend_thread suspend took 200 ms, which is more than the allowed 200 ms", unused_data=0x0) at goutput.c:233
#16 0x0000000000645077 in monoeg_g_logv (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, format=0x7015d8 "suspend_thread suspend took %d ms, which is more than the allowed %d ms", args=0x7fffffffce48) at goutput.c:113
#17 0x000000000064512d in monoeg_g_log (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, format=0x7015d8 "suspend_thread suspend took %d ms, which is more than the allowed %d ms") at goutput.c:123
#18 0x000000000063a13f in mono_threads_wait_pending_operations () at mono-threads.c:238
#19 0x000000000063a8cd in suspend_sync (interrupt_kernel=1, tid=140737159329536) at mono-threads.c:877
#20 suspend_sync_nolock (interrupt_kernel=1, id=140737159329536) at mono-threads.c:892
#21 mono_thread_info_safe_suspend_and_run (id=140737159329536, interrupt_kernel=interrupt_kernel@entry=1, callback=callback@entry=0x58d5c0 <abort_thread_critical>, user_data=user_data@entry=0x7fffffffd3d0) at mono-threads.c:935
#22 0x0000000000591a86 in abort_thread_internal (thread=thread@entry=0x7fffec6e0230, install_async_abort=install_async_abort@entry=1, can_raise_exception=1) at threads.c:4728
#23 0x0000000000591b29 in mono_thread_internal_stop (thread=0x7fffec6e0230) at threads.c:2385
#24 0x00000000005b123e in mono_gc_cleanup () at gc.c:842
#25 0x00000000005aab8e in mono_runtime_cleanup (domain=domain@entry=0x9d9e00) at appdomain.c:356
#26 0x0000000000426c8b in mini_cleanup (domain=0x9d9e00) at mini-runtime.c:4017
#27 0x000000000047fac6 in mono_main (argc=11, argv=<optimized out>) at driver.c:2115
#28 0x0000000000424c68 in mono_main_with_options (argv=0x7fffffffd688, argc=11) at main.c:20
#29 main (argc=<optimized out>, argv=<optimized out>) at main.c:53

---
Glover E. George
Computer Scientist
Information Technology Laboratory
US Army Engineer Research and Development Center
Vicksburg, MS 39180
601-634-4730

On 6/2/16, 7:34 AM, "mono-devel-list-boun...@lists.ximian.com on behalf of Burkhard Linke" <mono-devel-list-boun...@lists.ximian.com on behalf of bli...@cebitec.uni-bielefeld.de> wrote:

>Hi,
>
>any updates on this? The bug affects the latest stable packages in the
>official xamarin repository, and nightly builds or building from source
>are not options.
>
>Regards,
>Burkhard
>
>On 05/19/2016 04:30 PM, Burkhard Linke wrote:
>> Hi,
>>
>> On 04/29/2016 04:12 PM, Rodrigo Kumpera wrote:
>>> This looks like a shutdown bug in mono.
>>>
>>> Do you have a reliable way to reproduce it?
>>> How loaded are the machines running your workload?
>>
>> We have encountered the same(?) bug on our compute cluster.
>> Applications process data, write output files, but do not terminate.
>>
>> (gdb) info threads
>>   Id   Target Id         Frame
>>   6    Thread 0x2b1f83200700 (LWP 63141) "mono" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>   5    Thread 0x2b1f84cf3700 (LWP 63142) "Finalizer" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
>>   4    Thread 0x2b1f87ee1700 (LWP 63143) "mono" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
>>   3    Thread 0x2b1f8c81d700 (LWP 63148) "Timer-Scheduler" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>>   2    Thread 0x2b1fe1133700 (LWP 63248) "mono" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>> * 1    Thread 0x2b1f81c98580 (LWP 63140) "mono" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>> (gdb) thread apply all bt
>>
>> Thread 6 (Thread 0x2b1f83200700 (LWP 63141)):
>> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>> #1  0x00000000005f9aec in ?? ()
>> #2  0x00002b1f8259b182 in start_thread (arg=0x2b1f83200700) at pthread_create.c:312
>> #3  0x00002b1f828ab47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>>
>> Thread 5 (Thread 0x2b1f84cf3700 (LWP 63142)):
>> #0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
>> #1  0x000000000061de28 in mono_sem_wait ()
>> #2  0x00000000005a2076 in ?? ()
>> #3  0x00000000005843d3 in ?? ()
>> #4  0x0000000000624666 in ?? ()
>> #5  0x00002b1f8259b182 in start_thread (arg=0x2b1f84cf3700) at pthread_create.c:312
>> #6  0x00002b1f828ab47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>>
>> Thread 4 (Thread 0x2b1f87ee1700 (LWP 63143)):
>> #0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
>> #1  0x00002b1f867ce29c in cl_thread_wait_for_thread_condition () from /usr/lib/gridengine-drmaa/lib/libdrmaa.so
>> #2  0x00002b1f867ce6d3 in cl_thread_wait_for_event () from /usr/lib/gridengine-drmaa/lib/libdrmaa.so
>> #3  0x00002b1f867b297f in ?? () from /usr/lib/gridengine-drmaa/lib/libdrmaa.so
>> #4  0x00002b1f8259b182 in start_thread (arg=0x2b1f87ee1700) at pthread_create.c:312
>> #5  0x00002b1f828ab47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>>
>> Thread 3 (Thread 0x2b1f8c81d700 (LWP 63148)):
>> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>> #1  0x00000000005fef47 in ?? ()
>> #2  0x000000000061101b in ?? ()
>> #3  0x000000000058415e in ?? ()
>> #4  0x0000000000585309 in ?? ()
>> #5  0x0000000041806ecd in ?? ()
>> #6  0x00002b1f90004990 in ?? ()
>> #7  0xffffffffffffffff in ?? ()
>> #8  0x7fffffffffffffff in ?? ()
>> #9  0x00002b1f82e1b1b0 in ?? ()
>> #10 0xffffffffffffffff in ?? ()
>> #11 0x00002b1f90004880 in ?? ()
>> #12 0x0000000041806e4a in ?? ()
>> #13 0x00002b1f8c81c780 in ?? ()
>> #14 0x00002b1f8c81c6f0 in ?? ()
>> /build/buildd/gdb-7.7.1/gdb/dwarf2-frame.c:692: internal-error: Unknown CFI encountered.
>> A problem internal to GDB has been detected,
>> further debugging may prove unreliable.
>> Quit this debugging session? (y or n)
>>
>> (The gdb crash might or might not be part of the problem).
>>
>> OS is Ubuntu 14.04, with mono from the xamarin repositories:
>> # mono --version
>> Mono JIT compiler version 4.2.3 (Stable 4.2.3.4/832de4b Wed Mar 16 13:19:08 UTC 2016)
>> Copyright (C) 2002-2014 Novell, Inc, Xamarin Inc and Contributors. www.mono-project.com
>>         TLS:           __thread
>>         SIGSEGV:       altstack
>>         Notifications: epoll
>>         Architecture:  amd64
>>         Disabled:      none
>>         Misc:          softdebug
>>         LLVM:          supported, not enabled.
>>         GC:            sgen
>>
>> The process is still running if you need further debugging
>> information. The problem does not affect all instances, but about 20%
>> of them. It thus cannot be reproduced reliably.
>>
>> Regards,
>> Burkhard
>> _______________________________________________
>> Mono-devel-list mailing list
>> Mono-devel-list@lists.ximian.com
>> http://lists.ximian.com/mailman/listinfo/mono-devel-list