Thanks for the reply Azat, After a long investigation we discovered a bad memory module in the ESX server. Once that stick was replaced, we never saw the problem again. The ESX was running about 15,000 instances of our app in a test environment and we would see the crash on just a few every night. Never saw any other app or VM os crash, so I guess our app was the only one hitting that bad memory (statistically makes sense). Thanks memtestx86.
-Bill On Sat, Oct 31, 2015 at 12:36 PM, Azat Khuzhin <[email protected]> wrote: > On Wed, Feb 25, 2015 at 09:25:03AM -0500, Bill Vaughan wrote: > > Hi all, > > > > We have been seeing a fairly hard to reproduce problem with crashes > inside > > of libevent in our overnight scale tests. We run 15,000 clients in an > ESX > > environment and have been getting one or two crashes per night. We are > > running libevent 2.0.21 on this test system. We have moved to 2.0.22 on > our > > development branch. > > > > The application is relatively large and complex and it is quite possible > we > > have a memory trampler, but valgrind tests so far are clean and code > > inspection hasn't found anything. > > What valgrind tools do you use? > > AFAIK valgrind (at least memcheck) limited to one thread, i.e. it > emulates multi-threading (and you have multi-threaded app): > > http://valgrind.org/docs/manual/manual-core.html#manual-core.pthreads_perf_sched > The main thing to point out with respect to threaded programs is > that your program will use the native threading library, but > Valgrind serialises execution so that only one (kernel) thread is > running at a time. This approach avoids the horrible implementation > problems of implementing a truly multithreaded version of Valgrind, > but it does mean that threaded apps never use more than one CPU > simultaneously, even if you have a multiprocessor or multicore > machine. > > Can you try to use gcc/clang asan/tsan instead: > $ gcc -fsanitize=thread ... > $ gcc -fsanitize=address ... > > > We recently ran the overnight test with libevent debugging enabled > > (event_enable_debug_mode(), evthread_enable_lock_debuging()) and had the > > following assert in the libevent debug hash table: > > > > > > [err] event_del_internal: noting a del on a non-setup event 0xb56248a0 > > (events: 0x0, fd: -1, flags: 0x80) > > === Command detached from window (Mon Feb 23 23:13:14 2015) === > > === Command terminated with signal 6 (core file generated) (Mon Feb 23 > > 23:13:14 2015) === > > > > > > Using host libthread_db library "/lib/i386-linux-gnu/libthread_db.so.1". > > Core was generated by `./vtc_app -Script-Client29-333'. > > Program terminated with signal 6, Aborted. > > #0 0xb775e424 in __kernel_vsyscall () > > (gdb) where > > #0 0xb775e424 in __kernel_vsyscall () > > #1 0xb747a1df in raise () from /lib/i386-linux-gnu/libc.so.6 > > #2 0xb747d825 in abort () from /lib/i386-linux-gnu/libc.so.6 > > #3 0x0814bc31 in event_exit (errcode=-559030611) at log.c:79 > > #4 0x0814bd53 in event_errx (eval=-559030611, > > fmt=0x82105b0 "%s: noting a del on a non-setup event %p (events: > 0x%x, > > fd: %d, flags: 0x%x)") at log.c:136 > > #5 0x0813dc07 in event_del_internal (ev=<optimized out>) at event.c:2265 > > #6 event_process_active_single_queue (activeq=<optimized out>, > > base=<optimized out>) at event.c:1324 > > #7 event_process_active (base=<optimized out>) at event.c:1420 > > #8 event_base_loop (base=0xb6a09d58, flags=2) at event.c:1621 > > #9 0x08073960 in VtcCommMgr::ThreadLoop (this=0xacdcc008) at > > vtc_app/vtc_comm_mgr.cpp:844 > > #10 0xb772cd4c in start_thread () from > /lib/i386-linux-gnu/libpthread.so.0 > > #11 0xb753ad3e in clone () from /lib/i386-linux-gnu/libc.so.6 > > > > > > So looks like an event fired that was not in the hash table, which should > > only happen if the event was freed, from what I can tell. This points to > a > > clear application bug, however, there is a bit more to the store. > > > > Examining the libevent hash table that tracks the events, we found this: > > > > == 667 0xb5e248f0 == > > {node = {hte_next = 0x0}, ptr = 0x0, added = 0} > > global_debug_map.hth_table[666].node.hte_next > > == 668 0xb5600858 == > > {node = {hte_next = 0x0}, ptr = 0xb5600808, added = 1} > > global_debug_map.hth_table[667].node.hte_next > > == 669 0xb5624950 == > > {node = {hte_next = 0x0}, ptr = 0xb5624900, added = 1} > > global_debug_map.hth_table[668].node.hte_next > > == 676 0x97a5390 == > > {node = {hte_next = 0x0}, ptr = 0x97a50a8, added = 0} > > global_debug_map.hth_table[669].node.hte_next > > > > > > > > Bucket 667 has an entry, but the pointer to the event is 0. The hash > entry > > should be deleted (mm_free()) when the event is freed, and have a valid > > event pointer otherwise, so the event likely was not freed. All of the > > other entries in the hash look good. The event that is currently being > > If you are writing in the area of memory into which you must not write, > then you can't trust all this, since this can be a side effect, not the > core issue. > > > processed in the backtrace, 0xb56248a0, maps to bucket 667 based on the > > hash algorithm: ((0xb56248a0 >> 6) % 769) == 667, which is our bucket. > > This appears to be some sort of corruption if I am understanding the hash > > table logic correctly. > > > > The event base was good and our application only calls event_free() and > not > > any of the other APIs that could free an event from the hash > > (event_reinit() and a couple others). > > > > I see in the 2.0.22 release notes that a race was fixed in > event_active(). > > Do you think that could cause this? Our next step is to move to 2.0.22 on > > our test system to see if the problem still happens. However, this branch > > of code is very close to a release, so we are very cautious about > changing > > things without a clear cause and effect. > > Not sure, but of course this could impact. > And since it's been awhile maybe you already tried 2.0.22? > > > We have had several engineers reviewing the application code and so far > > have not seen any invalid libevent usages, and as I said, valgrind is > > clean. Clearly we are doing something wrong in the application or have > > found some sort of edge case within the library. And libevent 2.0.21 has > > been very stable as I have used it on this project and others. > > > > The application is complicated enough that I can't easily provide code > > snippets that are meaningful, and since we are not sure how to reproduce > > the problem, I don't have any test code that can reproduce it. > > > > All of the clients are running under Ubuntu 12.04. > > > > Has anyone else seen anything like this? > > Sorry for such a long delay. > *********************************************************************** > To unsubscribe, send an e-mail to [email protected] with > unsubscribe libevent-users in the body. >
