Hi all,

We have been seeing a fairly hard to reproduce problem with crashes inside
of libevent in our overnight scale tests.  We run 15,000 clients in an  ESX
environment and have been getting one or two crashes per night. We are
running libevent 2.0.21 on this test system. We have moved to 2.0.22 on our
development branch.

The application is relatively large and complex and it is quite possible we
have a memory trampler, but valgrind tests so far are clean and code
inspection hasn't found anything.

We recently ran the overnight test with libevent debugging enabled
(event_enable_debug_mode(), evthread_enable_lock_debuging()) and had the
following assert in the libevent debug hash table:


[err] event_del_internal: noting a del on a non-setup event 0xb56248a0
(events: 0x0, fd: -1, flags: 0x80)
=== Command detached from window (Mon Feb 23 23:13:14 2015) ===
=== Command terminated with signal 6 (core file generated) (Mon Feb 23
23:13:14 2015) ===


Using host libthread_db library "/lib/i386-linux-gnu/libthread_db.so.1".
Core was generated by `./vtc_app -Script-Client29-333'.
Program terminated with signal 6, Aborted.
#0  0xb775e424 in __kernel_vsyscall ()
(gdb) where
#0  0xb775e424 in __kernel_vsyscall ()
#1  0xb747a1df in raise () from /lib/i386-linux-gnu/libc.so.6
#2  0xb747d825 in abort () from /lib/i386-linux-gnu/libc.so.6
#3  0x0814bc31 in event_exit (errcode=-559030611) at log.c:79
#4  0x0814bd53 in event_errx (eval=-559030611,
    fmt=0x82105b0 "%s: noting a del on a non-setup event %p (events: 0x%x,
fd: %d, flags: 0x%x)") at log.c:136
#5  0x0813dc07 in event_del_internal (ev=<optimized out>) at event.c:2265
#6  event_process_active_single_queue (activeq=<optimized out>,
base=<optimized out>) at event.c:1324
#7  event_process_active (base=<optimized out>) at event.c:1420
#8  event_base_loop (base=0xb6a09d58, flags=2) at event.c:1621
#9  0x08073960 in VtcCommMgr::ThreadLoop (this=0xacdcc008) at
vtc_app/vtc_comm_mgr.cpp:844
#10 0xb772cd4c in start_thread () from /lib/i386-linux-gnu/libpthread.so.0
#11 0xb753ad3e in clone () from /lib/i386-linux-gnu/libc.so.6


So looks like an event fired that was not in the hash table, which should
only happen if the event was freed, from what I can tell. This points to a
clear application bug, however, there is a bit more to the store.

Examining the libevent hash table that tracks the events, we found this:

== 667 0xb5e248f0 ==
     {node = {hte_next = 0x0}, ptr = 0x0, added = 0}
global_debug_map.hth_table[666].node.hte_next
== 668 0xb5600858 ==
     {node = {hte_next = 0x0}, ptr = 0xb5600808, added = 1}
global_debug_map.hth_table[667].node.hte_next
== 669 0xb5624950 ==
     {node = {hte_next = 0x0}, ptr = 0xb5624900, added = 1}
global_debug_map.hth_table[668].node.hte_next
== 676 0x97a5390 ==
     {node = {hte_next = 0x0}, ptr = 0x97a50a8, added = 0}
global_debug_map.hth_table[669].node.hte_next



Bucket 667 has an entry, but the pointer to the event is 0.  The hash entry
should be deleted (mm_free()) when the event is freed, and have a valid
event pointer otherwise, so the event likely was not freed. All of the
other entries in the hash look good. The event that is currently being
processed in the backtrace, 0xb56248a0, maps to bucket 667 based on the
hash algorithm:  ((0xb56248a0 >> 6) % 769) == 667, which is our bucket.
This appears to be some sort of corruption if I am understanding the hash
table logic correctly.

The event base was good and our application only calls event_free() and not
any of the other APIs that could free an event from the hash
(event_reinit() and a couple others).

I see in the 2.0.22 release notes that a race was fixed in event_active().
Do you think that could cause this? Our next step is to move to 2.0.22 on
our test system to see if the problem still happens. However, this branch
of code is very close to a release, so we are very cautious about changing
things without a clear cause and effect.

We have had several engineers reviewing the application code and so far
have not seen any invalid libevent usages, and as I said, valgrind is
clean. Clearly we are doing something wrong in the application or have
found some sort of edge case within the library. And libevent 2.0.21 has
been very stable as I have used it on this project and others.

The application is complicated enough that I can't easily provide code
snippets that are meaningful, and since we are not sure how to reproduce
the problem, I don't have any test code that can reproduce it.

All of the clients are running under Ubuntu 12.04.

Has anyone else seen anything like this?

Thanks,
-Bill

Reply via email to