Thanks for the reply, Azat. After a long investigation we discovered a
bad memory module in the ESX server.  Once that stick was replaced, we
never saw the problem again. The ESX host was running about 15,000
instances of our app in a test environment, and we would see the crash
on just a few of them every night.  We never saw any other app or VM OS
crash, so I guess our app was the only one hitting that bad memory
(statistically that makes sense).  Thanks, memtest86.

-Bill



On Sat, Oct 31, 2015 at 12:36 PM, Azat Khuzhin <[email protected]> wrote:

> On Wed, Feb 25, 2015 at 09:25:03AM -0500, Bill Vaughan wrote:
> > Hi all,
> >
> > We have been seeing a fairly hard to reproduce problem with crashes
> > inside of libevent in our overnight scale tests.  We run 15,000
> > clients in an ESX environment and have been getting one or two
> > crashes per night. We are running libevent 2.0.21 on this test
> > system. We have moved to 2.0.22 on our development branch.
> >
> > The application is relatively large and complex and it is quite
> > possible we have a memory trampler, but valgrind tests so far are
> > clean and code inspection hasn't found anything.
>
> What valgrind tools do you use?
>
> AFAIK valgrind (at least memcheck) runs only one thread at a time,
> i.e. it serialises multi-threading (and you have a multi-threaded app):
>
> http://valgrind.org/docs/manual/manual-core.html#manual-core.pthreads_perf_sched
>     The main thing to point out with respect to threaded programs is
>     that your program will use the native threading library, but
>     Valgrind serialises execution so that only one (kernel) thread is
>     running at a time. This approach avoids the horrible implementation
>     problems of implementing a truly multithreaded version of Valgrind,
>     but it does mean that threaded apps never use more than one CPU
>     simultaneously, even if you have a multiprocessor or multicore
>     machine.
>
> Can you try gcc/clang ASan/TSan instead, e.g.:
> $ gcc -fsanitize=thread ...
> $ gcc -fsanitize=address ...
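>
> For example, something like this (a sketch only, adjust to your own
> build; -g and -fno-omit-frame-pointer make the ASan stack traces much
> more readable):
>
> $ gcc -g -O1 -fno-omit-frame-pointer -fsanitize=address -o vtc_app ...
> $ ASAN_OPTIONS=abort_on_error=1 ./vtc_app ...
>
> (abort_on_error=1 makes ASan abort() when it reports a bug, so a
> failing run can still leave a core file.)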
>
> > We recently ran the overnight test with libevent debugging enabled
> > (event_enable_debug_mode(), evthread_enable_lock_debuging()) and had the
> > following assert in the libevent debug hash table:
> >
> >
> > [err] event_del_internal: noting a del on a non-setup event 0xb56248a0
> > (events: 0x0, fd: -1, flags: 0x80)
> > === Command detached from window (Mon Feb 23 23:13:14 2015) ===
> > === Command terminated with signal 6 (core file generated) (Mon Feb 23
> > 23:13:14 2015) ===
> >
> >
> > Using host libthread_db library "/lib/i386-linux-gnu/libthread_db.so.1".
> > Core was generated by `./vtc_app -Script-Client29-333'.
> > Program terminated with signal 6, Aborted.
> > #0  0xb775e424 in __kernel_vsyscall ()
> > (gdb) where
> > #0  0xb775e424 in __kernel_vsyscall ()
> > #1  0xb747a1df in raise () from /lib/i386-linux-gnu/libc.so.6
> > #2  0xb747d825 in abort () from /lib/i386-linux-gnu/libc.so.6
> > #3  0x0814bc31 in event_exit (errcode=-559030611) at log.c:79
> > #4  0x0814bd53 in event_errx (eval=-559030611,
> >     fmt=0x82105b0 "%s: noting a del on a non-setup event %p (events: 0x%x, fd: %d, flags: 0x%x)") at log.c:136
> > #5  0x0813dc07 in event_del_internal (ev=<optimized out>) at event.c:2265
> > #6  event_process_active_single_queue (activeq=<optimized out>, base=<optimized out>) at event.c:1324
> > #7  event_process_active (base=<optimized out>) at event.c:1420
> > #8  event_base_loop (base=0xb6a09d58, flags=2) at event.c:1621
> > #9  0x08073960 in VtcCommMgr::ThreadLoop (this=0xacdcc008) at vtc_app/vtc_comm_mgr.cpp:844
> > #10 0xb772cd4c in start_thread () from /lib/i386-linux-gnu/libpthread.so.0
> > #11 0xb753ad3e in clone () from /lib/i386-linux-gnu/libc.so.6
> >
> >
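> > (For illustration, a minimal sketch of how those two debug hooks
> > mentioned above are typically switched on; both calls have to run
> > before any event_base is created, and this main() is hypothetical,
> > not our application:)
> >
> >     #include <event2/event.h>
> >     #include <event2/thread.h>
> >
> >     int main(void)
> >     {
> >         event_enable_debug_mode();        /* track event setup/teardown */
> >         evthread_use_pthreads();          /* locking for threaded use */
> >         evthread_enable_lock_debuging();  /* check lock usage (2.0 spelling) */
> >
> >         struct event_base *base = event_base_new();
> >         /* ... register events, event_base_dispatch(base) ... */
> >         event_base_free(base);
> >         return 0;
> >     }
> >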
> > So it looks like an event fired that was not in the hash table, which
> > should only happen if the event was freed, from what I can tell. This
> > points to a clear application bug; however, there is a bit more to
> > the story.
> >
> > Examining the libevent hash table that tracks the events, we found this:
> >
> > == 667 0xb5e248f0 ==
> >      {node = {hte_next = 0x0}, ptr = 0x0, added = 0}
> > global_debug_map.hth_table[666].node.hte_next
> > == 668 0xb5600858 ==
> >      {node = {hte_next = 0x0}, ptr = 0xb5600808, added = 1}
> > global_debug_map.hth_table[667].node.hte_next
> > == 669 0xb5624950 ==
> >      {node = {hte_next = 0x0}, ptr = 0xb5624900, added = 1}
> > global_debug_map.hth_table[668].node.hte_next
> > == 676 0x97a5390 ==
> >      {node = {hte_next = 0x0}, ptr = 0x97a50a8, added = 0}
> > global_debug_map.hth_table[669].node.hte_next
> >
> >
> >
> > Bucket 667 has an entry, but the pointer to the event is 0.  The hash
> > entry should be deleted (mm_free()) when the event is freed, and have
> > a valid event pointer otherwise, so the event likely was not freed.
> > All of the other entries in the hash look good.
>
> If your app is writing into memory that it must not touch, then you
> can't trust any of this; the corrupted entry could be a side effect
> rather than the root cause.
>
> > The event that is currently being processed in the backtrace,
> > 0xb56248a0, maps to bucket 667 based on the hash algorithm:
> > ((0xb56248a0 >> 6) % 769) == 667, which is our bucket.  This appears
> > to be some sort of corruption if I am understanding the hash table
> > logic correctly.
> >
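> > (A quick standalone check of that arithmetic; this just mirrors the
> > (ptr >> 6) % 769 mapping described above rather than calling into
> > libevent itself:)
> >
> >     #include <stdint.h>
> >     #include <stdio.h>
> >
> >     int main(void)
> >     {
> >         uintptr_t ev = 0xb56248a0u;         /* event being processed */
> >         unsigned bucket = (unsigned)(ev >> 6) % 769;
> >         printf("bucket = %u\n", bucket);    /* prints 667 */
> >         return 0;
> >     }
> >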
> > The event base was good, and our application only calls event_free(),
> > not any of the other APIs that could free an event from the hash
> > (event_reinit() and a couple of others).
> >
> > I see in the 2.0.22 release notes that a race was fixed in
> > event_active().  Do you think that could cause this? Our next step is
> > to move to 2.0.22 on our test system to see if the problem still
> > happens. However, this branch of code is very close to a release, so
> > we are very cautious about changing things without a clear cause and
> > effect.
>
> Not sure, but of course it could have an impact.
> And since it's been a while, maybe you have already tried 2.0.22?
>
> > We have had several engineers reviewing the application code and so far
> > have not seen any invalid libevent usages, and as I said, valgrind is
> > clean. Clearly we are doing something wrong in the application or have
> > found some sort of edge case within the library. And libevent 2.0.21 has
> > been very stable as I have used it on this project and others.
> >
> > The application is complicated enough that I can't easily provide code
> > snippets that are meaningful, and since we are not sure how to reproduce
> > the problem, I don't have any test code that can reproduce it.
> >
> > All of the clients are running under Ubuntu 12.04.
> >
> > Has anyone else seen anything like this?
>
> Sorry for such a long delay.
>
