Re: [ClusterLabs Developers] libqb 1.0.2 / loop_timerlist.c:67: make_job_from_tmo: Assertion failed

Christine Caulfield Wed, 27 Jan 2021 03:17:12 -0800

OK. There are definitely race conditions in the timer handling of libqb- at least according to helgrind which found 2 rather easily, so Isuspect that is what is happening here.

I'm preparing a patch to fix them at the moment. The patch will go intothe version 2 stream. Depending on which distribution you're using youmight need to ask for a backport of that patch to v1 or, if you're usingsources, ask me very nicely for a backport to upstream ;-)


Chrissie

On 27/01/2021 08:07, Christine Caulfield wrote:

Hi,
I've never seen anything like that before and, as ou say, no real workha sbeen don in that are since 1.0.2 so I doubt it's a fixed bug.
Although libqb claims thread-safety I wouldn't rule out a threading bugwith you adding timers from another thread. Looking at the test suitethere is actually very little testing of libqbs thread-safety so that'swhere I would start looking for a start.
I'll have a play with it today, but you are able to come up with areproducer for us that would be very helpful.
Chrissie

On 25/01/2021 05:29, Miranda, Russel wrote:
Hello ClusterLabs Developers,
I'm a developer working on an application that is using libqb (1.0.2)for its main event loop, and after running for a few hours, I'mencountering this assertion failure:
loop_timerlist.c:67: make_job_from_tmo: Assertion `t->state ==QB_POLL_ENTRY_ACTIVE' failed.
The assertion causes a SIGABRT, and generates a core file. The stacktrace is at the point of the failure is:
#0  0x00007ff3c76dd207 in raise () from /lib64/libc.so.6
#1  0x00007ff3c76de8f8 in abort () from /lib64/libc.so.6
#2  0x00007ff3c76d6026 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007ff3c76d60d2 in __assert_fail () from /lib64/libc.so.6
#4 0x00007ff3ca1c828d in make_job_from_tmo (data=0x5460250) atloop_timerlist.c:67#5 0x00007ff3ca1c815b in timerlist_expire (timerlist=0x5dffe98) at../include/tlist.h:210#6 expire_the_timers (s=0x5dffe80, ms_timeout=<optimized out>) atloop_timerlist.c:78#7 0x00007ff3ca1c6acc in qb_loop_run (lp=<optimized out>) atloop.c:175 ...
Going back up to the point of the assertion (make_job_from_tmo), hereis what the "data" parameter looks like:
(gdb) p *((struct qb_loop_timer*)data)
$1 = {item = {list = {next = 0x5460250, prev = 0x5460250}, source =0x5dffe80, user_data = 0x7ff3c9144f60 <sessions+6240>, type =QB_LOOP_FD}, dispatch_fn = 0xac8b00 <sessionTimeoutHandler(void*)>, p =QB_LOOP_MED, timerlist_handle = 0x0, state = QB_POLL_ENTRY_EMPTY,check = 978999470, install_pos = 11}
Indeed, as the assertion indicates, t->state is QB_POLL_ENTRY_EMPTYwhen the assertion fails.
The timer that appears to have expired was added a few seconds earlierat the point when a message was sent (via socket) to an externalentity. Typically, when that entity responds, the timer is stopped(via qb_loop_timer_del) to prevent expiration. If timer expirationoccurs before the response is received, the registered dispatchfunction is called to handle failure of the external entity to respond.
That this timer would expire and call the dispatch_fn(sessionTimeoutHandler(void*)) passing in the value of user_data(0x7ff3c9144f60) at this point in the operation of the program isentirely plausible / expected, and I have validated that user_datapoints to a valid object of the appropriate type as well - so all ofthe caller-settable data looks correct to me in the timer entry. WhileI cannot completely exclude the possibility that something stepped onthe t->state value, it seems unlikely that it would have affected onlythat one object property in the timer entry.
It appears to me that the timer has expired, but the timer list entryhas the wrong state. I'm confused to how that might occur. I've beensearching through the API documentation and online through forums, andI can't find anyone else reporting this particular assertion failurewithin "make_job_from_tmo", or even suggesting what we might be doingwrong to cause it. Has anyone seen this before?
Some additional notes:
* There is another thread active that can call qb_loop_timer_add toadd a timer to the main event loop (it is not the timer that hasexpired here), but it could have been adding a timer at the same timethis timer expired. My understanding is that (excluding the IPC calls), libqb isthread-safe, so it should be safe to call the qb_loop_timer_add() APIfrom another thread. qb_loop_timer_add() is the only libqb API calledfrom the other thread.* In some cases we are adding timers that expire immediately (that is,with a delay as short as 0ns). That is not the case with thisparticular timer, it has a delay that is several seconds long. The qb_loop_timer_add() API doesn't indicate a minimum for thedelay, so I am not aware of this being an incorrect usage - though wecould probably switch to calling the qb_loop_job_add API instead ifthat is an incorrect use of the API.
None of the libqb releases since 1.0.2 appear to have any changesintroduced specifically to address a problem like this - though I dosee there were some minor updates made to the loop_timerlist.c code tosilence warnings. If there is a known problem in this area in 1.0.2, Iwill try see how hard it is going to be to step up to the newer version.
I'm hoping someone has seen this problem before, or can suggestsomething that we might be doing wrong in the use of the API thatcould lead to this state.
Sincere thanks for any suggestions,

Russ

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/developers

ClusterLabs home: https://www.clusterlabs.org/
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/developers

ClusterLabs home: https://www.clusterlabs.org/


_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/developers

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs Developers] libqb 1.0.2 / loop_timerlist.c:67: make_job_from_tmo: Assertion failed

Reply via email to