Little bit of an elaborate story.

So, I was fighting a segfault for the last few hours. It was happening while
freeing up a msg. Here is the backtrace (it was the same in many core-dumps):

(gdb) bt
#0  0x0000000100000019 in ?? ()
#1  0x00007fa23ee9d536 in lh_table_free (t=0x7fa22a28bc70) at linkhash.c:116
#2  0x00007fa23ee9a6fd in json_object_object_delete (jso=0x7fa22967bf50) at
json_object.c:273
#3  0x00007fa23ee9d536 in lh_table_free (t=0x7fa2009ac960) at linkhash.c:116
#4  0x00007fa23ee9a6fd in json_object_object_delete (jso=0x7fa22dcd8a70) at
json_object.c:273
#5  0x000000000041bef7 in msgDestruct (ppThis=0x7fa2317f9b48) at msg.c:928
#6  0x0000000000441bd4 in DeleteProcessedBatch (pThis=0xa7eb90,
pBatch=0xac35a8) at queue.c:1589
#7  0x0000000000441c9e in DequeueConsumableElements (pThis=0xa7eb90,
pWti=0xac3570, piRemainingQueueSize=0x7fa2317f9bd8) at queue.c:1626
#8  0x0000000000441e6d in DequeueConsumable (pThis=0xa7eb90, pWti=0xac3570)
at queue.c:1680
#9  0x000000000044226c in DequeueForConsumer (pThis=0xa7eb90,
pWti=0xac3570) at queue.c:1821
#10 0x0000000000442502 in ConsumerReg (pThis=0xa7eb90, pWti=0xac3570) at
queue.c:1875
#11 0x000000000043d95a in wtiWorker (pThis=0xac3570) at wti.c:334
#12 0x000000000043c51f in wtpWorker (arg=0xac3570) at wtp.c:389
#13 0x00007fa23f6b6b50 in start_thread (arg=<optimized out>) at
pthread_create.c:304
#14 0x00007fa23e7de7bd in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#15 0x0000000000000000 in ?? ()


Now, it turns out the thing lh_table_free is trying to free (#1) is
basically an object with several keys, and whenever it failed, it failed on
the last one. How do I know that?

(gdb) up 1
#1  0x00007fa23ee9d536 in lh_table_free (t=0x7fa22a28bc70) at linkhash.c:116
116     linkhash.c: No such file or directory.
(gdb) p *c
$1 = {k = 0x7fa22dcd8a50, v = 0xa6bd20, next = 0x0, prev = 0x7fa22e93d490}

And out of the 2 rulebases (both invoked for events of roughly the same
frequency), every single segfault was in the rulebase that used tags on its
rules.

The rulebase in question had 2 rules, matching several fields, and both
rules had tags applied.

Note that ln_normalize applies event.tags only when tags are present, and
that in json-c's linked hash map a newly inserted key goes at the end of
the linked list (check lh_table_insert).

So this key-value pair was supposed to be 'event.tags' and the
corresponding tags. The data that c pointed to was garbled, so I couldn't
tell what the key actually was, but I can conclude it was 'event.tags'
based on the size of the objects. The garbling is likely because the memory
allocated for the key was reused (note the key is freed before the value,
check json_object_lh_entry_free). That is plausible given it was allocated
in a different thread (when the message was parsed using mmnormalize), as
opposed to an object allocated in the same thread, where the same
virtual-address range is used from one thread only.

Now, what makes it a race condition? Note that the event.tags object in
ln_normalize is not duplicated; instead it is the same object with an
incremented reference count. Note also that the freeing routine does the
counter check without locking or using atomic operations, check
json_object_put.

The only lock that msgDestruct takes is on msg->mut, which means threads
processing different messages at the same time can each acquire their own
lock and concurrently try to free the 'event.tags' object.

With the right timing, several of them may concurrently decrement the
counter and one of them may free the object. The segfault can then happen
either because someone still accesses it, or because someone tries to free
it again, etc.

For me, the segfault reproduced every few hours.

The solution is simple: we need to duplicate the event.tags object instead
of passing it on with an incremented reference count.

I'll fix it, but before jumping to the fix I need another pair (or a few
pairs) of eyes to check the reasoning, to point out any mistakes in the
analysis and to make sure I haven't overlooked anything or misdiagnosed it.

Can someone please help validate/correct?

-- 
Regards,
Janmejay
http://codehunk.wordpress.com