+ rsyslog-users (FYI)

--
Regards,
Janmejay
PS: Please blame the typos in this mail on my phone's uncivilized soft
keyboard sporting its not-so-smart-assist technology.

On Dec 9, 2014 4:49 AM, "singh.janmejay" <[email protected]> wrote:

> Sorry, I forgot to post that once I removed the tags from each rule, the
> segfaults stopped. It ran for more than 24 hours, after which it was
> restarted deliberately for some deployment and config change etc.
>
> I'll duplicate it, deploy the new version, and post what happens.
>
> --
> Regards,
> Janmejay
>
> PS: Please blame the typos in this mail on my phone's uncivilized soft
> keyboard sporting its not-so-smart-assist technology.
>
> On Dec 9, 2014 4:20 AM, "Pavel Levshin" <[email protected]> wrote:
>
>> Your analysis is most probably correct. As far as I remember, this object
>> reuse was confusing to me, but there were no segfaults at the moment.
>>
>> Object reuse is intended to reduce memory allocations/frees, which have a
>> big impact on performance. On the other hand, locks are expensive too,
>> and maybe even unsuitable for this case.
>>
>> Can you confirm that just duplicating event.tags solves your segfaults?
>>
>> --
>> Pavel
>>
>> 07.12.2014 0:03, singh.janmejay:
>>
>> A little bit of an elaborate story.
>>
>> So, I was fighting a segfault for the last few hours. It was happening
>> while freeing up a msg. Here is the backtrace (it was the same in many
>> core dumps):
>>
>> (gdb) bt
>> #0  0x0000000100000019 in ?? ()
>> #1  0x00007fa23ee9d536 in lh_table_free (t=0x7fa22a28bc70) at linkhash.c:116
>> #2  0x00007fa23ee9a6fd in json_object_object_delete (jso=0x7fa22967bf50) at json_object.c:273
>> #3  0x00007fa23ee9d536 in lh_table_free (t=0x7fa2009ac960) at linkhash.c:116
>> #4  0x00007fa23ee9a6fd in json_object_object_delete (jso=0x7fa22dcd8a70) at json_object.c:273
>> #5  0x000000000041bef7 in msgDestruct (ppThis=0x7fa2317f9b48) at msg.c:928
>> #6  0x0000000000441bd4 in DeleteProcessedBatch (pThis=0xa7eb90, pBatch=0xac35a8) at queue.c:1589
>> #7  0x0000000000441c9e in DequeueConsumableElements (pThis=0xa7eb90, pWti=0xac3570, piRemainingQueueSize=0x7fa2317f9bd8) at queue.c:1626
>> #8  0x0000000000441e6d in DequeueConsumable (pThis=0xa7eb90, pWti=0xac3570) at queue.c:1680
>> #9  0x000000000044226c in DequeueForConsumer (pThis=0xa7eb90, pWti=0xac3570) at queue.c:1821
>> #10 0x0000000000442502 in ConsumerReg (pThis=0xa7eb90, pWti=0xac3570) at queue.c:1875
>> #11 0x000000000043d95a in wtiWorker (pThis=0xac3570) at wti.c:334
>> #12 0x000000000043c51f in wtpWorker (arg=0xac3570) at wtp.c:389
>> #13 0x00007fa23f6b6b50 in start_thread (arg=<optimized out>) at pthread_create.c:304
>> #14 0x00007fa23e7de7bd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
>> #15 0x0000000000000000 in ?? ()
>>
>> Now, it turns out the thing lh_table_free is trying to free (#1) is
>> basically an object with several keys, and whenever it failed, it failed
>> on the last one. How do I know that?
>>
>> (gdb) up 1
>> #1  0x00007fa23ee9d536 in lh_table_free (t=0x7fa22a28bc70) at linkhash.c:116
>> 116     linkhash.c: No such file or directory.
>> (gdb) p *c
>> $1 = {k = 0x7fa22dcd8a50, v = 0xa6bd20, next = 0x0, prev = 0x7fa22e93d490}
>>
>> And out of the 2 rulebases (both being invoked for events of roughly the
>> same frequency), every single segfault was for the rulebase that was
>> using tags on its rules.
>>
>> The rulebase in question had 2 rules, matching several fields, and both
>> rules had tags applied.
>>
>> Note that ln_normalize applies event.tags only when tags are present,
>> and an insert of a new key into json-c's linked-hash-map goes at the end
>> of the linked list (check lh_table_insert).
>>
>> So, this key-value pair was supposed to be 'event.tags' and the
>> corresponding tags. The data that c was pointing to was garbled, so I
>> couldn't tell what the key actually was, but I can conclude it was
>> 'event.tags' based on the size of the objects. The garbling may be
>> because the memory allocated for the key was reused (note the key is
>> freed before the value, check json_object_lh_entry_free). Considering it
>> is allocated in a different thread (when the message is parsed using
>> mmnormalize), that is plausible (as opposed to an object allocated in
>> the same thread, using the same virtual-address range from one thread
>> only).
>>
>> Now, what makes it a race condition? Note that the event.tags object in
>> ln_normalize is not duplicated; instead it is the same object with an
>> incremented reference count. And the freeing routine does the counter
>> check without locking or using atomic operations, check json_object_put.
>>
>> The only lock that msgDestruct takes is on msg->mut, which means
>> different messages being processed at the same time can each take their
>> own lock and concurrently try to free the 'event.tags' object.
>>
>> With careful timing, several of them may concurrently decrement the
>> counter and one of them may free the object. The segfault can then
>> happen either because someone accesses it, or because someone tries to
>> free it again, etc.
>>
>> For me, the segfault was reproducing every few hours.
>>
>> The solution is simple: we need to duplicate the event.tags object
>> instead of passing it on with an incremented reference count.
>>
>> I'll fix it, but before jumping to the fix I need another (few?)
pairs of
>> eyes to check the correctness of the reasoning (to point out any
>> mistakes in the analysis and ensure I haven't overlooked anything or
>> misdiagnosed).
>>
>> Can someone please help validate/correct?
>>
>> --
>> Regards,
>> Janmejay
>> http://codehunk.wordpress.com
>>
>> _______________________________________________
>> Lognorm mailing list
>> [email protected]
>> http://lists.adiscon.net/mailman/listinfo/lognorm

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
DON'T LIKE THAT.

