Lately, as our networking infrastructure starts to reach its limits, our rsyslog servers seem to get caught in a death spiral once they start spooling to disk. I'm wondering if anyone has suggestions for what we can do to help with the issue.
We have eight syslog aggregators in each of our main datacenters. These collect messages from the local machines (including a very large volume of business events, which is the bulk of the traffic), archive a copy, and then forward the messages on to the other datacenter's aggregators via a load-balanced VIP. Most of the time this works great and we have no issues.

However, we've had a 70% increase in traffic over the last six weeks, and the VPN between the two datacenters is now reaching its limit. During peak traffic periods (which can last for a number of hours) we start spooling, first in memory and later to disk. It's during the despooling that we start having serious problems. Our network graphs for the VPN look like an out-of-control EKG: bandwidth usage spikes, drops like a rock, and then spikes again 5-20 minutes later.

Our suspicion was that a thundering herd issue was coming into play, so this morning, prior to peak time, we implemented a change: we set $ActionResumeInterval to a random number between 21 and 35 on each aggregator, hoping to break up the unspooling actions. Unfortunately, this does not appear to have solved the issue, and may even have made it worse. We began spooling into memory earlier, at a point when file transfers between the two datacenters could still be done at a reasonably high speed. While only memory spooling was going on, though, the network graphs showed a smooth curve, not the jagged peaks we saw yesterday. Once spooling to disk began, we started to see the jagged peaks and valleys again.

So, the question I have is: what can we do to try to remedy these issues? Should I tweak $ActionResumeInterval differently to avoid the thundering herd issue (is 20 seconds too low, perhaps?), or adjust the queuing behavior? Because these are business events, they are effectively set to queue 'forever' (or at least for several days).
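To make the setup concrete, here is a rough sketch of what a disk-assisted forwarding action of this kind looks like in legacy directive syntax. It is not a copy of our production config: the spool directory, queue file prefix, disk limit, VIP hostname, port, and the choice of TCP are all placeholder assumptions. Only $ActionResumeInterval, with one value picked from the randomized range, comes from the change described above.

```
# Illustrative sketch only; paths, names and sizes are assumptions.

# Directory where queue spool files are written (placeholder path)
$WorkDirectory /var/spool/rsyslog

# In-memory linked-list queue for the forwarding action
$ActionQueueType LinkedList

# Setting a file name prefix makes the queue disk-assisted,
# so it spills to disk when the in-memory part fills up
$ActionQueueFileName fwd_remote_dc

# Upper bound on spool space (placeholder value)
$ActionQueueMaxDiskSpace 50g

# Write queued messages to disk on shutdown instead of dropping them
$ActionQueueSaveOnShutdown on

# Retry forever rather than discarding after a failed resume
# (assumed to match the "queue forever" behavior described above)
$ActionResumeRetryCount -1

# Seconds to wait before retrying a suspended action;
# each aggregator got a different value between 21 and 35
$ActionResumeInterval 27

# Forward everything over TCP to the other datacenter's VIP
# (hypothetical hostname and port)
*.* @@dc2-syslog-vip.example.com:514
```

The queue directives have to appear before the action line they apply to, since in legacy syntax they configure the next action defined.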
These problems are also compounded by the fact that trying to restart rsyslog with 20 GB of messages spooled in memory tends to cause an unclean shutdown, which means manually processing the rsyslog spool files (and potentially losing messages that were not written to disk).

