We have been having problems lately as our network infrastructure
reaches its limits: our rsyslog servers seem to get caught in a death
spiral once they start spooling to disk. I'm wondering if anyone has
suggestions for what we can do to mitigate this.

We have eight syslog aggregators in each of our main datacenters; these
collect messages from the local machines (including a very large volume
of business events, which is the bulk of the traffic), archive a copy,
and then forward the messages on to the other datacenter's aggregators
via a load balanced VIP.

Most of the time, this works great, and we have no issues. However,
we've had a 70% increase in traffic over the last six weeks and now the
VPN between the two datacenters is reaching its limit, so during peak
traffic periods (which can last for a number of hours), we start
spooling -- first in memory, and then later to disk.

It's during the despooling that we start having serious problems. Our
network graphs for the VPN look like an out-of-control EKG, with
bandwidth usage spiking, then dropping like a rock, only to spike again
5-20 minutes later. Our suspicion was that there was a thundering herd
issue coming into play, so this morning, prior to peak time, we
implemented a change, setting $ActionResumeInterval on each aggregator
to a random number between 21 and 35 seconds, hoping to break up the
unspooling actions.
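
For illustration, the randomization can be sketched roughly as below.
This is not our actual tooling: the host names are made up, and seeding
the generator from the host name (so each aggregator keeps the same
value across config runs) is an assumption made for the example. Only
the $ActionResumeInterval directive and the 21-35 range come from the
setup described above.

```python
import random
import zlib

# Range (in seconds) used for $ActionResumeInterval, as described above.
LOW, HIGH = 21, 35


def resume_interval(hostname: str, low: int = LOW, high: int = HIGH) -> int:
    """Pick a stable pseudo-random interval in [low, high] for one host."""
    # Assumption: seed from the host name so that regenerating the config
    # yields the same value for the same aggregator every time.
    rng = random.Random(zlib.crc32(hostname.encode("utf-8")))
    return rng.randint(low, high)


def render_directive(hostname: str) -> str:
    """Render the legacy-style rsyslog directive line for one host."""
    return f"$ActionResumeInterval {resume_interval(hostname)}"


if __name__ == "__main__":
    # Hypothetical names for the eight aggregators in one datacenter.
    for n in range(1, 9):
        host = f"syslog-agg{n:02d}.dc1.example.com"
        print(host, render_directive(host))
```

With only 15 possible values and eight aggregators per datacenter, some
hosts will most likely still share an interval, so this spreads retries
out rather than guaranteeing they never coincide.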

Unfortunately, this does not appear to have solved the issue, and may
have even made it worse: we began spooling into memory earlier, even
though file transfers between the two datacenters could still be done
at a reasonably high speed. While we were spooling into memory,
however, the network graphs showed a smooth curve, not the jagged peaks
we saw yesterday. Once spooling to disk began, we started to see the
jagged peaks and valleys again.

So, the question I have is, what can we do to try and remedy these
issues? Should I tweak $ActionResumeInterval differently to avoid the
thundering herd issue (is 20 seconds too low, perhaps?) or adjust the
queuing behavior? Because these are business events, they are
effectively set to queue 'forever' (or at least for several days). These
problems are also compounded by the fact that trying to restart rsyslog
with 20GB of messages spooled in memory tends to cause an unclean
shutdown, which means manually processing the rsyslog spool files (and
potentially losing messages that were not written to disk).

