I'm running into a strange spooling problem with our rsyslog
infrastructure and I'm not quite sure where to start poking to try and
figure out the issue, so I figure turning to the mailing list is the
next place to go now that I'm out of ideas.

We have two datacenters, A and B. Each one has 100 tomcat servers as a
frontend, which generate business events that are fed back to a load
balancer sitting in front of 8 archiver/forwarder boxes. Each
archiver/forwarder takes in the event stream from the tomcat servers,
writes a copy to disk, then forwards a complete copy of the event stream
to the load balancer in the other datacenter, as well as to two hosts in
their datacenter which do real-time analytics on it. Traffic between the
datacenters travels over a VPN. Everything is running rsyslog 7.4.9 at
the moment (one of my projects for this quarter is to upgrade to v8),
with TCP
logging. Datacenter A handles roughly 20% more traffic than datacenter B
on average.
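For reference, the fan-out on each archiver/forwarder looks roughly like
the sketch below. The hostnames, port, and queue settings here are
illustrative placeholders, not our real values:

```
# Sketch only -- hostnames, port, and queue sizes are made up.
module(load="imtcp")
input(type="imtcp" port="514")

# 1. Archive a copy of the event stream to local disk
action(type="omfile" file="/var/log/events/events.log")

# 2. Forward the full stream to the load balancer in the other
#    datacenter, with a disk-assisted queue so we can spool
action(type="omfwd" target="lb.other-dc.example.com" port="514"
       protocol="tcp" queue.type="LinkedList"
       queue.filename="remote_fwd" queue.saveOnShutdown="on")

# 3. Forward to the two local analytics hosts
action(type="omfwd" target="analytics1.example.com" port="514" protocol="tcp")
action(type="omfwd" target="analytics2.example.com" port="514" protocol="tcp")
```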

Previously, I had a configuration where the archivers compressed their
event streams on the fly as they wrote them to disk. That was a legacy
configuration from when we had fewer archiver/forwarder boxes, so batch
compression of that much data would have caused I/O contention.
However, this caused some issues because sometimes when rsyslog was
restarted, the gzip headers/footers would get written incorrectly and
corrupt the compressed file, plus the files were about twice as large as
they would be if we used a batch compression method.
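(For the curious, the old on-the-fly compression was just omfile's
built-in gzip support, something along these lines -- the path is made
up:)

```
# Legacy setup (sketch): omfile compressing on the fly via zipLevel.
action(type="omfile" file="/var/log/events/events.log.gz" zipLevel="6")
```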

Two weeks ago, I changed the configuration so that rsyslog would write
its logs uncompressed. Every fifteen minutes, a cron job HUPs the
rsyslog process, then compresses the uncompressed log files. Another
cron job a few minutes later uploads the compressed files to S3.
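Schematically, the cron side looks like this (the offsets and script
names are placeholders, not the actual entries):

```
# Illustrative crontab entries -- script names/paths are made up:
*/15 * * * *    root  /usr/local/bin/rotate-events.sh   # mv logs aside, HUP rsyslogd, gzip
5-59/15 * * * * root  /usr/local/bin/upload-events.sh   # push *.gz to S3
```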

Over the last two weeks, we have seen behavior where the event stream
from the archivers/forwarders in datacenter A to the load balancer in
datacenter B will start spooling, sometimes for hours; however, the
streams to the two analytics boxes locally in datacenter A do not seem
to be affected. Nor does datacenter B have any problem sending its logs
to datacenter A, except when the spooling gets bad enough that the
archivers in datacenter A write spoolfiles to disk -- and then B quickly
recovers once the spoolfiles are finished writing to disk. The issue
does not appear to happen at any particular time of day (sometimes it's
in the morning, other times in the afternoon) and it doesn't appear to
closely correlate with traffic, though it does only happen during the
day, when our traffic is highest overall.

My first thought was a problem with the VPN, but that does not appear to
be the case; transferring a file between A and B with scp, for instance,
works just fine, and there is no significant problem with latency or
packet loss. My next thought was that the batch compression was causing
I/O contention in datacenter B, leaving rsyslog on its
archiver/forwarders unable to accept messages; however, iostat
reports that %util peaks at ~16% on the hosts in datacenter B, compared
to nearly 25% in datacenter A (so if that were the problem, I would
assume B would have trouble sending to A, not vice versa). Furthermore,
if the issue *is* with contention on the datacenter A
archivers/forwarders, why is their stream to the analytics hosts not
backing up as well?

Tomorrow I will likely test just turning off the compression job for a
while to see if that makes this problem disappear (assuming the problem
shows up at all), but I don't really know what else to look at to
determine the cause of the issue. Does anyone have suggestions on what
the issue could be or where else to poke? The only thing I can think of
that I haven't tried (or seriously considered trying) yet is turning on
rsyslog debugging, and I've held off mainly because I suspect it will be
hard to pull the signal from the noise.
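(If it does come to that, my understanding is that debugging can be
enabled through environment variables that rsyslogd reads at startup,
something like the following -- the log path is arbitrary:)

```shell
# Enable rsyslog debug output via environment variables read at startup.
# "NoStdOut" keeps the debug stream off stdout; it goes to the file below.
export RSYSLOG_DEBUG="Debug NoStdOut"
export RSYSLOG_DEBUGLOG="/var/log/rsyslog-debug.log"
```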


_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.
