I'm running into a strange spooling problem with our rsyslog infrastructure and I'm not sure where to start poking to figure out the issue, so I'm turning to the mailing list now that I'm out of ideas.
We have two datacenters, A and B. Each has 100 Tomcat servers as a frontend, which generate business events that are fed to a load balancer sitting in front of 8 archiver/forwarder boxes. Each archiver/forwarder takes in the event stream from the Tomcat servers, writes a copy to disk, then forwards a complete copy of the stream to the load balancer in the other datacenter, as well as to two hosts in the same datacenter that do real-time analytics on it. Traffic between the datacenters travels over a VPN. Everything is running rsyslog 7.4.9 at the moment (one of my projects for this quarter is to upgrade to v8) with TCP logging. Datacenter A handles roughly 20% more traffic than datacenter B on average.

Previously, the archivers wrote their event streams to disk compressed on the fly. This was a legacy configuration from when we had fewer archiver/forwarders and I/O contention was a problem when doing massive amounts of compression. However, it caused issues: sometimes when rsyslog was restarted, the gzip headers/footers would be written incorrectly and corrupt the compressed file, and the files were about twice as large as they would be with batch compression.

Two weeks ago, I changed the configuration so that rsyslog writes its logs uncompressed. Every fifteen minutes, a cron job HUPs the rsyslog process, then compresses the uncompressed log files. Another cron job a few minutes later uploads the compressed files to S3.

Over the last two weeks, we have seen the event stream from the archiver/forwarders in datacenter A to the load balancer in datacenter B start spooling, sometimes for hours; however, the streams to the two analytics boxes locally in datacenter A do not seem to be affected.
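For context, the cron side looks roughly like this sketch (paths, schedule offsets, and the bucket name are illustrative placeholders, not our exact crontab):

```
# /etc/cron.d/event-logs -- illustrative sketch, not the real crontab
# Every 15 minutes: rotate the uncompressed event log out of the way,
# HUP rsyslog so it reopens its output file, then batch-compress the
# rotated copies.
*/15 * * * * root mv /var/log/events/events.log /var/log/events/events.$(date +\%s).log && kill -HUP $(cat /var/run/rsyslogd.pid) && gzip /var/log/events/events.*.log
# A few minutes later: ship anything compressed up to S3.
4,19,34,49 * * * * root aws s3 mv /var/log/events/ s3://example-bucket/events/ --recursive --exclude '*' --include '*.gz'
```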
Nor does datacenter B have any problem sending its logs to datacenter A, except when the spooling gets bad enough that the archivers in datacenter A write spoolfiles to disk -- and then B quickly recovers once the spoolfiles are finished writing. The issue does not appear to happen at any particular time of day (sometimes in the morning, other times in the afternoon) and does not closely correlate with traffic, though it only happens during the day, when our traffic is highest overall.

My first thought was a problem with the VPN, but that does not appear to be the case: transferring a file between A and B with scp, for instance, works just fine, and there is no significant latency or packet loss. My next thought was that the batch compression was causing I/O contention in datacenter B, leaving rsyslog on its archiver/forwarders unable to accept messages; however, iostat reports that %util peaks at ~16% on the hosts in datacenter B, compared to nearly 25% in datacenter A (so if that were the problem, I would expect B to have trouble sending to A, not vice versa). Furthermore, if the issue *is* contention on the datacenter A archiver/forwarders, why is their stream to the analytics hosts not backing up as well?

Tomorrow I will likely try turning off the compression job for a while to see whether the problem disappears (assuming it shows up at all), but I don't really know what else to look at. Does anyone have suggestions on what the cause could be or where else to poke? The only thing I haven't tried (or considered trying) yet is turning on rsyslog debugging, and only because I suspect it will be hard to pull the signal from the noise.
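In case it helps anyone make a suggestion: the cross-datacenter forwarding actions use disk-assisted action queues along these general lines (the hostname, queue name, and limits below are placeholders, not our exact config), and I'm considering loading impstats to watch the queue counters before resorting to full debug output:

```
# Illustrative sketch of a forwarding action with a disk-assisted queue
# (legacy directives, as used on v7) -- names/limits are placeholders.
$ActionQueueType LinkedList
$ActionQueueFileName fwd_dcb        # spoolfiles appear once in-memory queue fills
$ActionQueueMaxDiskSpace 1g
$ActionQueueSaveOnShutdown on
$ActionResumeRetryCount -1          # retry forever instead of discarding
*.* @@lb.dc-b.example:514

# Periodic queue/action statistics, much less noisy than full debug:
$ModLoad impstats
$PStatInterval 60
$PStatSeverity 7
```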
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT.

