Well, ideally we'd like to keep doing the batch compression; the problem is that with on-the-fly compression, we had to turn off the "safe zipping" because otherwise the compression was essentially nonexistent (I believe it only reduced the file by something like 5%, which made it essentially worthless). That, I believe, is why we sometimes see the files getting corrupted.
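If "safe zipping" here refers to omfile's veryRobustZip option (an assumption on my part), the on-the-fly compression knobs in a v7+ config look roughly like this; the dynafile template string and path are hypothetical:

```
# hypothetical dynafile template: one file per quarter hour
template(name="QuarterHourFile" type="string"
         string="/var/log/events/%$year%-%$month%-%$day%-%$hour%-%$qhour%.log.gz")

action(type="omfile" dynaFile="QuarterHourFile"
       zipLevel="9"         # 0-9: higher trades CPU for a better ratio
       veryRobustZip="on"   # complete each zip record so a crash corrupts
                            # at most the last record -- at the cost of much
                            # smaller records and a worse compression ratio
       flushOnTXEnd="off")  # let more data accumulate per zip record
```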
The way our logs work, we write to logfiles using the dynamic filename method, starting a new logfile every quarter hour. At one minute past the quarter hour, we HUP rsyslog and then compress the logs that are more than an hour stale (the margin is there to prevent problems during daylight saving time changes, something that will hopefully go away once we move to UTC, another item on my list for the future). So there should be no problems with rsyslog not releasing the filehandle before the compression runs.

Conveniently, I do have impstats enabled, but putting its output through the analyzer tool gives me a graph I have no idea how to interpret (and which does not seem to make any sense). The only other rsyslog messages I see in the logfiles are the ones logged when rsyslog is HUPed by the compression script. The queues are growing, that much I can see, and I know it is only the queue for the cross-datacenter traffic, because when it eventually gets bad enough that rsyslog has to write spool files to disk, they are all logs destined for the other datacenter.

On Mon, 2015-02-09 at 15:28 -0800, David Lang wrote:
> I'd be very interested to hear if moving to v8 addresses the
> compressed-files-corrupted-on-shutdown problem (this should not be
> happening, but it's not worth troubleshooting much on the old version).
> Having rsyslog write the data compressed should result in significantly
> less I/O than writing it uncompressed (the compression is all done in
> RAM), although, as you noted, rsyslog doesn't compress in as large a
> chunk as batch compressors, so it doesn't get as good a compression
> ratio (although, did you try setting ziplevel 9, or did you leave it at
> the default of 5?)
>
> One thing that could be causing you problems is that it takes longer to
> finish writing the logs when you do a restart; put a sleep between
> sending the HUP to rsyslog and when you move/compress the files it has
> created.
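For what it's worth, the quarter-hour HUP-and-compress job described above, with the grace period David suggests, can be sketched like this; the paths, the `.log` naming, and the 30-second wait are all hypothetical:

```shell
#!/bin/sh
# Sketch of the quarter-hour rotation cron job: HUP rsyslog so it closes
# the finished dynafile, give it time to flush, then batch-compress
# anything more than an hour stale (the DST safety margin).

rotate_and_compress() {
    logdir=$1     # directory holding the uncompressed quarter-hour files
    pidfile=$2    # rsyslogd pid file, e.g. /var/run/rsyslogd.pid
    wait_secs=$3  # grace period between the HUP and the compression

    # ask rsyslog to close and reopen its output files
    [ -f "$pidfile" ] && kill -HUP "$(cat "$pidfile")"

    # let rsyslog finish flushing before touching the files
    sleep "$wait_secs"

    # compress only files whose mtime is over 60 minutes old, so we never
    # grab a file rsyslog might still be writing
    find "$logdir" -name '*.log' -mmin +60 -exec gzip -9 {} +
}

# typical cron invocation at minutes 1,16,31,46 (hypothetical values):
# rotate_and_compress /var/log/events /var/run/rsyslogd.pid 30
```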
>
> In terms of your queueing issues, the first thing to do whenever you
> have queueing problems is to make sure you have impstats enabled. If
> you set it up with the default format, writing to a file with
> cumulative counters (basically the default), there is an automated
> analysis page on rsyslog.com that will look through the stats and point
> out problems for you.
>
> I would look at the impstats output for the output blocking and the
> queues starting to grow. Then I would look for any logs from rsyslog
> around that time (if the connection is failing, you will see an error
> where the output is being suspended and restarted). If it's not halting
> entirely, but merely slowing to a crawl, you won't see the output
> suspended; its queue size will just keep growing while it continues to
> process messages.
>
> Once you see what's happening there, it will be much easier to figure
> out why.
>
> David Lang
>
> On Mon, 9 Feb 2015, Chastity Blackwell wrote:
>
> > I'm running into a strange spooling problem with our rsyslog
> > infrastructure, and I'm not quite sure where to start poking to
> > figure out the issue, so I figure turning to the mailing list is the
> > next step now that I'm out of ideas.
> >
> > We have two datacenters, A and B. Each one has 100 tomcat servers as
> > a frontend, which generate business events that are fed back to a
> > load balancer sitting in front of 8 archiver/forwarder boxes. Each
> > archiver/forwarder takes in the event stream from the tomcat servers,
> > writes a copy to disk, then forwards a complete copy of the event
> > stream to the load balancer in the other datacenter, as well as to
> > two hosts in its own datacenter which do real-time analytics on it.
> > Traffic between the datacenters travels over a VPN. Everything is
> > running rsyslog 7.4.9 at the moment (one of my projects for this
> > quarter is to update to v8) with TCP logging.
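For reference, the impstats setup David describes (default format, cumulative counters, written to a file) can be enabled with something like the following; the output path is hypothetical:

```
# load once near the top of rsyslog.conf
module(load="impstats"
       interval="60"          # emit counters every 60 seconds
       format="legacy"        # the default format the rsyslog.com analyzer parses
       resetCounters="off"    # cumulative counters (the default)
       log.syslog="off"       # keep stats out of the normal log stream
       log.file="/var/log/rsyslog-stats.log")
```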
> > Datacenter A handles roughly 20% more traffic than datacenter B on
> > average.
> >
> > Previously, I had a configuration where the archivers wrote their
> > event streams to disk compressed on the fly. This was a legacy
> > configuration from when we had fewer archivers/forwarders and I/O
> > contention would be a problem when doing massive amounts of
> > compression. However, this caused some issues: sometimes when rsyslog
> > was restarted, the gzip headers/footers would get written incorrectly
> > and corrupt the compressed file, and the files were about twice as
> > large as they would be with a batch compression method.
> >
> > Two weeks ago, I changed the configuration so that rsyslog writes its
> > logs uncompressed. Every fifteen minutes, a cron job HUPs the rsyslog
> > process, then compresses the uncompressed log files. Another cron job
> > a few minutes later uploads the compressed files to S3.
> >
> > Over the last two weeks, we have seen behavior where the event stream
> > from the archivers/forwarders in datacenter A to the load balancer in
> > datacenter B will start spooling, sometimes for hours; however, the
> > streams to the two analytics boxes locally in datacenter A do not
> > seem to be affected. Nor does datacenter B have any problem sending
> > its logs to datacenter A, except when the spooling gets bad enough
> > that the archivers in datacenter A write spoolfiles to disk -- and
> > then B quickly recovers once the spoolfiles have finished writing.
> > The issue does not appear to happen at any particular time of day
> > (sometimes it's in the morning, other times in the afternoon) and it
> > doesn't closely correlate with traffic, though it does only happen
> > during the day, when our traffic is highest overall.
> >
> > My first thought was a problem with the VPN, but that does not appear
> > to be the case; transferring a file between A and B with scp, for
> > instance, works just fine, and there is no significant problem with
> > latency or packet loss. My next thought was that the batch
> > compression method was causing I/O contention in datacenter B, making
> > rsyslog on its archiver/forwarders unable to take messages; however,
> > iostat reports that %util peaks at ~16% on the hosts in datacenter B,
> > compared to nearly 25% in datacenter A (so if that were the problem,
> > I would expect B to have trouble sending to A, not vice versa).
> > Furthermore, if the issue *is* contention on the datacenter A
> > archivers/forwarders, why is their stream to the analytics hosts not
> > backing up as well?
> >
> > Tomorrow I will likely test just turning off the compression job for
> > a while to see if that makes the problem disappear (assuming the
> > problem shows up at all), but I don't know what else to look at to
> > determine the cause. Does anyone have suggestions on what the issue
> > could be, or where else to poke? The only thing I can think of that I
> > haven't tried (or considered trying) yet is turning on rsyslog
> > debugging, and only because I suspect it will be hard to pull the
> > signal from the noise.
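On the ratio point raised above (on-the-fly files coming out about twice as large as batch-compressed ones), here is a toy shell demonstration, not rsyslog itself, of why compressing in small chunks does worse: every gzip stream starts with an empty dictionary, so short records cannot exploit redundancy across the whole file. The fake log content is made up.

```shell
# generate a fake log: 4000 similar lines
workdir=$(mktemp -d)
seq 1 4000 | sed 's/^/2015-02-09T15:28:01 archiver1 app[123]: event id=/' \
    > "$workdir/fake.log"

# batch: one gzip stream over the whole file
batch_bytes=$(gzip -9 -c "$workdir/fake.log" | wc -c)

# "on the fly": a separate gzip stream per 10-line chunk, sizes summed
split -l 10 "$workdir/fake.log" "$workdir/chunk."
chunked_bytes=0
for f in "$workdir"/chunk.*; do
    chunked_bytes=$((chunked_bytes + $(gzip -9 -c "$f" | wc -c)))
done

echo "batch: $batch_bytes bytes, chunked: $chunked_bytes bytes"
rm -rf "$workdir"
```

The chunked total comes out substantially larger, which matches both the "compressed the file by something like 5%" observation and David's point about ziplevel mattering less than chunk size.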
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT.

