I'm surprised you need to bother with the HUP if you are using dynafiles to 
write to distinct quarter-hour output files.  Does lsof indicate that rsyslog 
really does still have those older files open?  Even so, I wouldn't expect the 
extra HUP once every 15 minutes to have much of an impact unless you have very 
high log volume since the HUP would pause rsyslog briefly to flush all output 
queues and close/reopen files.  (Maybe HUP doesn't flush the queues; I don't 
recall...?)
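If you want to double-check, something like the following lists the files rsyslogd still holds open for writing; the write-mode FD filter and the idea of grepping for stale quarter-hour names are just illustrative, so adapt to your paths:

```shell
# List files rsyslogd currently holds open in write mode. Stale
# quarter-hour files still showing up here would confirm the handles
# are not released without a HUP.
pid=$(pidof -s rsyslogd 2>/dev/null || true)
if [ -n "$pid" ] && command -v lsof >/dev/null 2>&1; then
    # Column 4 of lsof output is the FD; a trailing "w" means write mode.
    lsof -p "$pid" | awk '$4 ~ /^[0-9]+w$/ {print $NF}'
else
    echo "rsyslogd not running (or lsof not installed)"
fi
```

If only the current quarter-hour's file appears, the HUP is likely redundant.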

What I suspect is more likely is that something else is contending for 
bandwidth over the link the VPN uses.  If you can catch it in the act, does 
running iperf between the two sites reveal lower throughput?  (You'll need to 
do a similar test outside of the problem window as a baseline of course.)  
We've run into a similar situation where a remote rsyslog forwarder was sending 
data via TCP through a VPN connection back to the central rsyslog collector, 
and the outbound Internet bandwidth available for rsyslog to use was often very 
small.  The input data rate of new logs at the remote forwarder simply exceeded 
the achievable output rate over the VPN link, so it would start falling behind 
and spool to disk.  Sometimes it would be able to recover as more bandwidth 
became available, but often not, and this would lead to the main queue filling 
up and ultimately dropping (all visible in impstats output).  This was with the 
VPN link doing inline compression at around 2:1, but it still wasn't sufficient.

Our solution to the "remote sender with low outbound bandwidth" issue was to 1) 
make sure we weren't forwarding things that we really didn't need, and 2) 
enable stream compression.  We initially set up a 60-second compressed batching 
solution using output dynafiles from rsyslog that were gzip'd and transferred 
by another process (then consumed on the other side, uncompressed, and 
re-injected into the log stream), and it was effective.  However, that solution 
was complex (with many moving parts that could break), and stream compression 
in rsyslog performs even better than this, especially for long-lived TCP 
connections between the two rsyslog nodes.

Note that enabling stream compression on the receiving side is all-or-nothing 
for an input.  You cannot have multiple remote devices sending a mix of 
compressed and uncompressed data to the same input queue on the receiving 
server.  But if your case is literally just one sender and one receiver, that 
should not be an issue for you.  (You can always move the compressed TCP stream 
to another port and give it its own input queue and even ruleset.)
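For what it's worth, that split looks roughly like the sketch below in current rsyslog config syntax. The hostname, port, file path, and ruleset name are made up, and the compression parameter names are from recent omfwd/imptcp module docs, so verify them against your installed version:

```
# Sender (remote forwarder): compress the long-lived TCP stream.
action(type="omfwd"
       target="collector-b.example.com" port="10515" protocol="tcp"
       compression.mode="stream:always"
       ziplevel="9")

# Receiver (central collector): dedicated port and ruleset for the
# compressed stream, since compression is all-or-nothing per input.
module(load="imptcp")
input(type="imptcp" port="10515"
      compression.mode="stream:always"
      ruleset="compressed_in")

ruleset(name="compressed_in") {
    action(type="omfile" file="/var/log/remote/forwarder.log")
}
```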

--
Dave Caplinger, Director of Architecture | Ph: (402) 361-3063 | Solutionary — 
An NTT Group Security Company

> On Feb 9, 2015, at 5:57 PM, Chastity Blackwell 
> <[email protected]> wrote:
> 
> Well, optimally we'd like to get the batch compression done; the problem
> is that with on-the-fly compression, we had to turn off the "safe
> zipping" because otherwise the compression was essentially nonexistent
> (I believe it reduced the file size by something like 5%, which was
> practically worthless). This is why we sometimes see the files getting
> corrupted, I believe.
> 
> The way our logs work, we write to logfiles using the dynamic filename
> method, with a new logfile on every quarter hour. At one minute past the
> quarter hour, we HUP rsyslog and then compress the logs that are more
> than an hour stale (this is to prevent problems around daylight saving time,
> something that will hopefully be fixed by moving to UTC, another thing
> on my list for the future). So there should be no problems with rsyslog
> not releasing the filehandle or anything like that before the
> compression.
> 
> Conveniently, I do have impstats enabled, but putting it through the
> analyzer tool gives me a graph I have no idea how to interpret (and that
> does not seem to make any sense). The only other log messages I see from
> rsyslog in the logfiles are when rsyslog is HUPed by the compression
> script.
> 
> The queues are growing, that much I can see, and I know it is only the
> queue for the cross-datacenter traffic because when it eventually gets
> so bad it has to write files to disk, they are all logs destined for the
> other datacenter.
> 
> On Mon, 2015-02-09 at 15:28 -0800, David Lang wrote:
>> I'd be very interested to hear if moving to v8 addresses the
>> compressed-files-corrupted-on-shutdown problem (this should not be happening,
>> but it's not worth troubleshooting much on the old version). Having rsyslog
>> write the data compressed should result in significantly less I/O than
>> writing them uncompressed (the compression is all done in RAM), although,
>> as you noted, rsyslog doesn't compress in as large a chunk as batch
>> compressors, so it doesn't get as good a compression ratio (although, did
>> you try setting ziplevel 9, or did you leave it at the default of 5?)
>> 
>> One thing that could be causing you problems is that it takes longer to
>> finish writing the logs when you do a restart; put a sleep between sending
>> the HUP to rsyslog and moving/compressing the files it has created.
>> 
>> In terms of your queueing issues, the first thing to do whenever you have
>> queueing problems is to make sure you have impstats enabled. If you set it
>> up with the default format, writing to a file with cumulative counters
>> (basically the default), there is an automated analysis page on rsyslog.com
>> that will look through them and point out problems for you.
>> 
>> I would look at the impstats output for the output blocking and the queues
>> starting to grow. Then I would look for any logs from rsyslog around that
>> time (if the connection is failing, you will see an error where the output
>> is being suspended and restarted). If it's not halting entirely, but merely
>> slowing to a crawl, you won't see the output suspended; its queue size will
>> be growing, but it will still be processing messages.
>> 
>> Once you see what's happening there, it will be much easier to figure out 
>> why.
>> 
>> David Lang
>> 
>> On Mon, 9 Feb 2015, Chastity Blackwell wrote:
>> 
>>> I'm running into a strange spooling problem with our rsyslog
>>> infrastructure and I'm not quite sure where to start poking to try and
>>> figure out the issue, so I figure turning to the mailing list is the
>>> next place to go now that I'm out of ideas.
>>> 
>>> We have two datacenters, A and B. Each one has 100 tomcat servers as a
>>> frontend, which generate business events that are fed back to a load
>>> balancer sitting in front of 8 archiver/forwarder boxes. Each
>>> archiver/forwarder takes in the event stream from the tomcat servers,
>>> writes a copy to disk, then forwards a complete copy of the event stream
>>> to the load balancer in the other datacenter, as well as to two hosts in
>>> their datacenter which do real-time analytics on it. Traffic between the
>>> datacenters travels over a VPN. Everything is running rsyslog 7.4.9 atm
>>> (one of my projects for this quarter is to update to v8) with TCP
>>> logging. Datacenter A handles roughly 20% more traffic than datacenter B
>>> on average.
>>> 
>>> Previously, I had a configuration where the archivers were writing their
>>> event streams to disk compressed on the fly. This was a legacy
>>> configuration from when we had fewer archivers/forwarders, so I/O
>>> contention would be a problem when doing massive amounts of compression.
>>> However, this caused some issues because sometimes when rsyslog was
>>> restarted, the gzip headers/footers would get written incorrectly and
>>> corrupt the compressed file, plus the files were about twice as large as
>>> they would be if we used a batch compression method.
>>> 
>>> Two weeks ago, I changed the configuration so that rsyslog would write
>>> its logs uncompressed. Every fifteen minutes, a cron job HUPs the
>>> rsyslog process, then compresses the uncompressed log files. Another
>>> cron job a few minutes later uploads the compressed files to S3.
>>> 
>>> Over the last two weeks, we have seen behavior where the event stream
>>> from the archivers/forwarders in datacenter A to the load balancer in
>>> datacenter B will start spooling, sometimes for hours; however, the
>>> streams to the two analytics boxes locally in datacenter A do not seem
>>> to be affected. Nor does datacenter B have any problem sending its logs
>>> to datacenter A, except when the spooling gets bad enough that the
>>> archivers in datacenter A write spoolfiles to disk -- and then B quickly
>>> recovers once the spoolfiles are finished writing to disk. The issue
>>> does not appear to happen at any particular time of day (sometimes it's
>>> in the morning, other times in the afternoon) and it doesn't appear to
>>> closely correlate with traffic, though it does only happen during the
>>> day, when our traffic is highest overall.
>>> 
>>> My first thought was a problem with the VPN, but that does not appear to
>>> be the case; transferring a file between A and B with scp, for instance,
>>> works just fine, and there is no significant problem with latency or
>>> packet loss. My next thought was that the batch compression method was
>>> causing I/O contention in datacenter B, which was causing rsyslog on its
>>> archiver/forwarders to be unable to take messages; however, iostat
>>> reports that %util peaks at ~16% on the hosts in datacenter B, compared
>>> to nearly 25% in datacenter A (so if that were the problem, I would
>>> assume B would have trouble sending to A, not vice versa). Furthermore,
>>> if the issue *is* with contention on the datacenter A
>>> archivers/forwarders, why is their stream to the analytics hosts not
>>> backing up as well?
>>> 
>>> Tomorrow I will likely test just turning off the compression job for a
>>> while to see if that makes this problem disappear (assuming the problem
>>> shows up at all), but I don't really know what else to look at to
>>> determine the cause of the issue. Does anyone have suggestions on what
>>> the issue could be or where else to poke? The only thing I can think of
>>> that I haven't tried (or considered trying) yet is turning on rsyslog
>>> debugging, and only because I suspect it will be hard to pull the signal
>>> from the noise.
>>> 
>>> 
>>> _______________________________________________
>>> rsyslog mailing list
>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>>> http://www.rsyslog.com/professional-services/
>>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
>>> sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T 
>>> LIKE THAT.