Re: [rsyslog] rsyslog frequently queuing to disk when it should be sending over the network

Dan Finn Wed, 04 Dec 2013 09:27:26 -0800

I’m thinking it’s most likely something around #3.  :)

I don’t think it’s a network- or F5-related problem as far as I can tell.
For example, right now we have a server that is writing logs to the local
spool.  I ran tcpdump and I can see rsyslog talking to the destination
servers just fine but the spool is slowly growing.  According to netstat
rsyslog is only making 1 TCP connection to the VIP on the F5 and it seems
to be able to pass traffic through that connection.
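
To put a number on "slowly growing", one option is to sample the file count under the spool directory at intervals and compute the net growth. The sketch below is only an illustration: the function names and the sample figures are made up, and it assumes the spool lives in /var/spool/rsyslog as mentioned in the original post.

```python
import os


def spool_file_count(spool_dir):
    """Count regular files directly under spool_dir (no recursion)."""
    return sum(1 for entry in os.scandir(spool_dir) if entry.is_file())


def growth_per_minute(samples):
    """Net growth in files per minute between the first and last sample.

    samples is a list of (seconds, file_count) tuples, oldest first.
    """
    (t0, n0), (t1, n1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (n1 - n0) * 60.0 / (t1 - t0)


# Hypothetical samples taken one minute apart, e.g. by calling
# spool_file_count("/var/spool/rsyslog") from a small loop.
samples = [(0, 120), (60, 126), (120, 131), (180, 138)]
rate = growth_per_minute(samples)
print("net spool growth: %.1f files/minute" % rate)
```

A steady positive rate while tcpdump shows traffic flowing would suggest that messages arrive faster than they are forwarded, rather than that forwarding has stopped outright.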

DAN FINN
Linux System Administrator
 
Office: 801-746-7580 ext. 5381
Mobile: 801-609-4705
[email protected]
 
Backcountry.com <http://www.backcountry.com/>
Competitive Cyclist <http://www.competitivecyclist.com/>
RealCyclist.com <http://www.realcyclist.com/>
Dogfunk.com <http://www.dogfunk.com/>
SteepandCheap.com <http://www.steepandcheap.com/>
Chainlove.com <http://www.chainlove.com/>
WhiskeyMilitia.com <http://www.whiskeymilitia.com/>

On 12/4/13, 9:31 AM, "Dave Caplinger" <[email protected]> wrote:

>Without impstats output it's hard to say for sure, but since your config
>is so succinct, you are getting a lot of default buffer sizes and
>watermark parameters.  I see you have $ActionResumeRetryCount set to -1
>for infinite retries (which is good).  Note though that default high and
>low water marks are 8,000 and 2,000 messages, respectively.  So once you
>get into disk-assisted mode, you won't leave it until the action queue
>gets all the way down to 2,000 messages.  The default action queue size
>will be 10,000 messages, and that's really not very much, especially in
>an environment that has significant spikes in volume.
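
[Editorial sketch, not part of Dave's message.] The watermark behavior described above can be pictured as simple hysteresis: disk mode starts at the high watermark and ends only at the low one. The toy model below uses the 8,000 and 2,000 figures from the text. The per-tick arrival and drain rates are invented for illustration, and the model is a simplification, not rsyslog's actual queue code.

```python
def simulate(arrivals, drain_per_tick, high=8000, low=2000):
    """Toy hysteresis model of a disk-assisted queue.

    Returns one (depth, disk_mode) tuple per tick. Disk mode switches on
    when depth reaches `high` and off only when depth falls to `low`.
    """
    depth = 0
    disk_mode = False
    history = []
    for arriving in arrivals:
        depth = max(0, depth + arriving - drain_per_tick)
        if not disk_mode and depth >= high:
            disk_mode = True
        elif disk_mode and depth <= low:
            disk_mode = False
        history.append((depth, disk_mode))
    return history


# Hypothetical load: a 3-tick burst of 5000 msgs/tick, then 1000 msgs/tick,
# while the receiver drains a constant 2000 msgs/tick.
arrivals = [5000] * 3 + [1000] * 10
history = simulate(arrivals, drain_per_tick=2000)
disk_ticks = sum(1 for _, on_disk in history if on_disk)
print("ticks spent in disk mode:", disk_ticks)
```

With these made-up rates the burst lasts 3 ticks, but the queue stays in disk mode for 7 ticks, because it has to drain from 9,000 all the way down to 2,000 before leaving it.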
>
>The other possibilities that come to mind are:
>
>1) that the F5 is correctly sending to an rsyslog server that isn't
>listening any more for some reason
>
>If the receiving side's TCP session gets stuck, or something else goes
>wrong but the F5 doesn't know it, the hashing algorithm will continue to
>send traffic to the same (dead) destination.  TCP default timeouts are 2
>minutes; this can seem like an eternity when digging through packet
>captures.  So on the sending side, perhaps it sends a SYN trying to open
>the session, and then nothing happens for 2 minutes before it tries all
>over again?
>
>2) perhaps there's something else in the network breaking the TCP
>session, such as a firewall doing NAT
>
>I've seen cases before where the NAT-ing firewall would time-out
>translated IP addresses after a certain period, breaking long-running
>sessions.  The Cisco PIX/ASA, for example, has both idle
>address-translation timeouts, as well as total duration timeouts.  So
>even a currently in-use session can still be affected by something like
>this.
>
>3) maybe there is some odd behavior in v4 of rsyslog pertaining to this
>situation that has long since been fixed :-)
>
>Not pointing fingers; I just don't have a lot of experience with rsyslog
>that old so I'm just speculating.
>
>--
>Dave Caplinger, Director of Architecture  |  402.361.3063
>Solutionary  |  Relevant  .  Intelligent  .  Security
>
>On Dec 3, 2013, at 6:10 PM, Dan Finn <[email protected]> wrote:
>
>> I’ve done that and I’ve seen 2 things happen during these periods where
>> files are being written locally.
>> 
>> 1) Nothing at all was attempted to be sent to the remote destination.
>> Using telnet I could make a connection just fine but rsyslog wasn’t even
>> attempting to send or talk to the destination server over TCP 514.
>> Message queue was growing extremely fast.  I can’t explain it but on the
>> 2nd or 3rd restart it started talking to the remote again and began
>> flushing out the queue.
>> 2) Lots of traffic is going to the remote over TCP 514.  The queue is
>> slowly growing but growing at a consistent rate.  This is the most
>> common situation; I’ve only seen situation #1 once.  I don’t see any
>> errors or retries or anything like that.
>> 
>> On 12/3/13, 5:01 PM, "David Lang" <[email protected]> wrote:
>> 
>>> On Tue, 3 Dec 2013, Erik Steffl wrote:
>>> 
>>>> we have a sort of similar problem, in our case it's Amazon Elastic Load
>>>> Balancer (ELB) that somehow causes the connection to go "bad" if there
>>>> is no traffic for 5 min (not sure what the exact time is, 1 minute is
>>>> ok, 5 minutes is not).
>>>> 
>>>> not sure what going "bad" actually means (still investigating) but the
>>>> data is not going through, rsyslog sends data but there is no
>>>> response... it recovers eventually but not sure what exactly triggers
>>>> the recovery (sending more messages is what triggers it but how exactly
>>>> is not clear).
>>>> 
>>>> It's not the same case but maybe you can look into the VIP and
>>>> connections and see what happens there, maybe use strace to see what
>>>> the responses are when rsyslogd sends data to the destination...
>>> 
>>> or use tcpdump to watch the traffic over the network.
>>> 
>>> David Lang
>>> 
>>>>    erik
>>>> 
>>>> On 12/03/2013 01:12 PM, Dan Finn wrote:
>>>>> I had kind of wondered about that as well but I have a few reasons that
>>>>> make it seem like that is not the case.
>>>>> 
>>>>> The “central server” is actually a VIP on our F5 load balancer with 4
>>>>> rsyslog destination servers behind it.  We have about 200 servers in
>>>>> our environment and during these busy times the only servers that ever
>>>>> seem to log locally are the postgres servers.  The volume of logs being
>>>>> written on these servers is certainly much higher than anywhere else.
>>>>> My theory is that the rsyslog “client” is not keeping up with the sheer
>>>>> volume on these servers during the busy times but until I can find some
>>>>> concrete info that is just a theory.
>>>>> 
>>>>> We are looking at upgrading to v7 but unfortunately that’s not going to
>>>>> be a quick fix.  I was hoping maybe there was an issue in my config or
>>>>> something that could be tweaked but it sounds like maybe that is not
>>>>> the case?
>>>>> 
>>>>> I did capture some debug output while this was happening.
>>>>> Unfortunately it was pretty large so I don’t know if I can share the
>>>>> whole thing but is there anything in particular I would be looking for
>>>>> in there?  I see that it says it’s writing the files locally but I
>>>>> didn’t see where it says why.
>>>>> 
>>>>> Thanks,
>>>>> Dan
>>>>> 
>>>>> On 12/3/13, 3:03 AM, "David Lang" <[email protected]> wrote:
>>>>> 
>>>>>> you are sending the logs via TCP, which means that if the system you
>>>>>> are sending logs to gets backed up, logs will queue on the sending
>>>>>> system, spilling to disk as needed.
>>>>>> 
>>>>>> the bottleneck is probably on the central server, but we have no info
>>>>>> about what it's doing.
>>>>>> 
>>>>>> The go-to tool for diagnosing this sort of problem is the impstats
>>>>>> module, but I don't think that existed back in the 4.x days, and
>>>>>> tracking down the bottleneck without it is significantly harder. Is
>>>>>> there any way you can upgrade to a current version?
>>>>>> 
>>>>>> David Lang
>>>>>> 
>>>>>>  On Mon, 2 Dec 2013, Dan Finn wrote:
>>>>>> 
>>>>>>> Date: Mon, 2 Dec 2013 20:53:54 +0000
>>>>>>> From: Dan Finn <[email protected]>
>>>>>>> Reply-To: rsyslog-users <[email protected]>
>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>> Subject: [rsyslog] rsyslog frequently queuing to disk when it should be
>>>>>>>     sending over the network
>>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> I’m trying to get some insight into an issue that we have been seeing
>>>>>>> quite a bit.  We have some postgres servers that are quite verbose.
>>>>>>> When the servers get busy we have an issue where they queue their logs
>>>>>>> locally instead of sending over the network; however, I can’t find any
>>>>>>> reason why that would be, at least not from an OS resource standpoint.
>>>>>>> We are running rsyslog4-4.8.0-1.ius.el5.  This is my config from the
>>>>>>> client that was having issues: http://pastebin.com/n3XpRdMm.
>>>>>>> 
>>>>>>> I watched it queue about 10k files under /var/spool/rsyslog before I
>>>>>>> finally had to manually delete them because the disk was filling up.
>>>>>>> 
>>>>>>> What’s the best way to get some insight into why this might be
>>>>>>> happening?  Is there a way I can enable some debug logging for the
>>>>>>> rsyslog process itself?  Any settings in our config that could be
>>>>>>> tweaked?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Dan
>>>>>>> _______________________________________________
>>>>>>> rsyslog mailing list
>>>>>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>>>>>>> http://www.rsyslog.com/professional-services/
>>>>>>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>>>>>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
>>>>>>> myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT
>>>>>>> POST
>>>>>>> if you DON'T LIKE THAT.
