Re: [rsyslog] rsyslog frequently queuing to disk when it should be sending over the network

Dan Finn Wed, 04 Dec 2013 09:59:38 -0800

If I upgrade to v7 on the central servers can I reuse those configs?



DAN FINN
Linux System Administrator
 
Office: 801-746-7580 ext. 5381
Mobile: 801-609-4705
[email protected]
 
Backcountry.com <http://www.backcountry.com/>
Competitive Cyclist <http://www.competitivecyclist.com/>
RealCyclist.com <http://www.realcyclist.com/>
Dogfunk.com <http://www.dogfunk.com/>
SteepandCheap.com <http://www.steepandcheap.com/>
Chainlove.com <http://www.chainlove.com/>
WhiskeyMilitia.com <http://www.whiskeymilitia.com/>





On 12/4/13, 10:56 AM, "David Lang" <[email protected]> wrote:

>with a quick glance at things
>
>you are doing a lot of dynamic filename templates, since you do not change
>the default dynafile cache size (and I don't know if you can on that
>ancient a version), rsyslog is spending a LOT of time syncing, closing,
>and opening files.
>
>Also, you are extensively using the if..then style filters, those are much
>slower than other filters on versions prior to 7.x
>
>So it's probably the case that if you upgrade your central servers to a
>current version, and set a large enough DynaFileCacheSize your performance
>problems will disappear.
>
>David Lang
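
The dynafile cache point above can be illustrated with a toy model. This is
a sketch in Python, not rsyslog code: it assumes the cache of open file
handles evicts the least recently used entry, and the function name
count_file_opens, the file names, the round-robin message stream and the
cache sizes 10 and 100 are all hypothetical values chosen for illustration.

```python
from collections import OrderedDict


def count_file_opens(filenames, cache_size):
    """Count the open() calls needed when handles are kept in an LRU cache."""
    cache = OrderedDict()  # filename -> placeholder for an open file handle
    opens = 0
    for name in filenames:
        if name in cache:
            cache.move_to_end(name)    # cache hit: mark as most recently used
            continue
        opens += 1                     # cache miss: the file has to be opened
        cache[name] = True
        if len(cache) > cache_size:
            cache.popitem(last=False)  # evict (close) the least recently used
    return opens


if __name__ == "__main__":
    # Assumption: 50 hosts, each with its own dynamic file name, logging
    # in strict rotation, 20 messages per host.
    stream = [f"/var/log/remote/host{i}.log" for i in range(50)] * 20
    print("cache of 10 entries: ", count_file_opens(stream, 10), "opens")
    print("cache of 100 entries:", count_file_opens(stream, 100), "opens")
```

In this model a cache smaller than the set of active file names misses on
every message, so each write pays for a close and an open, while a cache
larger than the working set opens each file only once.
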
>
>On Wed, 4 Dec 2013, Dan Finn wrote:
>
>> OK, now we might be onto something.  I can’t determine exactly which
>> remote machine the client is hitting because it’s going through the F5,
>> so what I did was take a look at the stats on the F5 and pick the busiest
>> remote server.  There is an rsyslog thread on there that is hovering at
>> very close to 100% CPU.
>>
>> Here’s the config from our destination servers.  They all share an
>> identical config.  http://pastebin.com/35K9gw97
>>
>> On 12/4/13, 10:41 AM, "David Lang" <[email protected]> wrote:
>>
>>> Ok, then the question is how fast is the receiving machine accepting
>>> messages. unless you have an unusually complex template, you should be
>>> able to send messages very fast.
>>>
>>> But if the receiving machine is not processing messages fast enough there
>>> will be a buildup. but if all it's doing is writing to local files (and
>>> you aren't doing a lot of dynamic filename stuff) it's unlikely that it
>>> should be that slow.
>>>
>>> you could look at what the different threads are doing using top
>>> (remember to hit 'H' to see the threads) and if one or more threads is
>>> maxing out the CPU, you can then look at the batching settings.
>>>
>>> But I really don't think the sending machine is the bottleneck, if it was
>>> it wouldn't be able to write the queue files either.
>>>
>>> David Lang
>>>
>>> On Wed, 4 Dec 2013, Dan Finn wrote:
>>>
>>>> I’m thinking it’s most likely something around #3.  :)
>>>>
>>>> I don’t think it’s a network or F5 related problem as far as I can tell.
>>>> For example, right now we have a server that is writing logs to the local
>>>> spool.  I ran tcpdump and I can see rsyslog talking to the destination
>>>> servers just fine but the spool is slowly growing.  According to netstat
>>>> rsyslog is only making 1 TCP connection to the VIP on the F5 and it seems
>>>> to be able to pass traffic through that connection.
>>>>
>>>> On 12/4/13, 9:31 AM, "Dave Caplinger" <[email protected]> wrote:
>>>>
>>>>> Without impstats output it's hard to say for sure, but since your config
>>>>> is so succinct, you are getting a lot of default buffer sizes and
>>>>> watermark parameters.  I see you have $ActionResumeRetryCount set to -1
>>>>> for infinite retries (which is good).  Note though that default high and
>>>>> low water marks are 8,000 and 2,000 messages, respectively.  So once you
>>>>> get into disk-assisted mode, you won't leave it until the action queue
>>>>> gets all the way down to 2000 messages.  The default action queue size
>>>>> will be 10,000 messages, and that's really not very much, especially in
>>>>> an environment that has significant spikes in volume.
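
The watermark behaviour described above amounts to hysteresis: the queue
switches to disk at the high mark and only switches back at the low mark.
A minimal sketch in Python, assuming the 8,000 and 2,000 thresholds quoted
in this message and assuming inclusive comparisons at both marks. The
function name simulate_disk_assisted and the sample queue depths are
hypothetical, and this is not rsyslog's actual implementation.

```python
def simulate_disk_assisted(depths, high_watermark=8000, low_watermark=2000):
    """For each sampled queue depth, report whether disk-assisted mode is on."""
    in_da_mode = False
    states = []
    for depth in depths:
        if not in_da_mode and depth >= high_watermark:
            in_da_mode = True    # high mark reached: start spilling to disk
        elif in_da_mode and depth <= low_watermark:
            in_da_mode = False   # drained to the low mark: back to memory only
        states.append(in_da_mode)
    return states


if __name__ == "__main__":
    # Assumed queue depths over time: a spike followed by a slow drain.
    samples = [1000, 7999, 8000, 6000, 2001, 2000, 5000]
    for depth, da in zip(samples, simulate_disk_assisted(samples)):
        mode = "disk-assisted" if da else "in-memory"
        print(f"{depth:>6} messages queued -> {mode}")
```

Under these assumptions a queue that has dropped back to 6,000 messages is
still writing to disk, which matches the observation that spool files keep
appearing long after a spike has passed.
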
>>>>>
>>>>> The other possibilities that come to mind are:
>>>>>
>>>>> 1) that the F5 is correctly sending to an rsyslog server that isn't
>>>>> listening any more for some reason
>>>>>
>>>>> If the receiving side's TCP session gets stuck, or something else goes
>>>>> wrong but the F5 doesn't know it, the hashing algorithm will continue to
>>>>> send traffic to the same (dead) destination.  TCP default timeouts are 2
>>>>> minutes; this can seem like an eternity when digging through packet
>>>>> captures.  So on the sending side, perhaps it sends a SYN trying to open
>>>>> the session, and then nothing happens for 2 minutes before it tries all
>>>>> over again?
>>>>>
>>>>> 2) perhaps there's something else in the network breaking the TCP
>>>>> session, such as a firewall doing NAT
>>>>>
>>>>> I've seen cases before where the NAT-ing firewall would time-out
>>>>> translated IP addresses after a certain period, breaking long-running
>>>>> sessions.  The Cisco PIX/ASA, for example, has both idle
>>>>> address-translation timeouts, as well as total duration timeouts.  So
>>>>> even a currently in-use session can still be affected by something like
>>>>> this.
>>>>>
>>>>> 3) maybe there is some odd behavior in v4 of rsyslog pertaining to this
>>>>> situation that has long since been fixed :-)
>>>>>
>>>>> Not pointing fingers; I just don't have a lot of experience with rsyslog
>>>>> that old so I'm just speculating.
>>>>>
>>>>> --
>>>>> Dave Caplinger, Director of Architecture  |  402.361.3063
>>>>> Solutionary  |  Relevant  .  Intelligent  .  Security
>>>>>
>>>>> On Dec 3, 2013, at 6:10 PM, Dan Finn <[email protected]> wrote:
>>>>>
>>>>>> I’ve done that and I’ve seen 2 things happen during these periods where
>>>>>> files are being written locally.
>>>>>>
>>>>>> 1) Nothing at all was attempted to be sent to the remote destination.
>>>>>> Using telnet I could make a connection just fine but rsyslog wasn’t even
>>>>>> attempting to send or talk to the destination server over TCP 514.
>>>>>> Message queue was growing extremely fast.  I can’t explain it but on the
>>>>>> 2nd or 3rd restart it started talking to the remote again and began
>>>>>> flushing out the queue.
>>>>>> 2) lots of traffic is going to the remote over TCP 514.  The queue is
>>>>>> slowly growing but growing at a consistent rate.  This is the most common
>>>>>> situation, I’ve only seen situation #1 once.  I don’t see any errors or
>>>>>> retries or anything like that.
>>>>>>
>>>>>> On 12/3/13, 5:01 PM, "David Lang" <[email protected]> wrote:
>>>>>>
>>>>>>> On Tue, 3 Dec 2013, Erik Steffl wrote:
>>>>>>>
>>>>>>>> we have sort of similar problem, in our case it's Amazon Elastic Load
>>>>>>>> Balancer (ELB) that somehow causes the connection to go "bad" if there
>>>>>>>> is no traffic for 5 min (not sure what the exact time is, 1 minute is
>>>>>>>> ok, 5 minutes is not).
>>>>>>>>
>>>>>>>> not sure what going "bad" actually means (still investigating) but the
>>>>>>>> data is not going through, rsyslog sends data but there is no
>>>>>>>> response... it recovers eventually but not sure what exactly triggers
>>>>>>>> the recovery (sending more messages is what triggers it but how exactly
>>>>>>>> is not clear).
>>>>>>>>
>>>>>>>> It's not the same case but maybe you can look into VIP and connections
>>>>>>>> and see what happens there, maybe use strace to see what are the
>>>>>>>> responses when rsyslogd sends data to destination...
>>>>>>>
>>>>>>> or use tcpdump to watch the traffic over the network.
>>>>>>>
>>>>>>> David Lang
>>>>>>>
>>>>>>>>        erik
>>>>>>>>
>>>>>>>> On 12/03/2013 01:12 PM, Dan Finn wrote:
>>>>>>>>> I had kind of wondered about that as well but I have a few reasons
>>>>>>>>> that make it seem like that is not the case.
>>>>>>>>>
>>>>>>>>> The "central server" is actually a VIP on our F5 load balancer with 4
>>>>>>>>> rsyslog destination servers behind it.  We have about 200 servers in
>>>>>>>>> our environment and during these busy times the only servers that ever
>>>>>>>>> seem to log locally are the postgres servers.  The volume of logs being
>>>>>>>>> written on these servers is certainly much higher than anywhere else.
>>>>>>>>> My theory is that the rsyslog "client" is not keeping up with the sheer
>>>>>>>>> volume on these servers during the busy times but until I can find some
>>>>>>>>> concrete info that is just a theory.
>>>>>>>>>
>>>>>>>>> We are looking at upgrading to v7 but unfortunately that's not going to
>>>>>>>>> be a quick fix.  I was hoping maybe there was an issue in my config or
>>>>>>>>> something that could be tweaked but it sounds like maybe that is not
>>>>>>>>> the case?
>>>>>>>>>
>>>>>>>>> I did capture some debug output while this was happening.
>>>>>>>>> Unfortunately it was pretty large so I don't know if I can share the
>>>>>>>>> whole thing but is there anything in particular I would be looking for
>>>>>>>>> in there?  I see that it says it's writing the files locally but I
>>>>>>>>> didn't see where it says why.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Dan
>>>>>>>>>
>>>>>>>>> On 12/3/13, 3:03 AM, "David Lang" <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> you are sending the logs via TCP, which means that if the system you
>>>>>>>>>> are sending logs to gets backed up, logs will queue on the sending
>>>>>>>>>> system, spilling to disk as needed.
>>>>>>>>>>
>>>>>>>>>> the bottleneck is probably on the central server, but we have no info
>>>>>>>>>> about what it's doing.
>>>>>>>>>>
>>>>>>>>>> The go-to tool for diagnosing this sort of problem is the impstats
>>>>>>>>>> module, but I don't think that existed back in the 4.x days, and
>>>>>>>>>> tracking down the bottleneck without it is significantly harder. Is
>>>>>>>>>> there any way you can upgrade to a current version?
>>>>>>>>>>
>>>>>>>>>> David Lang
>>>>>>>>>>
>>>>>>>>>>  On Mon, 2 Dec 2013, Dan Finn wrote:
>>>>>>>>>>
>>>>>>>>>>> Date: Mon, 2 Dec 2013 20:53:54 +0000
>>>>>>>>>>> From: Dan Finn <[email protected]>
>>>>>>>>>>> Reply-To: rsyslog-users <[email protected]>
>>>>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>>>>> Subject: [rsyslog] rsyslog frequently queuing to disk when it should
>>>>>>>>>>>     be sending over the network
>>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I'm trying to get some insight into an issue that we have been seeing
>>>>>>>>>>> quite a bit.  We have some postgres servers that are quite verbose.
>>>>>>>>>>> When the servers get busy we have an issue where they queue their
>>>>>>>>>>> logs locally instead of sending over the network, however I can't
>>>>>>>>>>> find any reason why that would be, at least not from an OS resource
>>>>>>>>>>> standpoint.  We are running rsyslog4-4.8.0-1.ius.el5.  This is my
>>>>>>>>>>> config from the client that was having issues:
>>>>>>>>>>> http://pastebin.com/n3XpRdMm.
>>>>>>>>>>>
>>>>>>>>>>> I watched it queue about 10k files under /var/spool/rsyslog before I
>>>>>>>>>>> finally had to manually delete them because the disk was filling up.
>>>>>>>>>>>
>>>>>>>>>>> What's the best way to get some insight into why this might be
>>>>>>>>>>> happening?  Is there a way I can enable some debug logging for the
>>>>>>>>>>> rsyslog process itself?  Any settings in our config that could be
>>>>>>>>>>> tweaked?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Dan
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> rsyslog mailing list
>>>>>>>>>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>>>>>>>>>>> http://www.rsyslog.com/professional-services/
>>>>>>>>>>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>>>>>>>>>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
>>>>>>>>>>> myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT
>>>>>>>>>>> POST if you DON'T LIKE THAT.

