On Fri, 2 Dec 2016, Arik Mitschang wrote:

Hi Rsyslog users,

We have been periodically experiencing an issue with our rsyslog setup
where some RELP relay nodes appear to fill up their queue and stop
processing any messages.

Our log flow essentially is made up of a number of "clients" that send
messages over RELP to one or more "relay" layers which finally send to a
number of rsyslog processes which index messages in elasticsearch, for
example:

client -> relay -> relay (x2) -> indexer (x2) -> elasticsearch

Each relay sends to at least 2 rsyslog servers balancing messages
between them using config like:

if $$uptime % 2 == 0 then {
RELP A
}
else if $$uptime % 1 == 0 then {
RELP B
}

During peak times we are pushing about 25000 messages per second on each
of the most busy relays and indexers (limited by the indexing
operation). The "relays" do not write their queues to disk.

The problem has always been that one or more "relays" simply stop
forwarding. Inspection of the process shows memory usage higher than
on the others, as the queue is full. Normally, restarting the rsyslog
process clears the queue and resumes normal processing.

This looks like a bug, and perhaps gets triggered by some badly formed
or encoded incoming message or something (noting this is also a largely
Japanese environment), but I was curious if anyone here has experienced
similar or knows where to look or any suggestions how to get useful
information to report about this.

I appreciate any help you can give, thanks,

The problem is probably not on the system that stops forwarding messages, but rather on the system it is forwarding the messages to.

When the queues fill up, unless you have configured rsyslog to throw away messages, it will stop accepting any new messages because it can't put them in the queue. This is "working as designed" (one of these days I've got to sit down and finish writing my "how to make your logs unreliable" article :-)
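
For reference, "configured to throw away messages" would look roughly like the sketch below. The target host, port, and all the numbers are made up for illustration, not taken from your setup; the queue.* names are the standard action queue parameters. With a config like this a full queue drops messages instead of blocking the inputs, which is exactly the reliability trade-off mentioned above.

```
# Hypothetical forwarding action: host, port and sizes are placeholders.
module(load="omrelp")

action(type="omrelp" target="indexer-a.example.com" port="2514"
       queue.type="LinkedList"      # in-memory action queue
       queue.size="100000"          # max messages held in the queue
       queue.discardMark="80000"    # start discarding above this fill level
       queue.discardSeverity="5"    # discard notice/info/debug (severity >= 5)
       queue.timeoutEnqueue="0"     # when full, drop instead of waiting
)
```

Without the discard settings (the default discardSeverity is 8, meaning nothing is discarded), a full queue pushes back on whatever is feeding it, which matches the behavior you are seeing.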

What version are you running? There have been some Unicode-related fixes in the last few versions.

A couple of things to do would be:

1. Make sure you have impstats enabled, and since you are having problems delivering messages, make sure it either uses a different ruleset (with a queue) or writes a file to disk so that you don't risk the pstats data getting stuck as well.

2. As a debugging tool, consider writing the logs to disk before forwarding them. You don't need to keep a very long history of them, but seeing the message that rsyslog was trying to send could be very helpful.

3. Look at the systems receiving the messages to see if anything odd happens there around the time that things start failing.
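
Items 1 and 2 could be sketched roughly as follows. The file paths, the 60-second interval, and the target host and port are placeholders I made up; adjust them to your environment. The local debug file will grow quickly at your message rates, so it needs to be rotated aggressively by something outside rsyslog (logrotate or similar).

```
# 1. impstats writing straight to a local file, bypassing the normal
#    message flow so the stats cannot get stuck behind a full queue.
module(load="impstats"
       interval="60"                          # seconds between reports (placeholder)
       severity="7"
       log.syslog="off"                       # do not inject stats into the syslog stream
       log.file="/var/log/rsyslog-stats.log"  # placeholder path
)

# 2. Keep a short-lived local copy of each message before forwarding it.
module(load="omrelp")

action(type="omfile" file="/var/log/relay-debug.log")                # placeholder path
action(type="omrelp" target="indexer-a.example.com" port="2514")     # placeholder target
```

When a relay stalls, the last lines of the debug file show what it was trying to send, and the stats file shows which queue filled up and when.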

David Lang