On Sun, 15 Dec 2013, Erik Steffl wrote:
On 12/15/2013 08:41 PM, David Lang wrote:
On Sun, 15 Dec 2013, Erik Steffl wrote:
I changed the config so the MARK messages are now sent (and received). BTW,
they use the kern.info facility; not sure why it's not the same as what
Rainer found in the source (syslog.info).
This solves the simple test case in which I only have a sender, a load
balancer (ELB), and a receiver. I straced both sender and receiver, see the
MARK messages, and the connection is fine, i.e. I send a log message using
e.g. logger, then wait 5 minutes, then send another one and it works.
Without MARK the second message never arrives.
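For reference, the MARK setup described above amounts to roughly the
following rsyslog configuration (a sketch only; the exact parameter names
differ between the legacy and RainerScript-style syntax, depending on
version):

```
# Load the mark-message input module; emit a "-- MARK --" message
# every 60 seconds.
module(load="immark" interval="60")

# Legacy-style equivalent:
# $ModLoad immark
# $MarkMessagePeriod 60
```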
However, in a more complex scenario this does not help at all. The complex
scenario looks like this (ASCII arrows are flows of syslog messages):
- 6 machines -> ELB-prod -> collector-prod
- 1 machine -> ELB-test -> collector-test
- every 5 minutes: collector-test -> ELB-prod -> collector-prod
- every 5 minutes: collector-prod -> ELB-prod -> collector-prod (yes,
program on collector-prod sends message to rsyslog on collector-prod
over ELB)
That 5-minute pause causes the connection to go stale somehow, which
results in periods of silence. I configured both collector-prod and
collector-test to send MARK messages (to collector-prod, since that's where
they send their regular messages) but I still see the periods of silence
(on both collectors). I used strace to verify that the MARK messages arrive
(I guess it's possible that I confused the MARK messages from
collector-prod and collector-test; will continue investigating on that
front).
However, adding a cron entry to send a few log messages every minute DOES
solve the problem (there is no silence anymore).
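For comparison, the cron-based keepalive that does work can be as simple as
one crontab line invoking logger (the tag and message text here are
placeholders, not taken from the actual setup):

```
# /etc/cron.d/syslog-keepalive: one message per minute via /dev/log
* * * * *  root  logger -p local0.info -t keepalive "keepalive"
```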
Any ideas why that would be? Is it possible that MARK messages are
being sent through different connections than other messages?
As far as I can tell, the only difference between the MARK messages and
the cron'd messages is that the MARK messages are generated by immark and
use the kern.info facility, while the cron'd messages arrive via /dev/log
and use the local0.info facility.
Any ideas why the simpler scenario would behave differently from the
complex scenario? Or why MARK messages do not solve the problem in the
complex scenario while cron'd messages do?
my guess is that you have some bug in your config that is not forwarding
the mark messages in the more complex case.
There really isn't a difference as far as this problem is concerned
between cron generating the message and immark generating the message.
As you note, it's just which process does the work.
I thought so too, but I am looking at an strace of the receiving syslog and
it is a bit strange. I definitely see the MARK messages coming through, as
well as the cron'd messages (each once a minute).
Since the messages have hostname in them I am pretty sure they are coming
from a given machine (collector-test sends them to collector-prod).
There seems to be something strange about the MARK messages. I see the
cron'd messages sent every minute, and each is one recvfrom syscall. I also
see some MARK messages like that, but then I see a recvfrom syscall that
contains a LOT of MARK messages (from the recent past).
I also do not see MARK messages being sent every minute, which I guess is
OK because they are only sent if there is no traffic. Then again, why are
they being sent at all if there are cron'd messages sent every minute?
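As an aside, several messages showing up in a single recvfrom() is normal
for stream sockets: the sender's message boundaries are not preserved, so
whatever has accumulated in the receive buffer can come back in one
syscall. A minimal sketch of this (using a local socketpair rather than
RELP over TCP, but the stream semantics are the same):

```python
import socket

# A stream socket delivers bytes, not messages: three separate sends
# queued on the sender side can come back from a single recv().
sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
for _ in range(3):
    sender.sendall(b"-- MARK --\n")

data = receiver.recv(4096)
print(data.count(b"-- MARK --"))  # all three marks arrive in one recv
```

So batching per se is not suspicious; what needs explaining is why the
batched marks are so old when they finally arrive.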
This is a problem. The idea that mark messages are only sent if there is no
other traffic is obsolete at this point. With the complexity of filtering
that's now available, having an input module like mark figure out when the
last message was sent to a particular output is next to impossible,
especially since the mark message may not get to that output anyway.
Somewhere around v3 or v4, mark was changed to just send at the specified
interval, so if they aren't going through, something is wrong.
Do the local logs show a mark every minute, with the missing ones only on
the remote side? Or are there gaps in the local copies as well?
Here's what it looks like (I added some newlines to make it slightly more
readable):
[pid 32602] 06:48:57.177317 recvfrom(149, "10 syslog 335
<6>2013-12-16T06:48:56.164906+00:00 ip-10-158-97-169 rsyslogd:
@cee:{\"message-text\":\"-- MARK
--\",\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"rsyslogd:\",\"programname\":\"rsyslogd\",\"priority\":\"kern.info\",\"timestamp\":\"2013-12-16T06:48:56.164906+00:00\"}}\n\n",
32768, MSG_DONTWAIT, NULL, NULL) = 350
[pid 32602] 06:49:03.241900 recvfrom(145, "409 syslog 389
<134>2013-12-16T06:49:01.597357+00:00 ip-10-158-97-169 logBurst[27810]:
@cee:{\"message-json\":{\"count\":\"0/1\",\"now\":\"Mon Dec 16 06:49:01
2013\"},\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"logBurst[27810]:\",\"programname\":\"logBurst\",\"priority\":\"local0.info\",\"timestamp\":\"2013-12-16T06:49:01.597357+00:00\"}}\n\n",
32768, MSG_DONTWAIT, NULL, NULL) = 405
[pid 32602] 06:50:03.187597 recvfrom(145, "410 syslog 389
<134>2013-12-16T06:50:01.793862+00:00 ip-10-158-97-169 logBurst[27813]:
@cee:{\"message-json\":{\"count\":\"0/1\",\"now\":\"Mon Dec 16 06:50:01
2013\"},\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"logBurst[27813]:\",\"programname\":\"logBurst\",\"priority\":\"local0.info\",\"timestamp\":\"2013-12-16T06:50:01.793862+00:00\"}}\n\n",
32768, MSG_DONTWAIT, NULL, NULL) = 405
[pid 32602] 06:50:59.426536 recvfrom(149, "2 syslog 335
<6>2013-12-16T06:40:55.748869+00:00 ip-10-158-97-169 rsyslogd:
@cee:{\"message-text\":\"-- MARK
--\",\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"rsyslogd:\",\"programname\":\"rsyslogd\",\"priority\":\"kern.info\",\"timestamp\":\"2013-12-16T06:40:55.748869+00:00\"}}\n\n3
syslog 335 <6>2013-12-16T06:41:55.808933+00:00 ip-10-158-97-169 rsyslogd:
@cee:{\"message-text\":\"-- MARK
--\",\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"rsyslogd:\",\"programname\":\"rsyslogd\",\"priority\":\"kern.info\",\"timestamp\":\"2013-12-16T06:41:55.808933+00:00\"}}\n\n4
syslog 335 <6>2013-12-16T06:42:55.869729+00:00 ip-10-158-97-169 rsyslogd:
@cee:{\"message-text\":\"-- MARK
--\",\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"rsyslogd:\",\"programname\":\"rsyslogd\",\"priority\":\"kern.info\",\"timestamp\":\"2013-12-16T06:42:55.869729+00:00\"}}\n\n5
syslog 335 <6>2013-12-16T06:43:55.895209+00:00 ip-10-158-97-169 rsyslogd:
@cee:{\"message-text\":\"-- MARK
--\",\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"rsyslogd:\",\"programname\":\"rsyslogd\",\"priority\":\"kern.info\",\"timestamp\":\"2013-12-16T06:43:55.895209+00:00\"}}\n\n6
syslog 335 <6>2013-12-16T06:44:55.916881+00:00 ip-10-158-97-169 rsyslogd:
@cee:{\"message-text\":\"-- MARK
--\",\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"rsyslogd:\",\"programname\":\"rsyslogd\",\"priority\":\"kern.info\",\"timestamp\":\"2013-12-16T06:44:55.916881+00:00\"}}\n\n7
syslog 335 <6>2013-12-16T06:45:55.980382+00:00 ip-10-158-97-169 rsyslogd:
@cee:{\"message-text\":\"-- MARK
--\",\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"rsyslogd:\",\"pro"...,
32768, MSG_DONTWAIT, NULL, NULL) = 3842
The timestamps here show that the messages are being generated correctly,
once a minute, but that they are being delivered in a batch for some
reason. This is looking like a potential RELP bug. I'm assuming that you
have made sure your systems are all set to the same time. These logs are
being delayed 10 minutes, and they are also arriving after other logs.
Now, one reason for this sort of thing could be that if you have multiple
worker threads, one thread could be 'claiming' a batch of messages,
including ones like these, but then as that worker thread goes through its
work, it could be getting stuck sending to some other destination. While
it's stuck, another worker thread grabs some of the newer messages and is
able to deliver them (it may not have any messages in its batch that have
to go to the 'stuck' destination).
The more I go through this explanation, the more likely I think it is to be
the problem.
Now the question is: what outputs do you have that could be getting stuck
like this? It could even be getting stuck on the RELP connections if the
load balancer is misbehaving badly enough.
what's your config look like here? I know you have a bunch of filtering and IIRC
different rulesets in use.
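One way to probe this stuck-output hypothesis would be to give the RELP
forwarding action a dedicated queue with a single worker thread, so a stall
on one destination cannot let another worker reorder its batches. A sketch
only; the target name is a placeholder and the availability of these
omrelp/queue parameters depends on the rsyslog version:

```
# Dedicated in-memory queue with one worker thread for the RELP forward,
# so messages to this destination cannot be reordered across workers.
action(type="omrelp" target="elb-prod.example.com" port="2514"
       queue.type="LinkedList"
       queue.workerThreads="1"
       action.resumeRetryCount="-1")
```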
[pid 32602] 06:51:02.460475 recvfrom(145, "411 syslog 389
<134>2013-12-16T06:51:01.985229+00:00 ip-10-158-97-169 logBurst[27816]:
@cee:{\"message-json\":{\"count\":\"0/1\",\"now\":\"Mon Dec 16 06:51:01
2013\"},\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"logBurst[27816]:\",\"programname\":\"logBurst\",\"priority\":\"local0.info\",\"timestamp\":\"2013-12-16T06:51:01.985229+00:00\"}}\n\n",
32768, MSG_DONTWAIT, NULL, NULL) = 405
06:48:57.177317 MARK
06:49:03.241900 logBurst (1 cron'd message)
06:50:03.187597 logBurst
06:50:59.426536 MARK timestamped 06:40:55.748869 etc.
06:51:02.460475 logBurst
The same group of MARK messages was sent at least once before (or at least
group with same first message).
I'm not seeing that in this example.
Why would the MARK messages be resent in a group (or sent for the first
time, maybe; I didn't get that far yet)?
It seems like one of two things is happening:
- MARK messages are for some reason grouped together, so they end up not
being sent over the wire every minute (and they don't prevent the
connection from going bad).
- MARK messages are sent but the RELP confirmation is not sent, so they
keep being resent. Note that the cron'd messages are NOT resent, so there
is nothing wrong with the connection and nothing preventing RELP
confirmations from being sent back.
Could this have anything to do with the RELP window, i.e. how many messages
are sent before a RELP confirmation is sent?
Possible, especially if the network is flaky and the confirmations are
getting lost too frequently.
But I think you have some outputs that are getting stuck, and the multiple
worker threads are reordering messages as they run into a message in a
batch that is causing them to get stuck.
David Lang
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE
THAT.