On Sun, 15 Dec 2013, Erik Steffl wrote:
On 12/15/2013 08:41 PM, David Lang wrote:
On Sun, 15 Dec 2013, Erik Steffl wrote:
I changed the config so the MARK messages are now sent (and received). BTW,
they use the kern.info facility; not sure why it's not the same as what
Rainer found in the source (syslog.info).
This solves the simple test case in which I only have a sender, a load
balancer (ELB), and a receiver. I straced both sender and receiver, see the
MARK messages, and the connection is fine, i.e. I send a log message using
e.g. logger, then wait 5 minutes, then send another one and it works.
Without MARK the second message never arrives.
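For reference, the MARK setup described above amounts to roughly the
following rsyslog configuration (a sketch only; the exact parameter names
differ between the legacy and RainerScript-style syntax, depending on
version):

```
# Load the mark-message input module; emit a "-- MARK --" message
# every 60 seconds.
module(load="immark" interval="60")

# Legacy-style equivalent:
# $ModLoad immark
# $MarkMessagePeriod 60
```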
However, in a more complex scenario this does not help at all. The complex
scenario looks like this (ASCII arrows are flows of syslog messages):
- 6 machines -> ELB-prod -> collector-prod
- 1 machine -> ELB-test -> collector-test
- every 5 minutes: collector-test -> ELB-prod -> collector-prod
- every 5 minutes: collector-prod -> ELB-prod -> collector-prod (yes,
program on collector-prod sends message to rsyslog on collector-prod
over ELB)
That 5-minute pause causes the connection to go stale somehow, which
results in periods of silence. I configured both collector-prod and
collector-test to send MARK messages (to collector-prod, since that's where
they send their regular messages) but I still see the periods of silence
(on both collectors). I used strace to verify that the MARK messages arrive
(I guess it's possible that I confused the MARK messages from
collector-prod and collector-test; will continue investigating on that
front).
However, adding a cron entry to send a few log messages every minute DOES
solve the problem (there is no silence anymore).
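For comparison, the cron-based keepalive that does work can be as simple as
one crontab line invoking logger (the tag and message text here are
placeholders, not taken from the actual setup):

```
# /etc/cron.d/syslog-keepalive: one message per minute via /dev/log
* * * * *  root  logger -p local0.info -t keepalive "keepalive"
```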
Any ideas why that would be? Is it possible that MARK messages are
being sent through different connections than other messages?
As far as I can tell, the only difference between the MARK messages and
the cron'd messages is that the MARK messages are generated by immark and
use the kern.info facility, while the cron'd messages arrive via /dev/log
and use the local0.info facility.
Any ideas why the simpler scenario would behave differently from the
complex scenario? Or why MARK messages do not solve the problem in the
complex scenario while cron'd messages do?
my guess is that you have some bug in your config that is not forwarding
the mark messages in the more complex case.
There really isn't a difference as far as this problem is concerned
between cron generating the message and immark generating the message.
As you note, it's just which process does the work.
I thought so too, but I am looking at an strace of the receiving syslog and
it is a bit strange. I definitely see the MARK messages coming through, as
well as the cron'd messages (each once a minute).
Since the messages have hostname in them I am pretty sure they are coming
from a given machine (collector-test sends them to collector-prod).
There seems to be something strange about the MARK messages. I see the
cron'd messages sent every minute, and each is one recvfrom syscall. I also
see some MARK messages like that, but then I see a recvfrom syscall that
contains a LOT of MARK messages (from the recent past).
I also do not see MARK messages being sent every minute, which I guess is
OK because they are only sent if there is no traffic. Then again, why are
they being sent at all if there are cron'd messages sent every minute?
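As an aside, several messages showing up in a single recvfrom() is normal
for stream sockets: the sender's message boundaries are not preserved, so
whatever has accumulated in the receive buffer can come back in one
syscall. A minimal sketch of this (using a local socketpair rather than
RELP over TCP, but the stream semantics are the same):

```python
import socket

# A stream socket delivers bytes, not messages: three separate sends
# queued on the sender side can come back from a single recv().
sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
for _ in range(3):
    sender.sendall(b"-- MARK --\n")

data = receiver.recv(4096)
print(data.count(b"-- MARK --"))  # all three marks arrive in one recv
```

So batching per se is not suspicious; what needs explaining is why the
batched marks are so old when they finally arrive.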
This is a problem. The idea that mark messages are only sent if there is no
other traffic is obsolete at this point. With the complexity of filtering
that's now available, having an input module like mark figure out when the
last message was sent to a particular output is next to impossible,
especially since the mark message may not get to that output anyway.
Somewhere around v3 or v4, mark was changed to just send at the specified
interval, so if they aren't going through, something is wrong.
Do the local logs show a mark every minute, with the missing ones only on
the remote side? Or are there gaps in the local copies as well?
Here's what it looks like (I added some newlines to make it slightly more
readable):
[pid 32602] 06:48:57.177317 recvfrom(149, "10 syslog 335
<6>2013-12-16T06:48:56.164906+00:00 ip-10-158-97-169 rsyslogd:
@cee:{\"message-text\":\"-- MARK
--\",\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"rsyslogd:\",\"programname\":\"rsyslogd\",\"priority\":\"kern.info\",\"timestamp\":\"2013-12-16T06:48:56.164906+00:00\"}}\n\n",
32768, MSG_DONTWAIT, NULL, NULL) = 350
[pid 32602] 06:49:03.241900 recvfrom(145, "409 syslog 389
<134>2013-12-16T06:49:01.597357+00:00 ip-10-158-97-169 logBurst[27810]:
@cee:{\"message-json\":{\"count\":\"0/1\",\"now\":\"Mon Dec 16 06:49:01
2013\"},\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"logBurst[27810]:\",\"programname\":\"logBurst\",\"priority\":\"local0.info\",\"timestamp\":\"2013-12-16T06:49:01.597357+00:00\"}}\n\n",
32768, MSG_DONTWAIT, NULL, NULL) = 405
[pid 32602] 06:50:03.187597 recvfrom(145, "410 syslog 389
<134>2013-12-16T06:50:01.793862+00:00 ip-10-158-97-169 logBurst[27813]:
@cee:{\"message-json\":{\"count\":\"0/1\",\"now\":\"Mon Dec 16 06:50:01
2013\"},\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"logBurst[27813]:\",\"programname\":\"logBurst\",\"priority\":\"local0.info\",\"timestamp\":\"2013-12-16T06:50:01.793862+00:00\"}}\n\n",
32768, MSG_DONTWAIT, NULL, NULL) = 405
[pid 32602] 06:50:59.426536 recvfrom(149, "2 syslog 335
<6>2013-12-16T06:40:55.748869+00:00 ip-10-158-97-169 rsyslogd:
@cee:{\"message-text\":\"-- MARK
--\",\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"rsyslogd:\",\"programname\":\"rsyslogd\",\"priority\":\"kern.info\",\"timestamp\":\"2013-12-16T06:40:55.748869+00:00\"}}\n\n3
syslog 335 <6>2013-12-16T06:41:55.808933+00:00 ip-10-158-97-169 rsyslogd:
@cee:{\"message-text\":\"-- MARK
--\",\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"rsyslogd:\",\"programname\":\"rsyslogd\",\"priority\":\"kern.info\",\"timestamp\":\"2013-12-16T06:41:55.808933+00:00\"}}\n\n4
syslog 335 <6>2013-12-16T06:42:55.869729+00:00 ip-10-158-97-169 rsyslogd:
@cee:{\"message-text\":\"-- MARK
--\",\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"rsyslogd:\",\"programname\":\"rsyslogd\",\"priority\":\"kern.info\",\"timestamp\":\"2013-12-16T06:42:55.869729+00:00\"}}\n\n5
syslog 335 <6>2013-12-16T06:43:55.895209+00:00 ip-10-158-97-169 rsyslogd:
@cee:{\"message-text\":\"-- MARK
--\",\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"rsyslogd:\",\"programname\":\"rsyslogd\",\"priority\":\"kern.info\",\"timestamp\":\"2013-12-16T06:43:55.895209+00:00\"}}\n\n6
syslog 335 <6>2013-12-16T06:44:55.916881+00:00 ip-10-158-97-169 rsyslogd:
@cee:{\"message-text\":\"-- MARK
--\",\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"rsyslogd:\",\"programname\":\"rsyslogd\",\"priority\":\"kern.info\",\"timestamp\":\"2013-12-16T06:44:55.916881+00:00\"}}\n\n7
syslog 335 <6>2013-12-16T06:45:55.980382+00:00 ip-10-158-97-169 rsyslogd:
@cee:{\"message-text\":\"-- MARK
--\",\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"rsyslogd:\",\"pro"...,
32768, MSG_DONTWAIT, NULL, NULL) = 3842
The timestamps here show that the messages are being generated correctly,
once a minute, but that they are being delivered in a batch for some
reason. This is looking like a potential RELP bug. I'm assuming that you
have made sure your systems are all set to the same time. These logs are
being delayed 10 minutes, and they are also arriving after other logs.
Now, one reason for this sort of thing could be that if you have multiple
worker threads, one thread could be 'claiming' a batch of messages,
including ones like these, but then as that worker thread goes through its
work, it could be getting stuck sending to some other destination. While
it's stuck, another worker thread grabs some of the newer messages and is
able to deliver them (it may not have any messages in its batch that have
to go to the 'stuck' destination).
The more I go through this explanation, the more likely I think it is to be
the problem.
Now the question is: what outputs do you have that could be getting stuck
like this? It could even be getting stuck on the RELP connections if the
load balancer is misbehaving badly enough.
what's your config look like here? I know you have a bunch of filtering and IIRC
different rulesets in use.
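One way to probe this stuck-output hypothesis would be to give the RELP
forwarding action a dedicated queue with a single worker thread, so a stall
on one destination cannot let another worker reorder its batches. A sketch
only; the target name is a placeholder and the availability of these
omrelp/queue parameters depends on the rsyslog version:

```
# Dedicated in-memory queue with one worker thread for the RELP forward,
# so messages to this destination cannot be reordered across workers.
action(type="omrelp" target="elb-prod.example.com" port="2514"
       queue.type="LinkedList"
       queue.workerThreads="1"
       action.resumeRetryCount="-1")
```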
[pid 32602] 06:51:02.460475 recvfrom(145, "411 syslog 389
<134>2013-12-16T06:51:01.985229+00:00 ip-10-158-97-169 logBurst[27816]:
@cee:{\"message-json\":{\"count\":\"0/1\",\"now\":\"Mon Dec 16 06:51:01
2013\"},\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"logBurst[27816]:\",\"programname\":\"logBurst\",\"priority\":\"local0.info\",\"timestamp\":\"2013-12-16T06:51:01.985229+00:00\"}}\n\n",
32768, MSG_DONTWAIT, NULL, NULL) = 405
06:48:57.177317 MARK
06:49:03.241900 logBurst (1 cron'd message)
06:50:03.187597 logBurst
06:50:59.426536 MARK timestamped 06:40:55.748869 etc.
06:51:02.460475 logBurst
The same group of MARK messages was sent at least once before (or at least
group with same first message).
I'm not seeing that in this example.
Why would the MARK messages be resent in a group (or sent for the first
time, maybe; I didn't get that far yet)?
It seems like one of two things is happening:
- MARK messages are for some reason grouped together, so they end up not
being sent over the wire every minute (and they don't prevent the
connection from going bad).
- MARK messages are sent but the RELP confirmation is not sent, so they
keep being resent. Note that the cron'd messages are NOT resent, so there
is nothing wrong with the connection and nothing preventing RELP
confirmations from being sent back.
Could this have anything to do with the RELP window, i.e. how many messages
are sent before a RELP confirmation is sent?
Possible, especially if the network is flaky and the confirmations are
getting lost too frequently.
But I think you have some outputs that are getting stuck, and the multiple
worker threads are reordering messages as they run into a message in a
batch that is causing them to get stuck.
David Lang
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE
THAT.