I keep trying to reduce the mysterious silences we experience when no
messages are sent for some time down to the simplest possible case. As
far as I can tell, the core of it (or at least as close as I have gotten
so far) is some kind of timeout related to the Amazon Elastic Load
Balancer (ELB), after which it takes rsyslog a long time to recover.
This is with rsyslog 7.5.6-0adiscon2 and the latest librelp
1.2.0-0adiscon2 from the Adiscon repo. I didn't use the librelp I built
from git master, since I'm not 100% sure it works properly, and I think
this is unrelated to what we looked at before (where all the traffic
stopped); I have now isolated the problem to what I think is a simpler,
more focused scenario.
So this extra-simple test setup looks like this:
- an ELB (ELB-receiver) with the receiver host behind it
- a sender that forwards messages received on /dev/log to the receiver
via ELB-receiver using RELP
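For context, the sender side is essentially just an omrelp action. A
minimal sketch of what the sender config looks like, assuming the
new-style syntax; "elb-receiver.example.com" and the port are
placeholders, not my real values:

```
# sender rsyslog.conf -- minimal sketch, hostnames are placeholders
module(load="imuxsock")   # pick up messages from /dev/log
module(load="omrelp")
action(type="omrelp"
       target="elb-receiver.example.com"   # the ELB's DNS name
       port="5140")
```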
As long as messages are flowing, this works fine, but if there is a gap
(5 minutes), something goes wrong and the sender rsyslog cannot send
messages anymore (it seems that enough activity later fixes this, but
I'm not sure why or how).
I ran both rsyslogds in debug mode (sudo rsyslogd -dn). I sent a
message to verify that it works, waited 5 minutes, and sent another
message that was never received (I waited 40 minutes).
Receiver debug logs:
https://drive.google.com/file/d/0B5zeQB_ZuLw0NlItcG80MHJGdlk/edit?usp=sharing
Sender debug logs:
https://drive.google.com/file/d/0B5zeQB_ZuLw0SlNhWGxvaFp3bE0/edit?usp=sharing
All messages have the tag loggerTest; there are only a few of them, so
they are easy to find.
Here are the crucial parts of the logs (as far as I can tell).
Sender (three messages total, sent to ELB-receiver):
9352.348965411:7faa2f273700: frame to send: '2 syslog 364
<134>2013-12-10T01:35:52.336324+00:00 ip-10-143-190-98 loggerTest:
@cee:{"message-text":" Tue Dec 10 01:35:52 UTC
2013","yummlyLogOrigin":{"supportLevel":"dev","system":"LOGS","cluster":"elbTest","role":"collector-2","host":"ip-10-143-190-98","tag":"loggerTest:","programname":"loggerTest","priority":"local0.info","timestamp":"2013-12-10T01:35:52.336324+00:00"}}
...
9377.184286552:7faa2f273700: frame to send: '3 syslog 364
<134>2013-12-10T01:36:17.182507+00:00 ip-10-143-190-98 loggerTest:
@cee:{"message-text":" Tue Dec 10 01:36:17 UTC
2013","yummlyLogOrigin":{"supportLevel":"dev","system":"LOGS","cluster":"elbTest","role":"collector-2","host":"ip-10-143-190-98","tag":"loggerTest:","programname":"loggerTest","priority":"local0.info","timestamp":"2013-12-10T01:36:17.182507+00:00"}}
... 5 min sleep ...
9677.200615063:7faa2f273700: frame to send: '4 syslog 364
<134>2013-12-10T01:41:17.198130+00:00 ip-10-143-190-98 loggerTest:
@cee:{"message-text":" Tue Dec 10 01:41:17 UTC
2013","yummlyLogOrigin":{"supportLevel":"dev","system":"LOGS","cluster":"elbTest","role":"collector-2","host":"ip-10-143-190-98","tag":"loggerTest:","programname":"loggerTest","priority":"local0.info","timestamp":"2013-12-10T01:41:17.198130+00:00"}}
Receiver (only the first two messages are seen):
9352.360516081:7f20ebdb5700: relp session read 378 octets, buf '2 syslog
364 <134>2013-12-10T01:35:52.336324+00:00 ip-10-143-190-98 loggerTest:
@cee:{"message-text":" Tue Dec 10 01:35:52 UTC
2013","yummlyLogOrigin":{"supportLevel":"dev","system":"LOGS","cluster":"elbTest","role":"collector-2","host":"ip-10-143-190-98","tag":"loggerTest:","programname":"loggerTest","priority":"local0.info","timestamp":"2013-12-10T01:35:52.336324+00:00"}}
...
9377.196178103:7f20ebdb5700: relp session read 378 octets, buf '3 syslog
364 <134>2013-12-10T01:36:17.182507+00:00 ip-10-143-190-98 loggerTest:
@cee:{"message-text":" Tue Dec 10 01:36:17 UTC
2013","yummlyLogOrigin":{"supportLevel":"dev","system":"LOGS","cluster":"elbTest","role":"collector-2","host":"ip-10-143-190-98","tag":"loggerTest:","programname":"loggerTest","priority":"local0.info","timestamp":"2013-12-10T01:36:17.182507+00:00"}}
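As a timing sanity check: the debug-log timestamps are plain seconds,
so the gaps between the sender's "frame to send" lines can be computed
directly (numbers copied from the sender log above):

```python
# "frame to send" timestamps (seconds) copied from the sender log
frames = {2: 9352.348965411, 3: 9377.184286552, 4: 9677.200615063}

gap_2_3 = frames[3] - frames[2]  # messages flowing normally
gap_3_4 = frames[4] - frames[3]  # spans the 5-minute sleep

print(f"gap 2->3: {gap_2_3:.1f}s, gap 3->4: {gap_3_4:.1f}s")
```

Frame 4 goes out almost exactly 300 s after frame 3, so the sender did
attempt the send on schedule; whatever goes wrong happens after that.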
So clearly the last message, sent after the 5-minute gap, was never
seen by the receiver. Here's what the sender logs when it tries to send
the third message:
9678.918894060:7faa2fa74700: librelp: done epoll_wait, nEvents:1
9678.919107278:7faa2fa74700: new connect on RELP socket #7
9678.919812242:7faa2fa74700: remote host is
'ip-10-2-61-142.ec2.internal', ip '10.2.61.142'
9678.919926630:7faa2fa74700: librelp: add socket 13 to epoll set (ptr
0x7faa20000cd0)
9678.920033052:7faa2fa74700: librelp: epoll_set_events sock 13, target
bits 01, current 01
9678.920135420:7faa2fa74700: librelp: doing epoll_wait
9678.920230075:7faa2fa74700: librelp: done epoll_wait, nEvents:1
9678.920332413:7faa2fa74700: relp session read 0 octets, buf ''
9678.920427762:7faa2fa74700: server closed relp session 0x7faa20000970,
session broken
9678.920523722:7faa2fa74700: relp session 13 iRet 10007, tearing it down
9678.920639556:7faa2fa74700: librelp: delete sock 13 from epoll set
9678.920739978:7faa2fa74700: hint-frame to send: '0 serverclose 0
'
9678.920858076:7faa2fa74700: in destructor: sendbuf 0x7faa20000d30
9678.920991517:7faa2fa74700: librelp: doing epoll_wait
9679.278263028:7faa2fa74700: librelp: done epoll_wait, nEvents:1
9679.278492564:7faa2fa74700: new connect on RELP socket #7
9679.279155908:7faa2fa74700: remote host is
'ip-10-2-61-142.ec2.internal', ip '10.2.61.142'
9679.279273086:7faa2fa74700: librelp: add socket 13 to epoll set (ptr
0x7faa20000cd0)
9679.279455323:7faa2fa74700: librelp: epoll_set_events sock 13, target
bits 01, current 01
9679.279557743:7faa2fa74700: librelp: doing epoll_wait
9679.279656208:7faa2fa74700: librelp: done epoll_wait, nEvents:1
9679.279760706:7faa2fa74700: relp session read 0 octets, buf ''
9679.279854582:7faa2fa74700: server closed relp session 0x7faa20000970,
session broken
9679.279952424:7faa2fa74700: relp session 13 iRet 10007, tearing it down
9679.280047300:7faa2fa74700: librelp: delete sock 13 from epoll set
9679.280135789:7faa2fa74700: hint-frame to send: '0 serverclose 0
'
9679.280223030:7faa2fa74700: in destructor: sendbuf 0x7faa20000d30
9679.280321888:7faa2fa74700: librelp: doing epoll_wait
9708.918019310:7faa2fa74700: librelp: done epoll_wait, nEvents:1
9708.918183627:7faa2fa74700: new connect on RELP socket #7
9708.919100708:7faa2fa74700: remote host is
'ip-10-2-61-142.ec2.internal', ip '10.2.61.142'
9708.919207876:7faa2fa74700: librelp: add socket 13 to epoll set (ptr
0x7faa20000cd0)
9708.919326358:7faa2fa74700: librelp: epoll_set_events sock 13, target
bits 01, current 01
9708.919412125:7faa2fa74700: librelp: doing epoll_wait
9708.919478880:7faa2fa74700: librelp: done epoll_wait, nEvents:1
9708.919550546:7faa2fa74700: relp session read 0 octets, buf ''
9708.919615405:7faa2fa74700: server closed relp session 0x7faa20000970,
session broken
9708.919680096:7faa2fa74700: relp session 13 iRet 10007, tearing it down
9708.919744131:7faa2fa74700: librelp: delete sock 13 from epoll set
9708.919813426:7faa2fa74700: hint-frame to send: '0 serverclose 0
'
That last part, which already repeats twice above, keeps repeating
until rsyslog is stopped (Ctrl-C, 40 minutes later).
It's not clear from the logs whether rsyslog closes the connection and
opens another one. But during this time all the individual parts are
working: both machines are fine, the ELB considers the receiver healthy,
and all network connections are fine. So closing the connection and
opening a new one should work.
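For what it's worth, "relp session read 0 octets" is simply EOF: the
peer (here, presumably the ELB) accepts the TCP connection and then
closes it without sending anything. A minimal local sketch of that
pattern in plain Python sockets (no RELP involved, just to reproduce the
symptom):

```python
import socket
import threading

# Listener that accepts one connection and closes it immediately,
# mimicking a middlebox that no longer wants to talk to us.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)

def accept_and_slam():
    conn, _ = srv.accept()
    conn.close()  # close without sending a single byte

t = threading.Thread(target=accept_and_slam)
t.start()

# Client side: the connect() itself succeeds...
cli = socket.create_connection(srv.getsockname())
# ...but the first read returns 0 bytes -- the same symptom as
# "relp session read 0 octets, buf ''" in the sender log.
data = cli.recv(4096)
cli.close()
t.join()
srv.close()
print("first read returned", len(data), "octets")
```

So a successful TCP handshake (as with a manual telnet) doesn't
contradict the log; the failure mode is the accepted connection being
closed right away.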
Rainer already mentioned that rsyslog closes and reopens the
connection if there is any problem, but then why would that fail? I can
open a connection from the same machine to the ELB (just using telnet
ELB 5140) and the connection seems fine. So why would rsyslog not be
able to open the connection?
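The manual telnet check can also be scripted, which makes it easier to
run repeatedly during the silent window. A small sketch (host and port
are whatever the ELB listener uses; nothing here is rsyslog-specific):

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within
    `timeout` seconds -- roughly what `telnet ELB 5140` checks by hand."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Note this only proves the TCP handshake works; as the debug log shows,
a connection can still be closed immediately after being accepted.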
Any ideas? I will try to get more details using strace and/or tcpdump
(somewhat easier in this scenario since there is no other traffic), but
I would appreciate any help interpreting the debug logs; I'm not sure
I'm reading them correctly.
thanks!
erik
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE
THAT.