Keep trying to get simplest possible case of the mysterious silences experienced when there are no messages sent for some time. As far as I can tell the core (or at least as close as I got so far) is some kind of timeout related to Amazon Elastic Load Balancer (ELB) after which it takes long time for rsyslog to recover.

This is with rsyslog 7.5.6-0adiscon2, latest librelp 1.2.0-0adiscon2 from Adiscon repo (didn't use the one I built from git master since I am not 100% it's working properly and I think it's unrelated to what we looked at before which was why the all the traffic stopped, now I isolated the problem to what I think is a simpler, more focused scenario).

  So this extra simple test setup looks like this:

  - ELB-receiver and receiver host behind it

- sender that sends messages received on /dev/log to receiver via ELB-receiver using RELP

As long as the messages are flowing this works fine, but if there is a gap (5 min) something goes wrong and sender rsyslog cannot send messages anymore (seems like enough activity later fixes this but not sure why/how).

I ran both rsyslogs in debug mode (sudo rsyslogd -dn). I sent a message to verify that it works, waited for 5 minutes and sent another message that was never (waited for 40 minutes) received.

Receiver debug logs: https://drive.google.com/file/d/0B5zeQB_ZuLw0NlItcG80MHJGdlk/edit?usp=sharing

Sender debug logs: https://drive.google.com/file/d/0B5zeQB_ZuLw0SlNhWGxvaFp3bE0/edit?usp=sharing

All messages have tag loggerTest, there are only few of them so they are easy to find.

  Here are the crucial parts of the logs (as far as I can tell).

  Sender (total three messages sent to ELB-receiver)):

9352.348965411:7faa2f273700: frame to send: '2 syslog 364 <134>2013-12-10T01:35:52.336324+00:00 ip-10-143-190-98 loggerTest: @cee:{"message-text":" Tue Dec 10 01:35:52 UTC 2013","yummlyLogOrigin":{"supportLevel":"dev","system":"LOGS","cluster":"elbTest","role":"collector-2","host":"ip-10-143-190-98","tag":"loggerTest:","programname":"loggerTest","priority":"local0.info","timestamp":"2013-12-10T01:35:52.336324+00:00"}}

...

9377.184286552:7faa2f273700: frame to send: '3 syslog 364 <134>2013-12-10T01:36:17.182507+00:00 ip-10-143-190-98 loggerTest: @cee:{"message-text":" Tue Dec 10 01:36:17 UTC 2013","yummlyLogOrigin":{"supportLevel":"dev","system":"LOGS","cluster":"elbTest","role":"collector-2","host":"ip-10-143-190-98","tag":"loggerTest:","programname":"loggerTest","priority":"local0.info","timestamp":"2013-12-10T01:36:17.182507+00:00"}}

... 5 min sleep ...

9677.200615063:7faa2f273700: frame to send: '4 syslog 364 <134>2013-12-10T01:41:17.198130+00:00 ip-10-143-190-98 loggerTest: @cee:{"message-text":" Tue Dec 10 01:41:17 UTC 2013","yummlyLogOrigin":{"supportLevel":"dev","system":"LOGS","cluster":"elbTest","role":"collector-2","host":"ip-10-143-190-98","tag":"loggerTest:","programname":"loggerTest","priority":"local0.info","timestamp":"2013-12-10T01:41:17.198130+00:00"}}

  Receiver (only first two messages are seen):

9352.360516081:7f20ebdb5700: relp session read 378 octets, buf '2 syslog 364 <134>2013-12-10T01:35:52.336324+00:00 ip-10-143-190-98 loggerTest: @cee:{"message-text":" Tue Dec 10 01:35:52 UTC 2013","yummlyLogOrigin":{"supportLevel":"dev","system":"LOGS","cluster":"elbTest","role":"collector-2","host":"ip-10-143-190-98","tag":"loggerTest:","programname":"loggerTest","priority":"local0.info","timestamp":"2013-12-10T01:35:52.336324+00:00"}}

...

9377.196178103:7f20ebdb5700: relp session read 378 octets, buf '3 syslog 364 <134>2013-12-10T01:36:17.182507+00:00 ip-10-143-190-98 loggerTest: @cee:{"message-text":" Tue Dec 10 01:36:17 UTC 2013","yummlyLogOrigin":{"supportLevel":"dev","system":"LOGS","cluster":"elbTest","role":"collector-2","host":"ip-10-143-190-98","tag":"loggerTest:","programname":"loggerTest","priority":"local0.info","timestamp":"2013-12-10T01:36:17.182507+00:00"}}

So clearly the last messages sent after 5 min timeout was never seen be reciever. Here's what sender logs when it tries to send the third message:

9678.918894060:7faa2fa74700: librelp: done epoll_wait, nEvents:1
9678.919107278:7faa2fa74700: new connect on RELP socket #7
9678.919812242:7faa2fa74700: remote host is 'ip-10-2-61-142.ec2.internal', ip '10.2.61.142' 9678.919926630:7faa2fa74700: librelp: add socket 13 to epoll set (ptr 0x7faa20000cd0) 9678.920033052:7faa2fa74700: librelp: epoll_set_events sock 13, target bits 01, current 01
9678.920135420:7faa2fa74700: librelp: doing epoll_wait
9678.920230075:7faa2fa74700: librelp: done epoll_wait, nEvents:1
9678.920332413:7faa2fa74700: relp session read 0 octets, buf ''
9678.920427762:7faa2fa74700: server closed relp session 0x7faa20000970, session broken
9678.920523722:7faa2fa74700: relp session 13 iRet 10007, tearing it down
9678.920639556:7faa2fa74700: librelp: delete sock 13 from epoll set
9678.920739978:7faa2fa74700: hint-frame to send: '0 serverclose 0
'
9678.920858076:7faa2fa74700: in destructor: sendbuf 0x7faa20000d30
9678.920991517:7faa2fa74700: librelp: doing epoll_wait
9679.278263028:7faa2fa74700: librelp: done epoll_wait, nEvents:1
9679.278492564:7faa2fa74700: new connect on RELP socket #7
9679.279155908:7faa2fa74700: remote host is 'ip-10-2-61-142.ec2.internal', ip '10.2.61.142' 9679.279273086:7faa2fa74700: librelp: add socket 13 to epoll set (ptr 0x7faa20000cd0) 9679.279455323:7faa2fa74700: librelp: epoll_set_events sock 13, target bits 01, current 01
9679.279557743:7faa2fa74700: librelp: doing epoll_wait
9679.279656208:7faa2fa74700: librelp: done epoll_wait, nEvents:1
9679.279760706:7faa2fa74700: relp session read 0 octets, buf ''
9679.279854582:7faa2fa74700: server closed relp session 0x7faa20000970, session broken
9679.279952424:7faa2fa74700: relp session 13 iRet 10007, tearing it down
9679.280047300:7faa2fa74700: librelp: delete sock 13 from epoll set
9679.280135789:7faa2fa74700: hint-frame to send: '0 serverclose 0
'
9679.280223030:7faa2fa74700: in destructor: sendbuf 0x7faa20000d30
9679.280321888:7faa2fa74700: librelp: doing epoll_wait
9708.918019310:7faa2fa74700: librelp: done epoll_wait, nEvents:1
9708.918183627:7faa2fa74700: new connect on RELP socket #7
9708.919100708:7faa2fa74700: remote host is 'ip-10-2-61-142.ec2.internal', ip '10.2.61.142' 9708.919207876:7faa2fa74700: librelp: add socket 13 to epoll set (ptr 0x7faa20000cd0) 9708.919326358:7faa2fa74700: librelp: epoll_set_events sock 13, target bits 01, current 01
9708.919412125:7faa2fa74700: librelp: doing epoll_wait
9708.919478880:7faa2fa74700: librelp: done epoll_wait, nEvents:1
9708.919550546:7faa2fa74700: relp session read 0 octets, buf ''
9708.919615405:7faa2fa74700: server closed relp session 0x7faa20000970, session broken
9708.919680096:7faa2fa74700: relp session 13 iRet 10007, tearing it down
9708.919744131:7faa2fa74700: librelp: delete sock 13 from epoll set
9708.919813426:7faa2fa74700: hint-frame to send: '0 serverclose 0
'

that last part that repeats twice keeps repeating until rsyslog ends (ctrl-c 40 minutes later).

It's not clear from the logs whether it closes connection and opens another one. But during this time all the individual parts are working - both machines are fine, ELB thinks receiver is fine, all network connections are fine. So closing and opening a new connection should work.

Rainer already mentioned that rsyslog is closing and opening connection if there's any problem but then why would that fail? I can open a connection from the same machine to ELB (just using telnet ELB 5140) and the connection seems fine. So why would rsyslog not be able to open the connection?

Any ideas? I will try to get more details using strace and/or tcpdump (somewhat easier it this scenario since there is no other traffic) but I would appreciate any clarification of debug logs, not sure if I'm interpreting them correctly.

  thanks!

        erik
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.

Reply via email to