[pid 8503] 04:15:05.295163 recvfrom(173, "2 syslog 389 <134>2013-12-07T04:10:01.577241+00:00 ip-10-158-97-169 logBurst[11492]: @cee:{\"message-json\":{\"count\":100,\"START\":\"Sat Dec 7 04:10:01 2013\"},\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"logBurst[11492]:\",\"programname\":\"logBurst\",\"priority\":\"local0.info\",\"timestamp\":\"2013-12-07T04:10:01.577241+00:00\"}}\n\n3 syslog 381 <134>2013-12-07T04:10:01.577610+00:00 ip-10-158-97-169 logBurst[11492]: @cee:{\"message-json\":{\"now\":\"Sat Dec 7 04:10:01 2013\",\"i\":0},\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":\"prod\",\"role\":\"collectorErik\",\"host\":\"ip-10-158-97-169\",\"tag\":\"logBurst[11492]:\",\"programname\":\"logBurst\",\"priority\":\"local0.info\",\"timestamp\":\"2013-12-07T04:10:01.577610+00:00\"}}\n\n4 syslog 381 <134>2013-12-07T04:10:01.577868+00:00 ip-10-158-97-169 logBurst[11492]: @cee:{\"message-json\":{\"now\":\"Sat Dec 7 04:10:01 2013\",\"i\":1},\"yummlyLogOrigin\":{\"supportLevel\":\"prod\",\"system\":\"LOGS"..., 32768, MSG_DONTWAIT, NULL, NULL) = 32768
as you can see in the message it was actually sent at 4:10 but was received at 4:15 (the strave date). So I guess the total silence is just a worst case of the problem that happens much more often.
erik
On 12/06/2013 06:59 PM, Erik Steffl wrote:
getting some messages about suspend/resume in /var/log/syslog, but this time I don't see any periods of silence (sometime they disappear, not sure why). There was one silence at 2013-12-7 2:20 - 2:25 UTC time (all times in logs are UTC as well) Filtered out pstats messages, full syslog is at: https://drive.google.com/file/d/0B5zeQB_ZuLw0c010VnhfVkZLSWs/edit?usp=sharing Dec 7 01:55:01 ip-10-158-97-169 CRON[11194]: (ubuntu) CMD (/home/ubuntu/logBurst.pl --count 100) Dec 7 01:56:01 ip-10-158-97-169 rsyslogd-3003: action '(null)' suspended, next retry is Sat Dec 7 01:56:31 2013 [try http://www.rsyslog.com/e/3003 ] Dec 7 01:57:30 ip-10-158-97-169 rsyslogd-3003: action '(null)' suspended, next retry is Sat Dec 7 01:58:00 2013 [try http://www.rsyslog.com/e/3003 ] Dec 7 01:58:59 ip-10-158-97-169 rsyslogd-3003: action '(null)' suspended, next retry is Sat Dec 7 01:59:29 2013 [try http://www.rsyslog.com/e/3003 ] Dec 7 02:00:01 ip-10-158-97-169 CRON[11197]: (ubuntu) CMD (/home/ubuntu/logBurst.pl --count 100) Dec 7 02:00:28 ip-10-158-97-169 rsyslogd-3003: action '(null)' suspended, next retry is Sat Dec 7 02:00:58 2013 [try http://www.rsyslog.com/e/3003 ] Dec 7 02:01:57 ip-10-158-97-169 rsyslogd-3003: action '(null)' suspended, next retry is Sat Dec 7 02:02:27 2013 [try http://www.rsyslog.com/e/3003 ] Dec 7 02:03:26 ip-10-158-97-169 rsyslogd-3003: action '(null)' suspended, next retry is Sat Dec 7 02:03:56 2013 [try http://www.rsyslog.com/e/3003 ] Dec 7 02:03:56 ip-10-158-97-169 rsyslogd0: action '(null)' resumed [try http://www.rsyslog.com/e/0 ] Dec 7 02:05:01 ip-10-158-97-169 CRON[11200]: (ubuntu) CMD (/home/ubuntu/logBurst.pl --count 100) Dec 7 02:05:02 ip-10-158-97-169 rsyslogd0: action '(null)' resumed [try http://www.rsyslog.com/e/0 ] Dec 7 02:10:01 ip-10-158-97-169 CRON[11203]: (ubuntu) CMD (/home/ubuntu/logBurst.pl --count 100) Dec 7 02:15:01 ip-10-158-97-169 CRON[11207]: (ubuntu) CMD (/home/ubuntu/logBurst.pl --count 100) Dec 7 02:17:01 ip-10-158-97-169 CRON[11211]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly) Dec 7 02:20:01 ip-10-158-97-169 CRON[11346]: (ubuntu) CMD (/home/ubuntu/logBurst.pl --count 100) Dec 7 02:25:01 ip-10-158-97-169 CRON[11349]: (ubuntu) CMD (/home/ubuntu/logBurst.pl --count 100) Dec 7 02:25:54 ip-10-158-97-169 rsyslogd0: action '(null)' resumed [try http://www.rsyslog.com/e/0 ] Dec 7 02:30:01 ip-10-158-97-169 CRON[11352]: (ubuntu) CMD (/home/ubuntu/logBurst.pl --count 100) Dec 7 02:31:14 ip-10-158-97-169 rsyslogd0: action '(null)' resumed [try http://www.rsyslog.com/e/0 ] Dec 7 02:35:01 ip-10-158-97-169 CRON[11355]: (ubuntu) CMD (/home/ubuntu/logBurst.pl --count 100) the CRON logs saying that logBurst was ran show when the messages where sent from this host (collector-test) to another rsyslog host (collector-prod). Other than that nothing else should be sent between these two hosts. There's a steady traffic of incoming messages to this host that are being written to files. Let me know if I can provide more info or run different scenarios to help figure out what's going on. thanks! erik On 12/06/2013 01:04 AM, Rainer Gerhards wrote:On Fri, Dec 6, 2013 at 10:00 AM, Erik Steffl <[email protected]> wrote:On 12/05/2013 11:27 PM, Rainer Gerhards wrote:On Fri, Dec 6, 2013 at 8:17 AM, Erik Steffl <[email protected]> wrote: compiled it successfully but didn't try it yet, was trying to figureout what's going on with ELB and timeout. Will try it tomorrow (fri pacific time), what will I be looking for?sorry, don't understand that sentence.is the new version going to print additional logs to /var/log/syslog?yes, it will log messages when an action is suspended and when it is reactivated again. I think I also added stats counters for these events. But was already a couple of days (and > 100 commits) ago, so I don't know exactly out of my head.Or debug logs? Would these be triggered when there is timeout waiting for RELP response?It's triggered whenever the action decides it fails. That's exactly what I would like to know. RainerDo you think it will log anything interesting given the new information Ijust posted? Honestly, I haven't look deeply at it, because I need the otherinformation first. Don't waste time on trying to find out things if you don't have the basics...alright, will test with rsyslog I complied from git (master) and see what happens, thanks, erikRainer thanks!erik On 12/05/2013 10:34 PM, Rainer Gerhards wrote: Erik,have you tried the version I created for you? What does it report? Rainer On Fri, Dec 6, 2013 at 5:10 AM, Erik Steffl <[email protected]> wrote: at this point pretty sure it's related to Amazon Elastic Load Balancer(ELB) and some kind of timeout (it only happens if I use ELB and there is enough time between bursts of messages, e.g. 5 min triggers failure, 1 minute does not). Now I am back to original failure scenario (ELB and burst of 100 messages every 5 minutes). Here's output from lsof: ubuntu@ip-10-158-97-169:~$ sudo lsof | grep rsyslogd | grep TCP rsyslogd 9743 syslog 6u IPv4 7177271 0t0 TCP *:5140 (LISTEN) rsyslogd 9743 syslog 7u IPv6 7177272 0t0 TCP *:5140 (LISTEN) rsyslogd 9743 syslog 10u IPv4 7187191 0t0 TCP ip-10-158-97-169.ec2.internal:48176->ec2-50-19-250-187. compute-1.amazonaws.com:5140 (CLOSE_WAIT) rsyslogd 9743 syslog 12u IPv4 7183726 0t0 TCP ip-10-158-97-169.ec2.internal:5140->216.133.125.82:16543( ESTABLISHED) where: ip-10-158-97-169.ec2.internal is the collector-test, the host that is sending messages ec2-50-19-250-187.compute-1.amazonaws.com is ELB Strace shows some interesting stuff, even though I didn't catch the beginning of the problem this time (running strace with -f options so the first number on each line is thread pid): 9842 03:08:44.391025 sendto(10, "204 syslog 380 <134>2013-12-06T02:55:01.747931+00:00 ip-10-158-97-169 logBurst[9908]: @cee:{\"message-json\":{\"now\":\"Fri Dec 6 02:55:01 2013\",\"i\":99},\"yummlyLogOrigin\":{\" supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\": \"prod\",\"role\":\"collectorErik\",\"host\":\"ip- 10-158-97-169\",\"tag\":\"logBurst[9908]:\",\" programname\":\"logBurst\",\"priority\":\"local0.info\",\" timestamp\":\"2013-12-06T02:55:01.747931+00:00\"}}\n\n", 396, 0, NULL, 0 <unfinished ...> ... then nothing related to fd 10 until: 9842 03:10:32.676803 <... sendto resumed> ) = -1 ETIMEDOUT (Connection timed out) 9842 03:10:32.676887 gettimeofday({1386299432, 676918}, NULL) = 0 9842 03:10:32.676994 setsockopt(10, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0 9842 03:10:32.677101 close(10) = 0 So I assume that means the connection is bad, rsyslog figures it out and closes the connection. Then rsyslog looks up IP of ELB and opens another connection: 9842 03:10:32.687983 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 10 9842 03:10:32.688088 connect(10, {sa_family=AF_INET, sin_port=htons(5140), sin_addr=inet_addr("50.19.250.187")}, 16) = 0 9842 03:10:32.689592 clock_gettime(CLOCK_REALTIME, {1386299432, 689625848}) = 0 9842 03:10:32.689694 sendto(10, "1 open 85 relp_version=0\nrelp_software= librelp,1.2.0,http://librelp.adiscon.com\ncommands=syslog\n", 96, 0, NULL, 0) = 96 9842 03:10:32.689812 recvfrom(10, 0x7ffc58b53490, 32768, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) ... 9842 03:11:32.214608 recvfrom(10, "", 32768, MSG_DONTWAIT, NULL, NULL) = 0 ... 9750 03:12:16.994327 setsockopt(10, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8 <unfinished ...> 9750 03:12:16.994443 close(10 <unfinished ...> 9750 03:12:16.994597 <... close resumed> ) = 0 So it seems like rsyslog is trying to open a new RELP session but it's getting ETIMEDOUT. This is very weird because all involved hosts (sender, ELB, receiver) work fine at the same time, they continue to send and receive data, ELB thinks the receiver is healthy etc. Everything seems to be working fine except rsyslog is not able to open working connection to ELB (to recevier via ELB). Any ideas how to investigate this further? This seems to be a lot more specific then what I had before (just mysterious silence) so hopefully it makes more sense... thanks! erik On 11/30/2013 02:12 AM, Rainer Gerhards wrote: If the connection fails, rsyslog *immediately* tries to re-open, butthat seems to fail as well. Btw: git v7-devel and master branch niw contain the changes i made. Could you please try one of these versions. Sent from phone, thus brief. Am 30.11.2013 00:39 schrieb "Erik Steffl" <[email protected]>: On 11/29/2013 01:48 AM, Rainer Gerhards wrote:On Thu, Nov 28, 2013 at 2:07 AM, Erik Steffl <[email protected]> wrote:On 11/26/2013 07:06 AM, Pavel Levshin wrote:26.11.2013 1:20, Erik Steffl:what does the above mean exactly? It does make sense, in each casethe burst of messages gets in via /dev/log then is send out via RELP (either to the same machine or a different one). Any ideas why this doesn't fix itself until the next burst of messages? Any suggestions on what to do or what to investigate next? I guess I could run these with strace and see what exactly they say to each other (both the sender and receiver). I'm not sure why (and if) this is 'fixed' by the next burst of messages,but this can be somehow because next portion of messages pushes the queue. Perhaps it is unable to retry after suspend without a push. not sure if 'fixed' is the right word but it is always unstuck rightafter next (sometime next to next) burst of messages, never at random time. This behaviour is the same whether the period is 5 min or 15 min. While I have been silent, I followed the ML discussion (I did not checkthe debug log further, though, as Pavel did excellent work here). To me, it looks like "normal" suspension code is kicking in. If an action fails, rsyslog retries once (except otherwise configured) and if it fails again, the action is initially suspended for 30 seconds. Then, retries happen and the suspension period is prolonged if they fail. It looks very much like this is the mechanism at work. However, what I don't understand is why the suspension period is so long. Out of all this, I think it would make much sense if rsyslog had the capability to report when an action is suspended and when it is resumed. I am right now adding this capability. I would suggest that when this change is ready, you apply it and we can than see what it reports (much easier than walking the debug log, and very obvious to users ;)). as I mentioned I tried to do two changes to the test scenario andtested each of these separately: - send 200 message burst directly from collector-test to collector-prod (RELP), i.e. no load balancer - keep using the load balancer but in addition to 200 messages curst every 5 min also send few (3) messages every minute Each of these scenarios work, as in traffic is smooth, NO silences. This means that something goes wrong if we use load balancer and there is no traffic for 5 minutes. From our previous investigation of amazon elastic load balancer (which is what we're using) it often lies about the connection, i.e. it has no connection on the backend but it happily accepts connects and data and pretends everything is fine (that's why we initially switched from plain TCP to RELP). Not entirely sure what the load balancer is doing in this case but it seems that rsyslog thinks the connection is fine and keeps sending data to the load balancer but connection is actually broken (maybe the load balancer closed the connection between itself and the destination because of some timeout). So the situation is not fixed until rsyslog closes the connection and opens a new one. Is there any way for us to either make rsyslog keep the connection alive or to re-open it sooner? erik _______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT. _______________________________________________ rsyslog mailing listhttp://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT. _______________________________________________rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT. _______________________________________________rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT. _______________________________________________rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT. _______________________________________________rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT._______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT._______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT.
_______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT.

