On Fri, Dec 6, 2013 at 8:17 AM, Erik Steffl <[email protected]> wrote: > compiled it successfully but didn't try it yet, was trying to figure out > what's going on with ELB and timeout. > > Will try it tomorrow (fri pacific time), what will I be looking for?
sorry, don't understand that sentence. > Do you think it will log anything interesting given the new information I > just posted? > > Honestly, I haven't look deeply at it, because I need the other information first. Don't waste time on trying to find out things if you don't have the basics... Rainer > thanks! > > erik > > > On 12/05/2013 10:34 PM, Rainer Gerhards wrote: > >> Erik, >> >> have you tried the version I created for you? What does it report? >> >> Rainer >> >> >> On Fri, Dec 6, 2013 at 5:10 AM, Erik Steffl <[email protected]> wrote: >> >> at this point pretty sure it's related to Amazon Elastic Load Balancer >>> (ELB) and some kind of timeout (it only happens if I use ELB and there is >>> enough time between bursts of messages, e.g. 5 min triggers failure, 1 >>> minute does not). >>> >>> Now I am back to original failure scenario (ELB and burst of 100 >>> messages every 5 minutes). >>> >>> Here's output from lsof: >>> >>> ubuntu@ip-10-158-97-169:~$ sudo lsof | grep rsyslogd | grep TCP >>> rsyslogd 9743 syslog 6u IPv4 7177271 0t0 >>> TCP *:5140 (LISTEN) >>> rsyslogd 9743 syslog 7u IPv6 7177272 0t0 >>> TCP *:5140 (LISTEN) >>> rsyslogd 9743 syslog 10u IPv4 7187191 0t0 >>> TCP ip-10-158-97-169.ec2.internal:48176->ec2-50-19-250-187. >>> compute-1.amazonaws.com:5140 (CLOSE_WAIT) >>> rsyslogd 9743 syslog 12u IPv4 7183726 0t0 >>> TCP ip-10-158-97-169.ec2.internal:5140->216.133.125.82:16543( >>> ESTABLISHED) >>> >>> where: >>> >>> ip-10-158-97-169.ec2.internal is the collector-test, the host that is >>> sending messages >>> >>> ec2-50-19-250-187.compute-1.amazonaws.com is ELB >>> >>> Strace shows some interesting stuff, even though I didn't catch the >>> beginning of the problem this time (running strace with -f options so the >>> first number on each line is thread pid): >>> >>> 9842 03:08:44.391025 sendto(10, "204 syslog 380 >>> <134>2013-12-06T02:55:01.747931+00:00 >>> ip-10-158-97-169 logBurst[9908]: @cee:{\"message-json\":{\"now\":\"Fri >>> Dec 6 02:55:01 2013\",\"i\":99},\"yummlyLogOrigin\":{\" >>> supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\": >>> \"prod\",\"role\":\"collectorErik\",\"host\":\"ip- >>> 10-158-97-169\",\"tag\":\"logBurst[9908]:\",\" >>> programname\":\"logBurst\",\"priority\":\"local0.info\",\" >>> timestamp\":\"2013-12-06T02:55:01.747931+00:00\"}}\n\n", 396, 0, NULL, 0 >>> <unfinished ...> >>> >>> ... then nothing related to fd 10 until: >>> >>> 9842 03:10:32.676803 <... sendto resumed> ) = -1 ETIMEDOUT (Connection >>> timed out) >>> 9842 03:10:32.676887 gettimeofday({1386299432, 676918}, NULL) = 0 >>> 9842 03:10:32.676994 setsockopt(10, SOL_SOCKET, SO_LINGER, {onoff=1, >>> linger=0}, 8) = 0 >>> 9842 03:10:32.677101 close(10) = 0 >>> >>> So I assume that means the connection is bad, rsyslog figures it out >>> and >>> closes the connection. >>> >>> Then rsyslog looks up IP of ELB and opens another connection: >>> >>> 9842 03:10:32.687983 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 10 >>> 9842 03:10:32.688088 connect(10, {sa_family=AF_INET, >>> sin_port=htons(5140), sin_addr=inet_addr("50.19.250.187")}, 16) = 0 >>> 9842 03:10:32.689592 clock_gettime(CLOCK_REALTIME, {1386299432, >>> 689625848}) = 0 >>> 9842 03:10:32.689694 sendto(10, "1 open 85 >>> relp_version=0\nrelp_software= >>> librelp,1.2.0,http://librelp.adiscon.com\ncommands=syslog\n", 96, 0, >>> NULL, 0) = 96 >>> 9842 03:10:32.689812 recvfrom(10, 0x7ffc58b53490, 32768, 64, 0, 0) = -1 >>> EAGAIN (Resource temporarily unavailable) >>> ... >>> 9842 03:11:32.214608 recvfrom(10, "", 32768, MSG_DONTWAIT, NULL, NULL) >>> = 0 >>> ... >>> 9750 03:12:16.994327 setsockopt(10, SOL_SOCKET, SO_LINGER, {onoff=1, >>> linger=0}, 8 <unfinished ...> >>> 9750 03:12:16.994443 close(10 <unfinished ...> >>> 9750 03:12:16.994597 <... close resumed> ) = 0 >>> >>> So it seems like rsyslog is trying to open a new RELP session but it's >>> getting ETIMEDOUT. This is very weird because all involved hosts (sender, >>> ELB, receiver) work fine at the same time, they continue to send and >>> receive data, ELB thinks the receiver is healthy etc. Everything seems to >>> be working fine except rsyslog is not able to open working connection to >>> ELB (to recevier via ELB). >>> >>> Any ideas how to investigate this further? This seems to be a lot more >>> specific then what I had before (just mysterious silence) so hopefully it >>> makes more sense... >>> >>> thanks! >>> >>> erik >>> >>> >>> On 11/30/2013 02:12 AM, Rainer Gerhards wrote: >>> >>> If the connection fails, rsyslog *immediately* tries to re-open, but >>>> that >>>> seems to fail as well. >>>> >>>> Btw: git v7-devel and master branch niw contain the changes i made. >>>> Could >>>> you please try one of these versions. >>>> >>>> Sent from phone, thus brief. >>>> Am 30.11.2013 00:39 schrieb "Erik Steffl" <[email protected]>: >>>> >>>> On 11/29/2013 01:48 AM, Rainer Gerhards wrote: >>>> >>>>> >>>>> On Thu, Nov 28, 2013 at 2:07 AM, Erik Steffl <[email protected]> >>>>> wrote: >>>>> >>>>>> >>>>>> On 11/26/2013 07:06 AM, Pavel Levshin wrote: >>>>>> >>>>>> >>>>>>> >>>>>>> 26.11.2013 1:20, Erik Steffl: >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> what does the above mean exactly? It does make sense, in each >>>>>>>> >>>>>>>>> case >>>>>>>>> the burst of messages gets in via /dev/log then is send out via >>>>>>>>> RELP >>>>>>>>> (either to the same machine or a different one). >>>>>>>>> >>>>>>>>> Any ideas why this doesn't fix itself until the next burst of >>>>>>>>> messages? Any suggestions on what to do or what to investigate >>>>>>>>> next? >>>>>>>>> I >>>>>>>>> guess I could run these with strace and see what exactly they say >>>>>>>>> to >>>>>>>>> each other (both the sender and receiver). >>>>>>>>> >>>>>>>>> >>>>>>>>> I'm not sure why (and if) this is 'fixed' by the next burst of >>>>>>>>> >>>>>>>>> messages, >>>>>>>> but this can be somehow because next portion of messages pushes the >>>>>>>> queue. Perhaps it is unable to retry after suspend without a push. >>>>>>>> >>>>>>>> >>>>>>>> not sure if 'fixed' is the right word but it is always unstuck >>>>>>>> >>>>>>> right >>>>>>> after next (sometime next to next) burst of messages, never at random >>>>>>> time. >>>>>>> This behaviour is the same whether the period is 5 min or 15 min. >>>>>>> >>>>>>> >>>>>>> While I have been silent, I followed the ML discussion (I did not >>>>>>> >>>>>> check >>>>>> the >>>>>> debug log further, though, as Pavel did excellent work here). To me, >>>>>> it >>>>>> looks like "normal" suspension code is kicking in. If an action fails, >>>>>> rsyslog retries once (except otherwise configured) and if it fails >>>>>> again, >>>>>> the action is initially suspended for 30 seconds. Then, retries happen >>>>>> and >>>>>> the suspension period is prolonged if they fail. >>>>>> >>>>>> It looks very much like this is the mechanism at work. However, what I >>>>>> don't understand is why the suspension period is so long. >>>>>> >>>>>> Out of all this, I think it would make much sense if rsyslog had the >>>>>> capability to report when an action is suspended and when it is >>>>>> resumed. I >>>>>> am right now adding this capability. I would suggest that when this >>>>>> change >>>>>> is ready, you apply it and we can than see what it reports (much >>>>>> easier >>>>>> than walking the debug log, and very obvious to users ;)). >>>>>> >>>>>> >>>>>> as I mentioned I tried to do two changes to the test scenario and >>>>> tested >>>>> each of these separately: >>>>> >>>>> - send 200 message burst directly from collector-test to >>>>> collector-prod >>>>> (RELP), i.e. no load balancer >>>>> >>>>> - keep using the load balancer but in addition to 200 messages >>>>> curst >>>>> every 5 min also send few (3) messages every minute >>>>> >>>>> Each of these scenarios work, as in traffic is smooth, NO silences. >>>>> This >>>>> means that something goes wrong if we use load balancer and there is no >>>>> traffic for 5 minutes. >>>>> >>>>> From our previous investigation of amazon elastic load balancer >>>>> (which >>>>> is what we're using) it often lies about the connection, i.e. it has no >>>>> connection on the backend but it happily accepts connects and data and >>>>> pretends everything is fine (that's why we initially switched from >>>>> plain >>>>> TCP to RELP). >>>>> >>>>> Not entirely sure what the load balancer is doing in this case but >>>>> it >>>>> seems that rsyslog thinks the connection is fine and keeps sending data >>>>> to >>>>> the load balancer but connection is actually broken (maybe the load >>>>> balancer closed the connection between itself and the destination >>>>> because >>>>> of some timeout). >>>>> >>>>> So the situation is not fixed until rsyslog closes the connection >>>>> and >>>>> opens a new one. >>>>> >>>>> Is there any way for us to either make rsyslog keep the connection >>>>> alive >>>>> or to re-open it sooner? >>>>> >>>>> erik >>>>> >>>>> _______________________________________________ >>>>> rsyslog mailing list >>>>> http://lists.adiscon.net/mailman/listinfo/rsyslog >>>>> http://www.rsyslog.com/professional-services/ >>>>> What's up with rsyslog? Follow https://twitter.com/rgerhards >>>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a >>>>> myriad >>>>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you >>>>> DON'T LIKE THAT. >>>>> >>>>> _______________________________________________ >>>>> >>>> rsyslog mailing list >>>> http://lists.adiscon.net/mailman/listinfo/rsyslog >>>> http://www.rsyslog.com/professional-services/ >>>> What's up with rsyslog? Follow https://twitter.com/rgerhards >>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad >>>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you >>>> DON'T LIKE THAT. >>>> >>>> >>>> _______________________________________________ >>> rsyslog mailing list >>> http://lists.adiscon.net/mailman/listinfo/rsyslog >>> http://www.rsyslog.com/professional-services/ >>> What's up with rsyslog? Follow https://twitter.com/rgerhards >>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad >>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you >>> DON'T LIKE THAT. >>> >>> _______________________________________________ >> rsyslog mailing list >> http://lists.adiscon.net/mailman/listinfo/rsyslog >> http://www.rsyslog.com/professional-services/ >> What's up with rsyslog? Follow https://twitter.com/rgerhards >> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad >> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you >> DON'T LIKE THAT. >> >> > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com/professional-services/ > What's up with rsyslog? Follow https://twitter.com/rgerhards > NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad > of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you > DON'T LIKE THAT. > _______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT.

