On Fri, Dec 6, 2013 at 10:00 AM, Erik Steffl <[email protected]> wrote:
> On 12/05/2013 11:27 PM, Rainer Gerhards wrote: > >> On Fri, Dec 6, 2013 at 8:17 AM, Erik Steffl <[email protected]> wrote: >> >> compiled it successfully but didn't try it yet, was trying to figure >>> out >>> what's going on with ELB and timeout. >>> >>> Will try it tomorrow (fri pacific time), what will I be looking for? >>> >> >> >> sorry, don't understand that sentence. >> > > is the new version going to print additional logs to /var/log/syslog? yes, it will log messages when an action is suspended and when it is reactivated again. I think I also added stats counters for these events. But was already a couple of days (and > 100 commits) ago, so I don't know exactly out of my head. > Or debug logs? Would these be triggered when there is timeout waiting for > RELP response? It's triggered whenever the action decides it fails. That's exactly what I would like to know. Rainer > > Do you think it will log anything interesting given the new information I >>> just posted? >>> >>> >>> Honestly, I haven't look deeply at it, because I need the other >> information >> first. Don't waste time on trying to find out things if you don't have the >> basics... >> > > alright, will test with rsyslog I complied from git (master) and see > what happens, > > thanks, > > erik > > > >> Rainer >> >> >> thanks! >>> >>> erik >>> >>> >>> On 12/05/2013 10:34 PM, Rainer Gerhards wrote: >>> >>> Erik, >>>> >>>> have you tried the version I created for you? What does it report? >>>> >>>> Rainer >>>> >>>> >>>> On Fri, Dec 6, 2013 at 5:10 AM, Erik Steffl <[email protected]> wrote: >>>> >>>> at this point pretty sure it's related to Amazon Elastic Load >>>> Balancer >>>> >>>>> (ELB) and some kind of timeout (it only happens if I use ELB and there >>>>> is >>>>> enough time between bursts of messages, e.g. 5 min triggers failure, 1 >>>>> minute does not). >>>>> >>>>> Now I am back to original failure scenario (ELB and burst of 100 >>>>> messages every 5 minutes). >>>>> >>>>> Here's output from lsof: >>>>> >>>>> ubuntu@ip-10-158-97-169:~$ sudo lsof | grep rsyslogd | grep TCP >>>>> rsyslogd 9743 syslog 6u IPv4 7177271 0t0 >>>>> TCP *:5140 (LISTEN) >>>>> rsyslogd 9743 syslog 7u IPv6 7177272 0t0 >>>>> TCP *:5140 (LISTEN) >>>>> rsyslogd 9743 syslog 10u IPv4 7187191 0t0 >>>>> TCP ip-10-158-97-169.ec2.internal:48176->ec2-50-19-250-187. >>>>> compute-1.amazonaws.com:5140 (CLOSE_WAIT) >>>>> rsyslogd 9743 syslog 12u IPv4 7183726 0t0 >>>>> TCP ip-10-158-97-169.ec2.internal:5140->216.133.125.82:16543( >>>>> ESTABLISHED) >>>>> >>>>> where: >>>>> >>>>> ip-10-158-97-169.ec2.internal is the collector-test, the host that is >>>>> sending messages >>>>> >>>>> ec2-50-19-250-187.compute-1.amazonaws.com is ELB >>>>> >>>>> Strace shows some interesting stuff, even though I didn't catch the >>>>> beginning of the problem this time (running strace with -f options so >>>>> the >>>>> first number on each line is thread pid): >>>>> >>>>> 9842 03:08:44.391025 sendto(10, "204 syslog 380 >>>>> <134>2013-12-06T02:55:01.747931+00:00 >>>>> ip-10-158-97-169 logBurst[9908]: @cee:{\"message-json\":{\"now\":\"Fri >>>>> Dec 6 02:55:01 2013\",\"i\":99},\"yummlyLogOrigin\":{\" >>>>> supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\": >>>>> \"prod\",\"role\":\"collectorErik\",\"host\":\"ip- >>>>> 10-158-97-169\",\"tag\":\"logBurst[9908]:\",\" >>>>> programname\":\"logBurst\",\"priority\":\"local0.info\",\" >>>>> timestamp\":\"2013-12-06T02:55:01.747931+00:00\"}}\n\n", 396, 0, >>>>> NULL, 0 >>>>> <unfinished ...> >>>>> >>>>> ... then nothing related to fd 10 until: >>>>> >>>>> 9842 03:10:32.676803 <... sendto resumed> ) = -1 ETIMEDOUT (Connection >>>>> timed out) >>>>> 9842 03:10:32.676887 gettimeofday({1386299432, 676918}, NULL) = 0 >>>>> 9842 03:10:32.676994 setsockopt(10, SOL_SOCKET, SO_LINGER, {onoff=1, >>>>> linger=0}, 8) = 0 >>>>> 9842 03:10:32.677101 close(10) = 0 >>>>> >>>>> So I assume that means the connection is bad, rsyslog figures it >>>>> out >>>>> and >>>>> closes the connection. >>>>> >>>>> Then rsyslog looks up IP of ELB and opens another connection: >>>>> >>>>> 9842 03:10:32.687983 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 10 >>>>> 9842 03:10:32.688088 connect(10, {sa_family=AF_INET, >>>>> sin_port=htons(5140), sin_addr=inet_addr("50.19.250.187")}, 16) = 0 >>>>> 9842 03:10:32.689592 clock_gettime(CLOCK_REALTIME, {1386299432, >>>>> 689625848}) = 0 >>>>> 9842 03:10:32.689694 sendto(10, "1 open 85 >>>>> relp_version=0\nrelp_software= >>>>> librelp,1.2.0,http://librelp.adiscon.com\ncommands=syslog\n", 96, 0, >>>>> NULL, 0) = 96 >>>>> 9842 03:10:32.689812 recvfrom(10, 0x7ffc58b53490, 32768, 64, 0, 0) = >>>>> -1 >>>>> EAGAIN (Resource temporarily unavailable) >>>>> ... >>>>> 9842 03:11:32.214608 recvfrom(10, "", 32768, MSG_DONTWAIT, NULL, NULL) >>>>> = 0 >>>>> ... >>>>> 9750 03:12:16.994327 setsockopt(10, SOL_SOCKET, SO_LINGER, {onoff=1, >>>>> linger=0}, 8 <unfinished ...> >>>>> 9750 03:12:16.994443 close(10 <unfinished ...> >>>>> 9750 03:12:16.994597 <... close resumed> ) = 0 >>>>> >>>>> So it seems like rsyslog is trying to open a new RELP session but >>>>> it's >>>>> getting ETIMEDOUT. This is very weird because all involved hosts >>>>> (sender, >>>>> ELB, receiver) work fine at the same time, they continue to send and >>>>> receive data, ELB thinks the receiver is healthy etc. Everything seems >>>>> to >>>>> be working fine except rsyslog is not able to open working connection >>>>> to >>>>> ELB (to recevier via ELB). >>>>> >>>>> Any ideas how to investigate this further? This seems to be a lot >>>>> more >>>>> specific then what I had before (just mysterious silence) so hopefully >>>>> it >>>>> makes more sense... >>>>> >>>>> thanks! >>>>> >>>>> erik >>>>> >>>>> >>>>> On 11/30/2013 02:12 AM, Rainer Gerhards wrote: >>>>> >>>>> If the connection fails, rsyslog *immediately* tries to re-open, but >>>>> >>>>>> that >>>>>> seems to fail as well. >>>>>> >>>>>> Btw: git v7-devel and master branch niw contain the changes i made. >>>>>> Could >>>>>> you please try one of these versions. >>>>>> >>>>>> Sent from phone, thus brief. >>>>>> Am 30.11.2013 00:39 schrieb "Erik Steffl" <[email protected]>: >>>>>> >>>>>> On 11/29/2013 01:48 AM, Rainer Gerhards wrote: >>>>>> >>>>>> >>>>>>> On Thu, Nov 28, 2013 at 2:07 AM, Erik Steffl <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>> >>>>>>>> On 11/26/2013 07:06 AM, Pavel Levshin wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> 26.11.2013 1:20, Erik Steffl: >>>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>>> what does the above mean exactly? It does make sense, in >>>>>>>>>> each >>>>>>>>>> >>>>>>>>>> case >>>>>>>>>>> the burst of messages gets in via /dev/log then is send out via >>>>>>>>>>> RELP >>>>>>>>>>> (either to the same machine or a different one). >>>>>>>>>>> >>>>>>>>>>> Any ideas why this doesn't fix itself until the next burst >>>>>>>>>>> of >>>>>>>>>>> messages? Any suggestions on what to do or what to investigate >>>>>>>>>>> next? >>>>>>>>>>> I >>>>>>>>>>> guess I could run these with strace and see what exactly they say >>>>>>>>>>> to >>>>>>>>>>> each other (both the sender and receiver). >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I'm not sure why (and if) this is 'fixed' by the next burst >>>>>>>>>>> of >>>>>>>>>>> >>>>>>>>>>> messages, >>>>>>>>>>> >>>>>>>>>> but this can be somehow because next portion of messages pushes >>>>>>>>>> the >>>>>>>>>> queue. Perhaps it is unable to retry after suspend without a push. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> not sure if 'fixed' is the right word but it is always >>>>>>>>>> unstuck >>>>>>>>>> >>>>>>>>>> right >>>>>>>>> after next (sometime next to next) burst of messages, never at >>>>>>>>> random >>>>>>>>> time. >>>>>>>>> This behaviour is the same whether the period is 5 min or 15 min. >>>>>>>>> >>>>>>>>> >>>>>>>>> While I have been silent, I followed the ML discussion (I did >>>>>>>>> not >>>>>>>>> >>>>>>>>> check >>>>>>>> the >>>>>>>> debug log further, though, as Pavel did excellent work here). To me, >>>>>>>> it >>>>>>>> looks like "normal" suspension code is kicking in. If an action >>>>>>>> fails, >>>>>>>> rsyslog retries once (except otherwise configured) and if it fails >>>>>>>> again, >>>>>>>> the action is initially suspended for 30 seconds. Then, retries >>>>>>>> happen >>>>>>>> and >>>>>>>> the suspension period is prolonged if they fail. >>>>>>>> >>>>>>>> It looks very much like this is the mechanism at work. However, >>>>>>>> what I >>>>>>>> don't understand is why the suspension period is so long. >>>>>>>> >>>>>>>> Out of all this, I think it would make much sense if rsyslog had the >>>>>>>> capability to report when an action is suspended and when it is >>>>>>>> resumed. I >>>>>>>> am right now adding this capability. I would suggest that when this >>>>>>>> change >>>>>>>> is ready, you apply it and we can than see what it reports (much >>>>>>>> easier >>>>>>>> than walking the debug log, and very obvious to users ;)). >>>>>>>> >>>>>>>> >>>>>>>> as I mentioned I tried to do two changes to the test scenario >>>>>>>> and >>>>>>>> >>>>>>> tested >>>>>>> each of these separately: >>>>>>> >>>>>>> - send 200 message burst directly from collector-test to >>>>>>> collector-prod >>>>>>> (RELP), i.e. no load balancer >>>>>>> >>>>>>> - keep using the load balancer but in addition to 200 messages >>>>>>> curst >>>>>>> every 5 min also send few (3) messages every minute >>>>>>> >>>>>>> Each of these scenarios work, as in traffic is smooth, NO >>>>>>> silences. >>>>>>> This >>>>>>> means that something goes wrong if we use load balancer and there is >>>>>>> no >>>>>>> traffic for 5 minutes. >>>>>>> >>>>>>> From our previous investigation of amazon elastic load balancer >>>>>>> (which >>>>>>> is what we're using) it often lies about the connection, i.e. it has >>>>>>> no >>>>>>> connection on the backend but it happily accepts connects and data >>>>>>> and >>>>>>> pretends everything is fine (that's why we initially switched from >>>>>>> plain >>>>>>> TCP to RELP). >>>>>>> >>>>>>> Not entirely sure what the load balancer is doing in this case >>>>>>> but >>>>>>> it >>>>>>> seems that rsyslog thinks the connection is fine and keeps sending >>>>>>> data >>>>>>> to >>>>>>> the load balancer but connection is actually broken (maybe the load >>>>>>> balancer closed the connection between itself and the destination >>>>>>> because >>>>>>> of some timeout). >>>>>>> >>>>>>> So the situation is not fixed until rsyslog closes the >>>>>>> connection >>>>>>> and >>>>>>> opens a new one. >>>>>>> >>>>>>> Is there any way for us to either make rsyslog keep the >>>>>>> connection >>>>>>> alive >>>>>>> or to re-open it sooner? >>>>>>> >>>>>>> erik >>>>>>> >>>>>>> _______________________________________________ >>>>>>> rsyslog mailing list >>>>>>> http://lists.adiscon.net/mailman/listinfo/rsyslog >>>>>>> http://www.rsyslog.com/professional-services/ >>>>>>> What's up with rsyslog? Follow https://twitter.com/rgerhards >>>>>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a >>>>>>> myriad >>>>>>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if >>>>>>> you >>>>>>> DON'T LIKE THAT. >>>>>>> >>>>>>> _______________________________________________ >>>>>>> >>>>>>> rsyslog mailing list >>>>>> http://lists.adiscon.net/mailman/listinfo/rsyslog >>>>>> http://www.rsyslog.com/professional-services/ >>>>>> What's up with rsyslog? Follow https://twitter.com/rgerhards >>>>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a >>>>>> myriad >>>>>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you >>>>>> DON'T LIKE THAT. >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> >>>>> rsyslog mailing list >>>>> http://lists.adiscon.net/mailman/listinfo/rsyslog >>>>> http://www.rsyslog.com/professional-services/ >>>>> What's up with rsyslog? Follow https://twitter.com/rgerhards >>>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a >>>>> myriad >>>>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you >>>>> DON'T LIKE THAT. >>>>> >>>>> _______________________________________________ >>>>> >>>> rsyslog mailing list >>>> http://lists.adiscon.net/mailman/listinfo/rsyslog >>>> http://www.rsyslog.com/professional-services/ >>>> What's up with rsyslog? Follow https://twitter.com/rgerhards >>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad >>>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you >>>> DON'T LIKE THAT. >>>> >>>> >>>> _______________________________________________ >>> rsyslog mailing list >>> http://lists.adiscon.net/mailman/listinfo/rsyslog >>> http://www.rsyslog.com/professional-services/ >>> What's up with rsyslog? Follow https://twitter.com/rgerhards >>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad >>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you >>> DON'T LIKE THAT. >>> >>> _______________________________________________ >> rsyslog mailing list >> http://lists.adiscon.net/mailman/listinfo/rsyslog >> http://www.rsyslog.com/professional-services/ >> What's up with rsyslog? Follow https://twitter.com/rgerhards >> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad >> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you >> DON'T LIKE THAT. >> >> > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com/professional-services/ > What's up with rsyslog? Follow https://twitter.com/rgerhards > NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad > of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you > DON'T LIKE THAT. > _______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT.

