Erik, have you tried the version I created for you? What does it report?
Rainer On Fri, Dec 6, 2013 at 5:10 AM, Erik Steffl <[email protected]> wrote: > at this point pretty sure it's related to Amazon Elastic Load Balancer > (ELB) and some kind of timeout (it only happens if I use ELB and there is > enough time between bursts of messages, e.g. 5 min triggers failure, 1 > minute does not). > > Now I am back to original failure scenario (ELB and burst of 100 > messages every 5 minutes). > > Here's output from lsof: > > ubuntu@ip-10-158-97-169:~$ sudo lsof | grep rsyslogd | grep TCP > rsyslogd 9743 syslog 6u IPv4 7177271 0t0 > TCP *:5140 (LISTEN) > rsyslogd 9743 syslog 7u IPv6 7177272 0t0 > TCP *:5140 (LISTEN) > rsyslogd 9743 syslog 10u IPv4 7187191 0t0 > TCP ip-10-158-97-169.ec2.internal:48176->ec2-50-19-250-187. > compute-1.amazonaws.com:5140 (CLOSE_WAIT) > rsyslogd 9743 syslog 12u IPv4 7183726 0t0 > TCP ip-10-158-97-169.ec2.internal:5140->216.133.125.82:16543(ESTABLISHED) > > where: > > ip-10-158-97-169.ec2.internal is the collector-test, the host that is > sending messages > > ec2-50-19-250-187.compute-1.amazonaws.com is ELB > > Strace shows some interesting stuff, even though I didn't catch the > beginning of the problem this time (running strace with -f options so the > first number on each line is thread pid): > > 9842 03:08:44.391025 sendto(10, "204 syslog 380 > <134>2013-12-06T02:55:01.747931+00:00 > ip-10-158-97-169 logBurst[9908]: @cee:{\"message-json\":{\"now\":\"Fri > Dec 6 02:55:01 2013\",\"i\":99},\"yummlyLogOrigin\":{\" > supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\": > \"prod\",\"role\":\"collectorErik\",\"host\":\"ip- > 10-158-97-169\",\"tag\":\"logBurst[9908]:\",\" > programname\":\"logBurst\",\"priority\":\"local0.info\",\" > timestamp\":\"2013-12-06T02:55:01.747931+00:00\"}}\n\n", 396, 0, NULL, 0 > <unfinished ...> > > ... then nothing related to fd 10 until: > > 9842 03:10:32.676803 <... sendto resumed> ) = -1 ETIMEDOUT (Connection > timed out) > 9842 03:10:32.676887 gettimeofday({1386299432, 676918}, NULL) = 0 > 9842 03:10:32.676994 setsockopt(10, SOL_SOCKET, SO_LINGER, {onoff=1, > linger=0}, 8) = 0 > 9842 03:10:32.677101 close(10) = 0 > > So I assume that means the connection is bad, rsyslog figures it out and > closes the connection. > > Then rsyslog looks up IP of ELB and opens another connection: > > 9842 03:10:32.687983 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 10 > 9842 03:10:32.688088 connect(10, {sa_family=AF_INET, > sin_port=htons(5140), sin_addr=inet_addr("50.19.250.187")}, 16) = 0 > 9842 03:10:32.689592 clock_gettime(CLOCK_REALTIME, {1386299432, > 689625848}) = 0 > 9842 03:10:32.689694 sendto(10, "1 open 85 relp_version=0\nrelp_software= > librelp,1.2.0,http://librelp.adiscon.com\ncommands=syslog\n", 96, 0, > NULL, 0) = 96 > 9842 03:10:32.689812 recvfrom(10, 0x7ffc58b53490, 32768, 64, 0, 0) = -1 > EAGAIN (Resource temporarily unavailable) > ... > 9842 03:11:32.214608 recvfrom(10, "", 32768, MSG_DONTWAIT, NULL, NULL) = 0 > ... > 9750 03:12:16.994327 setsockopt(10, SOL_SOCKET, SO_LINGER, {onoff=1, > linger=0}, 8 <unfinished ...> > 9750 03:12:16.994443 close(10 <unfinished ...> > 9750 03:12:16.994597 <... close resumed> ) = 0 > > So it seems like rsyslog is trying to open a new RELP session but it's > getting ETIMEDOUT. This is very weird because all involved hosts (sender, > ELB, receiver) work fine at the same time, they continue to send and > receive data, ELB thinks the receiver is healthy etc. Everything seems to > be working fine except rsyslog is not able to open working connection to > ELB (to recevier via ELB). > > Any ideas how to investigate this further? This seems to be a lot more > specific then what I had before (just mysterious silence) so hopefully it > makes more sense... > > thanks! > > erik > > > On 11/30/2013 02:12 AM, Rainer Gerhards wrote: > >> If the connection fails, rsyslog *immediately* tries to re-open, but that >> seems to fail as well. >> >> Btw: git v7-devel and master branch niw contain the changes i made. Could >> you please try one of these versions. >> >> Sent from phone, thus brief. >> Am 30.11.2013 00:39 schrieb "Erik Steffl" <[email protected]>: >> >> On 11/29/2013 01:48 AM, Rainer Gerhards wrote: >>> >>> On Thu, Nov 28, 2013 at 2:07 AM, Erik Steffl <[email protected]> wrote: >>>> >>>> On 11/26/2013 07:06 AM, Pavel Levshin wrote: >>>> >>>>> >>>>> >>>>> 26.11.2013 1:20, Erik Steffl: >>>>>> >>>>>> >>>>>> what does the above mean exactly? It does make sense, in each >>>>>>> case >>>>>>> the burst of messages gets in via /dev/log then is send out via RELP >>>>>>> (either to the same machine or a different one). >>>>>>> >>>>>>> Any ideas why this doesn't fix itself until the next burst of >>>>>>> messages? Any suggestions on what to do or what to investigate next? >>>>>>> I >>>>>>> guess I could run these with strace and see what exactly they say to >>>>>>> each other (both the sender and receiver). >>>>>>> >>>>>>> >>>>>>> I'm not sure why (and if) this is 'fixed' by the next burst of >>>>>>> >>>>>> messages, >>>>>> but this can be somehow because next portion of messages pushes the >>>>>> queue. Perhaps it is unable to retry after suspend without a push. >>>>>> >>>>>> >>>>>> not sure if 'fixed' is the right word but it is always unstuck >>>>> right >>>>> after next (sometime next to next) burst of messages, never at random >>>>> time. >>>>> This behaviour is the same whether the period is 5 min or 15 min. >>>>> >>>>> >>>>> While I have been silent, I followed the ML discussion (I did not >>>> check >>>> the >>>> debug log further, though, as Pavel did excellent work here). To me, it >>>> looks like "normal" suspension code is kicking in. If an action fails, >>>> rsyslog retries once (except otherwise configured) and if it fails >>>> again, >>>> the action is initially suspended for 30 seconds. Then, retries happen >>>> and >>>> the suspension period is prolonged if they fail. >>>> >>>> It looks very much like this is the mechanism at work. However, what I >>>> don't understand is why the suspension period is so long. >>>> >>>> Out of all this, I think it would make much sense if rsyslog had the >>>> capability to report when an action is suspended and when it is >>>> resumed. I >>>> am right now adding this capability. I would suggest that when this >>>> change >>>> is ready, you apply it and we can than see what it reports (much easier >>>> than walking the debug log, and very obvious to users ;)). >>>> >>>> >>> as I mentioned I tried to do two changes to the test scenario and >>> tested >>> each of these separately: >>> >>> - send 200 message burst directly from collector-test to >>> collector-prod >>> (RELP), i.e. no load balancer >>> >>> - keep using the load balancer but in addition to 200 messages curst >>> every 5 min also send few (3) messages every minute >>> >>> Each of these scenarios work, as in traffic is smooth, NO silences. >>> This >>> means that something goes wrong if we use load balancer and there is no >>> traffic for 5 minutes. >>> >>> From our previous investigation of amazon elastic load balancer (which >>> is what we're using) it often lies about the connection, i.e. it has no >>> connection on the backend but it happily accepts connects and data and >>> pretends everything is fine (that's why we initially switched from plain >>> TCP to RELP). >>> >>> Not entirely sure what the load balancer is doing in this case but it >>> seems that rsyslog thinks the connection is fine and keeps sending data >>> to >>> the load balancer but connection is actually broken (maybe the load >>> balancer closed the connection between itself and the destination because >>> of some timeout). >>> >>> So the situation is not fixed until rsyslog closes the connection and >>> opens a new one. >>> >>> Is there any way for us to either make rsyslog keep the connection >>> alive >>> or to re-open it sooner? >>> >>> erik >>> >>> _______________________________________________ >>> rsyslog mailing list >>> http://lists.adiscon.net/mailman/listinfo/rsyslog >>> http://www.rsyslog.com/professional-services/ >>> What's up with rsyslog? Follow https://twitter.com/rgerhards >>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad >>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you >>> DON'T LIKE THAT. >>> >>> _______________________________________________ >> rsyslog mailing list >> http://lists.adiscon.net/mailman/listinfo/rsyslog >> http://www.rsyslog.com/professional-services/ >> What's up with rsyslog? Follow https://twitter.com/rgerhards >> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad >> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you >> DON'T LIKE THAT. >> >> > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com/professional-services/ > What's up with rsyslog? Follow https://twitter.com/rgerhards > NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad > of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you > DON'T LIKE THAT. > _______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT.

