Erik,

have you tried the version I created for you? What does it report?

Rainer


On Fri, Dec 6, 2013 at 5:10 AM, Erik Steffl <[email protected]> wrote:

>   at this point pretty sure it's related to Amazon Elastic Load Balancer
> (ELB) and some kind of timeout (it only happens if I use ELB and there is
> enough time between bursts of messages, e.g. 5 min triggers failure, 1
> minute does not).
>
>   Now I am back to original failure scenario (ELB and burst of 100
> messages every 5 minutes).
>
>   Here's output from lsof:
>
> ubuntu@ip-10-158-97-169:~$ sudo lsof | grep rsyslogd | grep TCP
> rsyslogd   9743          syslog    6u     IPv4            7177271 0t0
>    TCP *:5140 (LISTEN)
> rsyslogd   9743          syslog    7u     IPv6            7177272 0t0
>    TCP *:5140 (LISTEN)
> rsyslogd   9743          syslog   10u     IPv4            7187191 0t0
>    TCP ip-10-158-97-169.ec2.internal:48176->ec2-50-19-250-187.
> compute-1.amazonaws.com:5140 (CLOSE_WAIT)
> rsyslogd   9743          syslog   12u     IPv4            7183726 0t0
>    TCP ip-10-158-97-169.ec2.internal:5140->216.133.125.82:16543(ESTABLISHED)
>
>   where:
>
> ip-10-158-97-169.ec2.internal is the collector-test, the host that is
> sending messages
>
> ec2-50-19-250-187.compute-1.amazonaws.com is ELB
>
> Strace shows some interesting stuff, even though I didn't catch the
> beginning of the problem this time (running strace with -f options so the
> first number on each line is thread pid):
>
> 9842  03:08:44.391025 sendto(10, "204 syslog 380 
> <134>2013-12-06T02:55:01.747931+00:00
> ip-10-158-97-169 logBurst[9908]: @cee:{\"message-json\":{\"now\":\"Fri
> Dec  6 02:55:01 2013\",\"i\":99},\"yummlyLogOrigin\":{\"
> supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":
> \"prod\",\"role\":\"collectorErik\",\"host\":\"ip-
> 10-158-97-169\",\"tag\":\"logBurst[9908]:\",\"
> programname\":\"logBurst\",\"priority\":\"local0.info\",\"
> timestamp\":\"2013-12-06T02:55:01.747931+00:00\"}}\n\n", 396, 0, NULL, 0
> <unfinished ...>
>
> ... then nothing related to fd 10 until:
>
> 9842  03:10:32.676803 <... sendto resumed> ) = -1 ETIMEDOUT (Connection
> timed out)
> 9842  03:10:32.676887 gettimeofday({1386299432, 676918}, NULL) = 0
> 9842  03:10:32.676994 setsockopt(10, SOL_SOCKET, SO_LINGER, {onoff=1,
> linger=0}, 8) = 0
> 9842  03:10:32.677101 close(10)         = 0
>
>   So I assume that means the connection is bad, rsyslog figures it out and
> closes the connection.
>
>   Then rsyslog looks up IP of ELB and opens another connection:
>
> 9842  03:10:32.687983 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 10
> 9842  03:10:32.688088 connect(10, {sa_family=AF_INET,
> sin_port=htons(5140), sin_addr=inet_addr("50.19.250.187")}, 16) = 0
> 9842  03:10:32.689592 clock_gettime(CLOCK_REALTIME, {1386299432,
> 689625848}) = 0
> 9842  03:10:32.689694 sendto(10, "1 open 85 relp_version=0\nrelp_software=
> librelp,1.2.0,http://librelp.adiscon.com\ncommands=syslog\n";, 96, 0,
> NULL, 0) = 96
> 9842  03:10:32.689812 recvfrom(10, 0x7ffc58b53490, 32768, 64, 0, 0) = -1
> EAGAIN (Resource temporarily unavailable)
> ...
> 9842  03:11:32.214608 recvfrom(10, "", 32768, MSG_DONTWAIT, NULL, NULL) = 0
> ...
> 9750  03:12:16.994327 setsockopt(10, SOL_SOCKET, SO_LINGER, {onoff=1,
> linger=0}, 8 <unfinished ...>
> 9750  03:12:16.994443 close(10 <unfinished ...>
> 9750  03:12:16.994597 <... close resumed> ) = 0
>
>   So it seems like rsyslog is trying to open a new RELP session but it's
> getting ETIMEDOUT. This is very weird because all involved hosts (sender,
> ELB, receiver) work fine at the same time, they continue to send and
> receive data, ELB thinks the receiver is healthy etc. Everything seems to
> be working fine except rsyslog is not able to open working connection to
> ELB (to recevier via ELB).
>
>   Any ideas how to investigate this further? This seems to be a lot more
> specific then what I had before (just mysterious silence) so hopefully it
> makes more sense...
>
>   thanks!
>
>         erik
>
>
> On 11/30/2013 02:12 AM, Rainer Gerhards wrote:
>
>> If the connection fails, rsyslog *immediately* tries to re-open, but that
>> seems to fail as well.
>>
>> Btw: git v7-devel and master branch niw contain the changes i made.  Could
>> you please try one of these versions.
>>
>> Sent from phone, thus brief.
>> Am 30.11.2013 00:39 schrieb "Erik Steffl" <[email protected]>:
>>
>>  On 11/29/2013 01:48 AM, Rainer Gerhards wrote:
>>>
>>>  On Thu, Nov 28, 2013 at 2:07 AM, Erik Steffl <[email protected]> wrote:
>>>>
>>>>   On 11/26/2013 07:06 AM, Pavel Levshin wrote:
>>>>
>>>>>
>>>>>
>>>>>  26.11.2013 1:20, Erik Steffl:
>>>>>>
>>>>>>
>>>>>>      what does the above mean exactly? It does make sense, in each
>>>>>>> case
>>>>>>> the burst of messages gets in via /dev/log then is send out via RELP
>>>>>>> (either to the same machine or a different one).
>>>>>>>
>>>>>>>     Any ideas why this doesn't fix itself until the next burst of
>>>>>>> messages? Any suggestions on what to do or what to investigate next?
>>>>>>> I
>>>>>>> guess I could run these with strace and see what exactly they say to
>>>>>>> each other (both the sender and receiver).
>>>>>>>
>>>>>>>
>>>>>>>   I'm not sure why (and if) this is 'fixed' by the next burst of
>>>>>>>
>>>>>> messages,
>>>>>> but this can be somehow because next portion of messages pushes the
>>>>>> queue. Perhaps it is unable to retry after suspend without a push.
>>>>>>
>>>>>>
>>>>>>      not sure if 'fixed' is the right word but it is always unstuck
>>>>> right
>>>>> after next (sometime next to next) burst of messages, never at random
>>>>> time.
>>>>> This behaviour is the same whether the period is 5 min or 15 min.
>>>>>
>>>>>
>>>>>  While I have been silent, I followed the ML discussion (I did not
>>>> check
>>>> the
>>>> debug log further, though, as Pavel did excellent work here). To me, it
>>>> looks like "normal" suspension code is kicking in. If an action fails,
>>>> rsyslog retries once (except otherwise configured) and if it fails
>>>> again,
>>>> the action is initially suspended for 30 seconds. Then, retries happen
>>>> and
>>>> the suspension period is prolonged if they fail.
>>>>
>>>> It looks very much like this is the mechanism at work. However, what I
>>>> don't understand is why the suspension period is so long.
>>>>
>>>> Out of all this, I think it would make much sense if rsyslog had the
>>>> capability to report when an action is suspended and when it is
>>>> resumed. I
>>>> am right now adding this capability. I would suggest that when this
>>>> change
>>>> is ready, you apply it and we can than see what it reports (much easier
>>>> than walking the debug log, and very obvious to users ;)).
>>>>
>>>>
>>>    as I mentioned I tried to do two changes to the test scenario and
>>> tested
>>> each of these separately:
>>>
>>>    - send 200 message burst directly from collector-test to
>>> collector-prod
>>> (RELP), i.e. no load balancer
>>>
>>>    - keep using the load balancer but in addition to 200 messages curst
>>> every 5 min also send few (3) messages every minute
>>>
>>>    Each of these scenarios work, as in traffic is smooth, NO silences.
>>> This
>>> means that something goes wrong if we use load balancer and there is no
>>> traffic for 5 minutes.
>>>
>>>    From our previous investigation of amazon elastic load balancer (which
>>> is what we're using) it often lies about the connection, i.e. it has no
>>> connection on the backend but it happily accepts connects and data and
>>> pretends everything is fine (that's why we initially switched from plain
>>> TCP to RELP).
>>>
>>>    Not entirely sure what the load balancer is doing in this case but it
>>> seems that rsyslog thinks the connection is fine and keeps sending data
>>> to
>>> the load balancer but connection is actually broken (maybe the load
>>> balancer closed the connection between itself and the destination because
>>> of some timeout).
>>>
>>>    So the situation is not fixed until rsyslog closes the connection and
>>> opens a new one.
>>>
>>>    Is there any way for us to either make rsyslog keep the connection
>>> alive
>>> or to re-open it sooner?
>>>
>>>          erik
>>>
>>> _______________________________________________
>>> rsyslog mailing list
>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>>> http://www.rsyslog.com/professional-services/
>>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
>>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
>>> DON'T LIKE THAT.
>>>
>>>  _______________________________________________
>> rsyslog mailing list
>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> http://www.rsyslog.com/professional-services/
>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
>> DON'T LIKE THAT.
>>
>>
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com/professional-services/
> What's up with rsyslog? Follow https://twitter.com/rgerhards
> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
> DON'T LIKE THAT.
>
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.

Reply via email to