On Fri, Dec 6, 2013 at 8:17 AM, Erik Steffl <[email protected]> wrote:

>   compiled it successfully but didn't try it yet, was trying to figure out
> what's going on with ELB and timeout.
>
>   Will try it tomorrow (fri pacific time), what will I be looking for?


sorry, don't understand that sentence.


> Do you think it will log anything interesting given the new information I
> just posted?
>
>
Honestly, I haven't looked deeply at it, because I need the other information
first. Don't waste time trying to find things out if you don't have the
basics...

Rainer


>   thanks!
>
>         erik
>
>
> On 12/05/2013 10:34 PM, Rainer Gerhards wrote:
>
>> Erik,
>>
>> have you tried the version I created for you? What does it report?
>>
>> Rainer
>>
>>
>> On Fri, Dec 6, 2013 at 5:10 AM, Erik Steffl <[email protected]> wrote:
>>
>>>    at this point I'm pretty sure it's related to Amazon Elastic Load
>>> Balancer (ELB) and some kind of timeout (it only happens if I use ELB and
>>> there is enough time between bursts of messages, e.g. 5 min triggers
>>> failure, 1 minute does not).
>>>
>>>    Now I am back to the original failure scenario (ELB and a burst of 100
>>> messages every 5 minutes).
>>>
>>>    Here's output from lsof:
>>>
>>> ubuntu@ip-10-158-97-169:~$ sudo lsof | grep rsyslogd | grep TCP
>>> rsyslogd   9743          syslog    6u     IPv4            7177271 0t0
>>>     TCP *:5140 (LISTEN)
>>> rsyslogd   9743          syslog    7u     IPv6            7177272 0t0
>>>     TCP *:5140 (LISTEN)
>>> rsyslogd   9743          syslog   10u     IPv4            7187191 0t0
>>>     TCP ip-10-158-97-169.ec2.internal:48176->ec2-50-19-250-187.
>>> compute-1.amazonaws.com:5140 (CLOSE_WAIT)
>>> rsyslogd   9743          syslog   12u     IPv4            7183726 0t0
>>>     TCP ip-10-158-97-169.ec2.internal:5140->216.133.125.82:16543(
>>> ESTABLISHED)
>>>
>>>    where:
>>>
>>> ip-10-158-97-169.ec2.internal is the collector-test, the host that is
>>> sending messages
>>>
>>> ec2-50-19-250-187.compute-1.amazonaws.com is ELB
>>>
>>> Strace shows some interesting stuff, even though I didn't catch the
>>> beginning of the problem this time (running strace with the -f option, so
>>> the first number on each line is the thread pid):
>>>
>>> 9842  03:08:44.391025 sendto(10, "204 syslog 380
>>> <134>2013-12-06T02:55:01.747931+00:00
>>> ip-10-158-97-169 logBurst[9908]: @cee:{\"message-json\":{\"now\":\"Fri
>>> Dec  6 02:55:01 2013\",\"i\":99},\"yummlyLogOrigin\":{\"
>>> supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":
>>> \"prod\",\"role\":\"collectorErik\",\"host\":\"ip-
>>> 10-158-97-169\",\"tag\":\"logBurst[9908]:\",\"
>>> programname\":\"logBurst\",\"priority\":\"local0.info\",\"
>>> timestamp\":\"2013-12-06T02:55:01.747931+00:00\"}}\n\n", 396, 0, NULL, 0
>>> <unfinished ...>
>>>
>>> ... then nothing related to fd 10 until:
>>>
>>> 9842  03:10:32.676803 <... sendto resumed> ) = -1 ETIMEDOUT (Connection
>>> timed out)
>>> 9842  03:10:32.676887 gettimeofday({1386299432, 676918}, NULL) = 0
>>> 9842  03:10:32.676994 setsockopt(10, SOL_SOCKET, SO_LINGER, {onoff=1,
>>> linger=0}, 8) = 0
>>> 9842  03:10:32.677101 close(10)         = 0
>>>
>>>    So I assume that means the connection is bad; rsyslog figures it out
>>> and closes the connection.
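[Editor's note: the SO_LINGER/close() pair in the trace above is the standard abortive-close idiom: with onoff=1 and linger=0, close() discards unsent data and resets the connection immediately instead of doing a graceful FIN exchange. A minimal Python sketch of the same two calls; the socket here is only a stand-in for rsyslog's fd 10.]

```python
import socket
import struct

# Stand-in for rsyslog's connection socket (fd 10 in the strace).
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# struct linger { int l_onoff; int l_linger; } with onoff=1, linger=0:
# close() will reset the connection (RST) rather than shut it down
# gracefully -- the same setsockopt(..., SO_LINGER, {onoff=1, linger=0}, 8)
# seen in the trace.
s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))

# Read the option back to confirm what was set.
onoff, linger = struct.unpack(
    "ii", s.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, 8))
print(onoff, linger)  # 1 0
s.close()             # corresponds to close(10) in the trace
```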
>>>
>>>    Then rsyslog looks up the IP of the ELB and opens another connection:
>>>
>>> 9842  03:10:32.687983 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 10
>>> 9842  03:10:32.688088 connect(10, {sa_family=AF_INET,
>>> sin_port=htons(5140), sin_addr=inet_addr("50.19.250.187")}, 16) = 0
>>> 9842  03:10:32.689592 clock_gettime(CLOCK_REALTIME, {1386299432,
>>> 689625848}) = 0
>>> 9842  03:10:32.689694 sendto(10, "1 open 85
>>> relp_version=0\nrelp_software=
>>> librelp,1.2.0,http://librelp.adiscon.com\ncommands=syslog\n";, 96, 0,
>>> NULL, 0) = 96
>>> 9842  03:10:32.689812 recvfrom(10, 0x7ffc58b53490, 32768, 64, 0, 0) = -1
>>> EAGAIN (Resource temporarily unavailable)
>>> ...
>>> 9842  03:11:32.214608 recvfrom(10, "", 32768, MSG_DONTWAIT, NULL, NULL)
>>> = 0
>>> ...
>>> 9750  03:12:16.994327 setsockopt(10, SOL_SOCKET, SO_LINGER, {onoff=1,
>>> linger=0}, 8 <unfinished ...>
>>> 9750  03:12:16.994443 close(10 <unfinished ...>
>>> 9750  03:12:16.994597 <... close resumed> ) = 0
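[Editor's note: the `recvfrom(10, "", 32768, ...) = 0` at 03:11:32 is not an error. A zero-length read on a stream socket means the peer (here the ELB) performed an orderly close of its side of the connection. A small illustration using a local socket pair:]

```python
import socket

# A connected pair of stream sockets, standing in for sender and ELB.
a, b = socket.socketpair()

b.close()             # the "ELB" closes its end of the connection

data = a.recv(32768)  # returns b'' -- end of stream, the recvfrom(...) = 0 case
print(repr(data))     # b''
a.close()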
>>>
>>>    So it seems like rsyslog is trying to open a new RELP session but it's
>>> getting ETIMEDOUT. This is very weird because all involved hosts (sender,
>>> ELB, receiver) work fine at the same time: they continue to send and
>>> receive data, ELB thinks the receiver is healthy, etc. Everything seems
>>> to be working fine except that rsyslog is not able to open a working
>>> connection to ELB (to the receiver via ELB).
>>>
>>>    Any ideas how to investigate this further? This seems to be a lot more
>>> specific than what I had before (just mysterious silence) so hopefully it
>>> makes more sense...
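[Editor's note: the `"1 open 85 relp_version=..."` payload in the sendto above shows RELP's framing: a transaction number, a command, the data length in bytes, that many bytes of data, and a trailing LF. The arithmetic checks out: `"1 open 85 "` is 10 bytes, plus 85 bytes of data, plus the LF, equals the 96 bytes sent. A parser sketch under that layout; the helper name is mine:]

```python
def parse_relp_frame(buf: bytes):
    """Split one RELP frame: b'TXNR SP COMMAND SP DATALEN SP DATA LF'."""
    txnr, command, rest = buf.split(b" ", 2)   # transaction number, command
    datalen, rest = rest.split(b" ", 1)        # declared data length
    n = int(datalen)
    data, trailer = rest[:n], rest[n:]         # data may itself contain LF/spaces
    assert trailer == b"\n", "frame must end with LF"
    return int(txnr), command.decode(), data

# 30 bytes of data: len(b"relp_version=0\ncommands=syslog") == 30
frame = b"1 open 30 relp_version=0\ncommands=syslog\n"
print(parse_relp_frame(frame))
```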
>>>
>>>    thanks!
>>>
>>>          erik
>>>
>>>
>>> On 11/30/2013 02:12 AM, Rainer Gerhards wrote:
>>>
>>>  If the connection fails, rsyslog *immediately* tries to re-open, but
>>>> that
>>>> seems to fail as well.
>>>>
>>>> Btw: git v7-devel and master branch now contain the changes I made.
>>>> Could you please try one of these versions?
>>>>
>>>> Sent from phone, thus brief.
>>>> Am 30.11.2013 00:39 schrieb "Erik Steffl" <[email protected]>:
>>>>
>>>>   On 11/29/2013 01:48 AM, Rainer Gerhards wrote:
>>>>
>>>>>
>>>>>   On Thu, Nov 28, 2013 at 2:07 AM, Erik Steffl <[email protected]>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>    On 11/26/2013 07:06 AM, Pavel Levshin wrote:
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>   26.11.2013 1:20, Erik Steffl:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>    what does the above mean exactly? It does make sense, in each
>>>>>>>>> case the burst of messages gets in via /dev/log and then is sent
>>>>>>>>> out via RELP (either to the same machine or a different one).
>>>>>>>>>
>>>>>>>>>    Any ideas why this doesn't fix itself until the next burst of
>>>>>>>>> messages? Any suggestions on what to do or what to investigate
>>>>>>>>> next? I guess I could run these with strace and see what exactly
>>>>>>>>> they say to each other (both the sender and receiver).
>>>>>>>>>
>>>>>>>>    I'm not sure why (and if) this is 'fixed' by the next burst of
>>>>>>>> messages, but it may be because the next portion of messages pushes
>>>>>>>> the queue. Perhaps it is unable to retry after suspend without a
>>>>>>>> push.
>>>>>>>>
>>>>>>>    not sure if 'fixed' is the right word, but it always gets unstuck
>>>>>>> right after the next (sometimes the one after next) burst of
>>>>>>> messages, never at a random time. This behaviour is the same whether
>>>>>>> the period is 5 min or 15 min.
>>>>>>>
>>>>>>>
>>>>>> While I have been silent, I followed the ML discussion (I did not
>>>>>> check the debug log further, though, as Pavel did excellent work
>>>>>> here). To me, it looks like the "normal" suspension code is kicking
>>>>>> in. If an action fails, rsyslog retries once (unless otherwise
>>>>>> configured) and if it fails again, the action is initially suspended
>>>>>> for 30 seconds. Then, retries happen and the suspension period is
>>>>>> prolonged if they fail.
>>>>>>
>>>>>> It looks very much like this is the mechanism at work. However, what I
>>>>>> don't understand is why the suspension period is so long.
>>>>>>
>>>>>> Out of all this, I think it would make a lot of sense if rsyslog had
>>>>>> the capability to report when an action is suspended and when it is
>>>>>> resumed. I am right now adding this capability. I would suggest that
>>>>>> when this change is ready, you apply it and we can then see what it
>>>>>> reports (much easier than walking the debug log, and very obvious to
>>>>>> users ;)).
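[Editor's note: the retry behaviour described above can be sketched as an escalating suspension interval. The 30-second initial value is from the mail; the doubling factor and the cap below are illustrative assumptions, not rsyslog's actual constants.]

```python
# Illustrative model of action suspension: initial 30 s (per the description
# above); the doubling and the 30-minute cap are assumptions for this sketch.
def suspension_intervals(failures, initial=30, cap=1800):
    """Yield the suspension period (seconds) before each successive retry."""
    interval = initial
    for _ in range(failures):
        yield interval
        interval = min(interval * 2, cap)

print(list(suspension_intervals(5)))  # [30, 60, 120, 240, 480]
```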
>>>>>>
>>>>>>
>>>>>    as I mentioned, I tried two changes to the test scenario and tested
>>>>> each of them separately:
>>>>>
>>>>>    - send the 200-message burst directly from collector-test to
>>>>> collector-prod (RELP), i.e. no load balancer
>>>>>
>>>>>    - keep using the load balancer but, in addition to the 200-message
>>>>> burst every 5 min, also send a few (3) messages every minute
>>>>>
>>>>>    Each of these scenarios works, as in traffic is smooth, NO silences.
>>>>> This means that something goes wrong if we use the load balancer and
>>>>> there is no traffic for 5 minutes.
>>>>>
>>>>>    From our previous investigation of Amazon Elastic Load Balancer
>>>>> (which is what we're using), it often lies about the connection, i.e.
>>>>> it has no connection on the backend but happily accepts connections and
>>>>> data and pretends everything is fine (that's why we initially switched
>>>>> from plain TCP to RELP).
>>>>>
>>>>>    Not entirely sure what the load balancer is doing in this case, but
>>>>> it seems that rsyslog thinks the connection is fine and keeps sending
>>>>> data to the load balancer while the connection is actually broken
>>>>> (maybe the load balancer closed the connection between itself and the
>>>>> destination because of some timeout).
>>>>>
>>>>>    So the situation is not fixed until rsyslog closes the connection
>>>>> and opens a new one.
>>>>>
>>>>>     Is there any way for us to either make rsyslog keep the connection
>>>>> alive
>>>>> or to re-open it sooner?
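[Editor's note: one socket-level answer to the first half of that question is TCP keepalive: probes sent while the connection is idle can keep a load balancer from silently dropping the backend connection. The sketch below shows the mechanism only; whether rsyslog's RELP output exposes these knobs depends on the rsyslog/librelp version, and the 60/10/3 values (chosen to stay well under a 5-minute idle window) plus the Linux-specific TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT option names are illustrative assumptions.]

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Enable keepalive, then tune it (Linux-specific options; values illustrative).
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before first probe
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # failed probes before the
                                                            # connection is declared dead
enabled = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
print(enabled)
s.close()
```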
>>>>>
>>>>>           erik
>>>>>
>>>>> _______________________________________________
>>>>> rsyslog mailing list
>>>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>>>>> http://www.rsyslog.com/professional-services/
>>>>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>>>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
>>>>> myriad
>>>>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
>>>>> DON'T LIKE THAT.
>>>>>