Re: [rsyslog] Rsyslog with RELP not sending/receiving messages for long intervals

Rainer Gerhards Fri, 06 Dec 2013 01:04:23 -0800

On Fri, Dec 6, 2013 at 10:00 AM, Erik Steffl <[email protected]> wrote:


> On 12/05/2013 11:27 PM, Rainer Gerhards wrote:
>
>> On Fri, Dec 6, 2013 at 8:17 AM, Erik Steffl <[email protected]> wrote:
>>
>>     compiled it successfully but didn't try it yet, was trying to figure
>>> out
>>> what's going on with ELB and timeout.
>>>
>>>    Will try it tomorrow (fri pacific time), what will I be looking for?
>>>
>>
>>
>> sorry, don't understand that sentence.
>>
>
>   is the new version going to print additional logs to /var/log/syslog?


yes, it will log messages when an action is suspended and when it is
reactivated again. I think I also added stats counters for these events.
But was already a couple of days (and > 100 commits) ago, so I don't know
exactly out of my head.


> Or debug logs? Would these be triggered when there is timeout waiting for
> RELP response?


It's triggered whenever the action decides it fails. That's exactly what I
would like to know.

Rainer


>
>  Do you think it will log anything interesting given the new information I
>>> just posted?
>>>
>>>
>>>  Honestly, I haven't look deeply at it, because I need the other
>> information
>> first. Don't waste time on trying to find out things if you don't have the
>> basics...
>>
>
>   alright, will test with rsyslog I complied from git (master) and see
> what happens,
>
>   thanks,
>
>         erik
>
>
>
>> Rainer
>>
>>
>>     thanks!
>>>
>>>          erik
>>>
>>>
>>> On 12/05/2013 10:34 PM, Rainer Gerhards wrote:
>>>
>>>  Erik,
>>>>
>>>> have you tried the version I created for you? What does it report?
>>>>
>>>> Rainer
>>>>
>>>>
>>>> On Fri, Dec 6, 2013 at 5:10 AM, Erik Steffl <[email protected]> wrote:
>>>>
>>>>      at this point pretty sure it's related to Amazon Elastic Load
>>>> Balancer
>>>>
>>>>> (ELB) and some kind of timeout (it only happens if I use ELB and there
>>>>> is
>>>>> enough time between bursts of messages, e.g. 5 min triggers failure, 1
>>>>> minute does not).
>>>>>
>>>>>     Now I am back to original failure scenario (ELB and burst of 100
>>>>> messages every 5 minutes).
>>>>>
>>>>>     Here's output from lsof:
>>>>>
>>>>> ubuntu@ip-10-158-97-169:~$ sudo lsof | grep rsyslogd | grep TCP
>>>>> rsyslogd   9743          syslog    6u     IPv4            7177271 0t0
>>>>>      TCP *:5140 (LISTEN)
>>>>> rsyslogd   9743          syslog    7u     IPv6            7177272 0t0
>>>>>      TCP *:5140 (LISTEN)
>>>>> rsyslogd   9743          syslog   10u     IPv4            7187191 0t0
>>>>>      TCP ip-10-158-97-169.ec2.internal:48176->ec2-50-19-250-187.
>>>>> compute-1.amazonaws.com:5140 (CLOSE_WAIT)
>>>>> rsyslogd   9743          syslog   12u     IPv4            7183726 0t0
>>>>>      TCP ip-10-158-97-169.ec2.internal:5140->216.133.125.82:16543(
>>>>> ESTABLISHED)
>>>>>
>>>>>     where:
>>>>>
>>>>> ip-10-158-97-169.ec2.internal is the collector-test, the host that is
>>>>> sending messages
>>>>>
>>>>> ec2-50-19-250-187.compute-1.amazonaws.com is ELB
>>>>>
>>>>> Strace shows some interesting stuff, even though I didn't catch the
>>>>> beginning of the problem this time (running strace with -f options so
>>>>> the
>>>>> first number on each line is thread pid):
>>>>>
>>>>> 9842  03:08:44.391025 sendto(10, "204 syslog 380
>>>>> <134>2013-12-06T02:55:01.747931+00:00
>>>>> ip-10-158-97-169 logBurst[9908]: @cee:{\"message-json\":{\"now\":\"Fri
>>>>> Dec  6 02:55:01 2013\",\"i\":99},\"yummlyLogOrigin\":{\"
>>>>> supportLevel\":\"prod\",\"system\":\"LOGS\",\"cluster\":
>>>>> \"prod\",\"role\":\"collectorErik\",\"host\":\"ip-
>>>>> 10-158-97-169\",\"tag\":\"logBurst[9908]:\",\"
>>>>> programname\":\"logBurst\",\"priority\":\"local0.info\",\"
>>>>> timestamp\":\"2013-12-06T02:55:01.747931+00:00\"}}\n\n", 396, 0,
>>>>> NULL, 0
>>>>> <unfinished ...>
>>>>>
>>>>> ... then nothing related to fd 10 until:
>>>>>
>>>>> 9842  03:10:32.676803 <... sendto resumed> ) = -1 ETIMEDOUT (Connection
>>>>> timed out)
>>>>> 9842  03:10:32.676887 gettimeofday({1386299432, 676918}, NULL) = 0
>>>>> 9842  03:10:32.676994 setsockopt(10, SOL_SOCKET, SO_LINGER, {onoff=1,
>>>>> linger=0}, 8) = 0
>>>>> 9842  03:10:32.677101 close(10)         = 0
>>>>>
>>>>>     So I assume that means the connection is bad, rsyslog figures it
>>>>> out
>>>>> and
>>>>> closes the connection.
>>>>>
>>>>>     Then rsyslog looks up IP of ELB and opens another connection:
>>>>>
>>>>> 9842  03:10:32.687983 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 10
>>>>> 9842  03:10:32.688088 connect(10, {sa_family=AF_INET,
>>>>> sin_port=htons(5140), sin_addr=inet_addr("50.19.250.187")}, 16) = 0
>>>>> 9842  03:10:32.689592 clock_gettime(CLOCK_REALTIME, {1386299432,
>>>>> 689625848}) = 0
>>>>> 9842  03:10:32.689694 sendto(10, "1 open 85
>>>>> relp_version=0\nrelp_software=
>>>>> librelp,1.2.0,http://librelp.adiscon.com\ncommands=syslog\n";, 96, 0,
>>>>> NULL, 0) = 96
>>>>> 9842  03:10:32.689812 recvfrom(10, 0x7ffc58b53490, 32768, 64, 0, 0) =
>>>>> -1
>>>>> EAGAIN (Resource temporarily unavailable)
>>>>> ...
>>>>> 9842  03:11:32.214608 recvfrom(10, "", 32768, MSG_DONTWAIT, NULL, NULL)
>>>>> = 0
>>>>> ...
>>>>> 9750  03:12:16.994327 setsockopt(10, SOL_SOCKET, SO_LINGER, {onoff=1,
>>>>> linger=0}, 8 <unfinished ...>
>>>>> 9750  03:12:16.994443 close(10 <unfinished ...>
>>>>> 9750  03:12:16.994597 <... close resumed> ) = 0
>>>>>
>>>>>     So it seems like rsyslog is trying to open a new RELP session but
>>>>> it's
>>>>> getting ETIMEDOUT. This is very weird because all involved hosts
>>>>> (sender,
>>>>> ELB, receiver) work fine at the same time, they continue to send and
>>>>> receive data, ELB thinks the receiver is healthy etc. Everything seems
>>>>> to
>>>>> be working fine except rsyslog is not able to open working connection
>>>>> to
>>>>> ELB (to recevier via ELB).
>>>>>
>>>>>     Any ideas how to investigate this further? This seems to be a lot
>>>>> more
>>>>> specific then what I had before (just mysterious silence) so hopefully
>>>>> it
>>>>> makes more sense...
>>>>>
>>>>>     thanks!
>>>>>
>>>>>           erik
>>>>>
>>>>>
>>>>> On 11/30/2013 02:12 AM, Rainer Gerhards wrote:
>>>>>
>>>>>   If the connection fails, rsyslog *immediately* tries to re-open, but
>>>>>
>>>>>> that
>>>>>> seems to fail as well.
>>>>>>
>>>>>> Btw: git v7-devel and master branch niw contain the changes i made.
>>>>>>   Could
>>>>>> you please try one of these versions.
>>>>>>
>>>>>> Sent from phone, thus brief.
>>>>>> Am 30.11.2013 00:39 schrieb "Erik Steffl" <[email protected]>:
>>>>>>
>>>>>>    On 11/29/2013 01:48 AM, Rainer Gerhards wrote:
>>>>>>
>>>>>>
>>>>>>>    On Thu, Nov 28, 2013 at 2:07 AM, Erik Steffl <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>>     On 11/26/2013 07:06 AM, Pavel Levshin wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>    26.11.2013 1:20, Erik Steffl:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>        what does the above mean exactly? It does make sense, in
>>>>>>>>>> each
>>>>>>>>>>
>>>>>>>>>>  case
>>>>>>>>>>> the burst of messages gets in via /dev/log then is send out via
>>>>>>>>>>> RELP
>>>>>>>>>>> (either to the same machine or a different one).
>>>>>>>>>>>
>>>>>>>>>>>       Any ideas why this doesn't fix itself until the next burst
>>>>>>>>>>> of
>>>>>>>>>>> messages? Any suggestions on what to do or what to investigate
>>>>>>>>>>> next?
>>>>>>>>>>> I
>>>>>>>>>>> guess I could run these with strace and see what exactly they say
>>>>>>>>>>> to
>>>>>>>>>>> each other (both the sender and receiver).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>     I'm not sure why (and if) this is 'fixed' by the next burst
>>>>>>>>>>> of
>>>>>>>>>>>
>>>>>>>>>>>   messages,
>>>>>>>>>>>
>>>>>>>>>> but this can be somehow because next portion of messages pushes
>>>>>>>>>> the
>>>>>>>>>> queue. Perhaps it is unable to retry after suspend without a push.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>        not sure if 'fixed' is the right word but it is always
>>>>>>>>>> unstuck
>>>>>>>>>>
>>>>>>>>>>  right
>>>>>>>>> after next (sometime next to next) burst of messages, never at
>>>>>>>>> random
>>>>>>>>> time.
>>>>>>>>> This behaviour is the same whether the period is 5 min or 15 min.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    While I have been silent, I followed the ML discussion (I did
>>>>>>>>> not
>>>>>>>>>
>>>>>>>>>  check
>>>>>>>> the
>>>>>>>> debug log further, though, as Pavel did excellent work here). To me,
>>>>>>>> it
>>>>>>>> looks like "normal" suspension code is kicking in. If an action
>>>>>>>> fails,
>>>>>>>> rsyslog retries once (except otherwise configured) and if it fails
>>>>>>>> again,
>>>>>>>> the action is initially suspended for 30 seconds. Then, retries
>>>>>>>> happen
>>>>>>>> and
>>>>>>>> the suspension period is prolonged if they fail.
>>>>>>>>
>>>>>>>> It looks very much like this is the mechanism at work. However,
>>>>>>>> what I
>>>>>>>> don't understand is why the suspension period is so long.
>>>>>>>>
>>>>>>>> Out of all this, I think it would make much sense if rsyslog had the
>>>>>>>> capability to report when an action is suspended and when it is
>>>>>>>> resumed. I
>>>>>>>> am right now adding this capability. I would suggest that when this
>>>>>>>> change
>>>>>>>> is ready, you apply it and we can than see what it reports (much
>>>>>>>> easier
>>>>>>>> than walking the debug log, and very obvious to users ;)).
>>>>>>>>
>>>>>>>>
>>>>>>>>       as I mentioned I tried to do two changes to the test scenario
>>>>>>>> and
>>>>>>>>
>>>>>>> tested
>>>>>>> each of these separately:
>>>>>>>
>>>>>>>      - send 200 message burst directly from collector-test to
>>>>>>> collector-prod
>>>>>>> (RELP), i.e. no load balancer
>>>>>>>
>>>>>>>      - keep using the load balancer but in addition to 200 messages
>>>>>>> curst
>>>>>>> every 5 min also send few (3) messages every minute
>>>>>>>
>>>>>>>      Each of these scenarios work, as in traffic is smooth, NO
>>>>>>> silences.
>>>>>>> This
>>>>>>> means that something goes wrong if we use load balancer and there is
>>>>>>> no
>>>>>>> traffic for 5 minutes.
>>>>>>>
>>>>>>>      From our previous investigation of amazon elastic load balancer
>>>>>>> (which
>>>>>>> is what we're using) it often lies about the connection, i.e. it has
>>>>>>> no
>>>>>>> connection on the backend but it happily accepts connects and data
>>>>>>> and
>>>>>>> pretends everything is fine (that's why we initially switched from
>>>>>>> plain
>>>>>>> TCP to RELP).
>>>>>>>
>>>>>>>      Not entirely sure what the load balancer is doing in this case
>>>>>>> but
>>>>>>> it
>>>>>>> seems that rsyslog thinks the connection is fine and keeps sending
>>>>>>> data
>>>>>>> to
>>>>>>> the load balancer but connection is actually broken (maybe the load
>>>>>>> balancer closed the connection between itself and the destination
>>>>>>> because
>>>>>>> of some timeout).
>>>>>>>
>>>>>>>      So the situation is not fixed until rsyslog closes the
>>>>>>> connection
>>>>>>> and
>>>>>>> opens a new one.
>>>>>>>
>>>>>>>      Is there any way for us to either make rsyslog keep the
>>>>>>> connection
>>>>>>> alive
>>>>>>> or to re-open it sooner?
>>>>>>>
>>>>>>>            erik
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> rsyslog mailing list
>>>>>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>>>>>>> http://www.rsyslog.com/professional-services/
>>>>>>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>>>>>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
>>>>>>> myriad
>>>>>>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if
>>>>>>> you
>>>>>>> DON'T LIKE THAT.
>>>>>>>
>>>>>>>    _______________________________________________
>>>>>>>
>>>>>>>  rsyslog mailing list
>>>>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>>>>>> http://www.rsyslog.com/professional-services/
>>>>>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>>>>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
>>>>>> myriad
>>>>>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
>>>>>> DON'T LIKE THAT.
>>>>>>
>>>>>>
>>>>>>   _______________________________________________
>>>>>>
>>>>> rsyslog mailing list
>>>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>>>>> http://www.rsyslog.com/professional-services/
>>>>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>>>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
>>>>> myriad
>>>>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
>>>>> DON'T LIKE THAT.
>>>>>
>>>>>   _______________________________________________
>>>>>
>>>> rsyslog mailing list
>>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>>>> http://www.rsyslog.com/professional-services/
>>>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
>>>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
>>>> DON'T LIKE THAT.
>>>>
>>>>
>>>>  _______________________________________________
>>> rsyslog mailing list
>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>>> http://www.rsyslog.com/professional-services/
>>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
>>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
>>> DON'T LIKE THAT.
>>>
>>>  _______________________________________________
>> rsyslog mailing list
>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> http://www.rsyslog.com/professional-services/
>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
>> DON'T LIKE THAT.
>>
>>
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com/professional-services/
> What's up with rsyslog? Follow https://twitter.com/rgerhards
> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
> DON'T LIKE THAT.
>
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.

Re: [rsyslog] Rsyslog with RELP not sending/receiving messages for long intervals

Reply via email to