On Wed, 27 Nov 2013, Erik Steffl wrote:

On 11/26/2013 07:06 AM, Pavel Levshin wrote:

There may be multiple flaws, unfortunately, but one of them is
definitely on the omrelp/librelp client side. Look below: in your debug
log, the main queue thread is blocked many times for significant periods
of time, and this happens in omrelp when it tries to establish a
connection.

One suggestion: try to attach an asynchronous queue with one worker
thread to the omrelp action; this way, the behaviour will be more
transparent to analyze, and the queue will appear as a separate line in
the statistics. Obviously, it will not fix your problem.

  Not sure what exactly you mean. Here's my action definition:

  action(
    type="omrelp"
    target="hostname"
    port="5140"
    template="json"
    queue.filename="json"
    queue.maxdiskspace="75161927680" # 70GB (valuable data)
  )

From what I read (plus some discussion on this mailing list), I was under the impression that specifying queue.filename creates a queue for this action.

actually, defining a queue type is what creates a queue; defining a queue filename converts that queue into a disk-assisted queue.
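For example (a sketch only, reusing the action from above and assuming current rsyslog v7 queue parameter names), an asynchronous, disk-assisted action queue with a single worker thread would look like:

  action(
    type="omrelp"
    target="hostname"
    port="5140"
    template="json"
    queue.type="linkedList"          # this is what creates an async queue
    queue.filename="json"            # ...and this makes it disk-assisted
    queue.maxdiskspace="75161927680" # 70GB (valuable data)
    queue.workerthreads="1"
  )

Without queue.type, the action queue stays in the default "direct" (synchronous) mode.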

And now the log of the RELP client's attempts. This is the first message
going to omrelp, and it fails:

6501.183080355:7fe205cce700: tryDoAction 0x2676750, pnElem 1, nElem 1
6501.183089317:7fe205cce700: omrelp: beginTransaction
6501.199774684:7fe205cce700: frame to send: '1 open 85 relp_version=0
6501.199832179:7fe205cce700: ADD sess 0x7fe1fc006c00 unacked 1, sessState 1
6501.199843299:7fe205cce700: sendbuf added to unacked list
6501.199856108:7fe205cce700: relp session read -1 octets, buf 'рTМ♣в⌂'
6501.199866355:7fe205cce700: relpSessWaitRsp waiting for data on fd 18,
timeout 90.0
*blocked here*
6560.213560747:7fe205cce700: relpSessWaitRsp select returns, nfds 1,
errno 11
6560.213581624:7fe205cce700: relp session read 0 octets, buf ''
6560.213590296:7fe205cce700: server closed relp session 0x7fe1fc006c00,
session broken
6560.213598704:7fe205cce700: relpSessWaitState returns 10007
6560.213607411:7fe205cce700: end relpSessConnect, iRet 10007
6560.213659199:7fe205cce700: in destructor: sendbuf 0x7fe1fc07b630
6560.213673384:7fe205cce700: Action 0x2676750 transitioned to state: rtry
6560.213691021:7fe205cce700: action 0x2676750 call returned -2007
6560.213700237:7fe205cce700: tryDoAction: unexpected error code
-2007[nElem 1, Commited UpTo 0], finalizing
6560.213708275:7fe205cce700: action ret RS_RET_SUSPENDED - retry full batch
6560.213716346:7fe205cce700: tryDoAction 0x2676750, pnElem 1, nElem 1
6560.213726514:7fe205cce700: actionDoRetry: enter loop, iRetries=0
6560.215546195:7fe205cce700: frame to send: '1 open 85 relp_version=0
6560.215581986:7fe205cce700: ADD sess 0x7fe1fc001d60 unacked 1, sessState 1
6560.215591846:7fe205cce700: sendbuf added to unacked list
6560.215603620:7fe205cce700: relp session read -1 octets, buf '°TМ♣в⌂'
6560.215618450:7fe205cce700: relpSessWaitRsp waiting for data on fd 18,
timeout 90.0
*blocked here*
6620.213583492:7fe205cce700: relpSessWaitRsp select returns, nfds 1,
errno 11
6620.213604912:7fe205cce700: relp session read 0 octets, buf ''
6620.213613612:7fe205cce700: server closed relp session 0x7fe1fc001d60,
session broken
...
6620.301584149:7fe205cce700: tryDoAction 0x2676750, pnElem 202, nElem 202
6620.301594581:7fe205cce700: actionTryResume: action 0x2676750 state:
susp, next retry (if applicable): 1385086650 [now 1385086620]
6620.301603383:7fe205cce700: omrelp: shutdownImmediate ptr now is 0x26754b8
6620.301611647:7fe205cce700: action 0x2676750 call returned -2123
6620.301619801:7fe205cce700: tryDoAction: unexpected error code
-2123[nElem 202, Commited UpTo 0], finalizing
6620.301634823:7fe205cce700: actionTryResume: action 0x2676750 state:
susp, next retry (if applicable): 1385086650 [now 1385086620]
6620.301645472:7fe205cce700: scriptExec: batch of 202 elements, active
0x7fe1fc0ad6f0, active[0]:0
6620.301653516:7fe205cce700:     ACTION 0x26770d0
[omrelp:action(type="omrelp" ...)]
...
6801.440540687:7fe205cce700: tryDoAction 0x2676750, pnElem 15, nElem 15
6801.440550455:7fe205cce700: Action 0x2676750 transitioned to state: rtry
6801.440558423:7fe205cce700: actionDoRetry: enter loop, iRetries=0
6801.451531113:7fe205cce700: frame to send: '1 open 85 relp_version=0
6801.451558856:7fe205cce700: ADD sess 0x7fe1fc082e90 unacked 1, sessState 1
6801.451577349:7fe205cce700: sendbuf added to unacked list
6801.451595362:7fe205cce700: relp session read -1 octets, buf '°TМ♣в⌂'
6801.451606358:7fe205cce700: relpSessWaitRsp waiting for data on fd 18,
timeout 90.0
*blocked here*
6861.228866968:7fe205cce700: relpSessWaitRsp select returns, nfds 1,
errno 11
6861.228899197:7fe205cce700: relp session read 0 octets, buf ''
6861.228913704:7fe205cce700: server closed relp session 0x7fe1fc082e90,
session broken
6861.228927199:7fe205cce700: relpSessWaitState returns 10007
6861.228940752:7fe205cce700: end relpSessConnect, iRet 10007
6861.229011547:7fe205cce700: in destructor: sendbuf 0x7fe1fc1722b0
6861.229030571:7fe205cce700: actionDoRetry: action->tryResume returned
-2007
6861.229053075:7fe205cce700: actionDoRetry: check for max retries,
iResumeRetryCount 0, iRetries 0
6861.229069635:7fe205cce700: Action 0x2676750 transitioned to state: susp
6861.229082444:7fe205cce700: action suspended, earliest retry=1385086891
6861.229097179:7fe205cce700: action 0x2676750 call returned -2007
6861.229110552:7fe205cce700: tryDoAction: unexpected error code
-2007[nElem 15, Commited UpTo 0], finalizing
...
7101.658771671:7fe205cce700: tryDoAction 0x2676750, pnElem 2, nElem 2
7101.658781284:7fe205cce700: Action 0x2676750 transitioned to state: rtry
7101.658883159:7fe205cce700: actionDoRetry: enter loop, iRetries=0
7101.669676285:7fe205cce700: frame to send: '1 open 85 relp_version=0

The previous RELP session has finally died, and now a new one is initiated.

7101.669794405:7fe205cce700: ADD sess 0x7fe1fc006d70 unacked 1, sessState 1
7101.669804532:7fe205cce700: sendbuf added to unacked list
7101.669817376:7fe205cce700: relp session read -1 octets, buf '°TМ♣в⌂'
7101.669834009:7fe205cce700: relpSessWaitRsp waiting for data on fd 17,
timeout 90.0
7103.444033377:7fe205cce700: relpSessWaitRsp select returns, nfds 1,
errno 11

And here, for the first time, there is a response from the RELP server!
Ten minutes after the first try. And right after errno 11? This does not
make sense to me.

7103.444060932:7fe205cce700: relp session read 102 octets, buf '1 rsp 92
200 OK
7103.444074603:7fe205cce700: relp engine is dispatching frame with
command 'rsp'
7103.444084099:7fe205cce700: in rsp command handler, txnr 1, code 200,
text 'OK'
7103.444093234:7fe205cce700: DEL sess 0x7fe1fc006d70 unacked 0, sessState 2

  Thanks!!! Major breakthrough!!!

Given that both the sender (collector-test) and the receiver (collector-prod) work just fine during this time (collector-prod is receiving and confirming tons of messages), it seems that there might be something wrong with this particular connection.

We have an Amazon Elastic Load Balancer (ELB) in front of collector-prod, so the TCP connection from collector-test goes to the ELB, which in turn connects to collector-prod.

I changed collector-test to send RELP messages directly to collector-prod, and it works now. It has been running since yesterday and there hasn't been a single silence period.

Not sure what's happening, but from other investigations I noticed that ELB seems to pretend everything went fine even when it does not deliver messages anywhere (which is why we switched to RELP). It seems that ELB keeps receiving RELP messages from collector-test but does not pass them on to collector-prod. At some point collector-test reopens the connection (not sure whether it's because ELB closes the connection or because rsyslog simply decides enough is enough and a new connection is needed), and it starts working again.

I guess 5 minutes of no communication is enough for ELB to time out something, since this problem does not happen when the bursts are only 1 minute apart.

I will run another test with ELB added back, but this time sending 1 message/minute to keep the connection alive. That should confirm that it is an ELB-related timeout.

Would it make sense for rsyslog to reopen the connection after timeouts? It seems that it receives several timeouts but keeps using the same connection. Given that we already have a timeout, the cost of reopening the connection is negligible, and it might resolve various tcp-is-stuck issues (like this one).

As you mentioned before, there seems to be some kind of bug in rsyslog that makes outbound RELP problems also stop inbound RELP communication, even though the two are not (logically) related (in our test scenario, inbound RELP goes to omfile, which should not be affected by omrelp).

  Thoughts? Any ideas how to work around the problem?

It looks like the reason the outbound RELP problems stop the inbound RELP is that the main queue gets large enough to trigger blocking (the soft-delay stuff we were talking about a day or so ago).

the normal way that rsyslog works is that all inbound messages go to the main queue, and then from there they get filtered and sent to the outputs.

If one output is blocked, messages cannot get processed from the main queue and so it fills up.

If you define a set of rules as a separate ruleset, and define an input as using that ruleset, it creates a second 'main queue' for that ruleset (you may have to explicitly define a queue for the ruleset, I'm not completely sure). At that point, that input and the outputs of that ruleset are completely independent of the other inputs and outputs.
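For instance (a sketch with hypothetical names, port, and file path, not tested), binding the inbound RELP listener to its own ruleset with its own queue would look something like:

  module(load="imrelp")

  # give inbound RELP its own ruleset and queue, so it keeps flowing
  # even while the main queue is blocked by a stuck omrelp output
  ruleset(name="relp_in" queue.type="linkedList") {
    action(type="omfile" file="/var/log/relp-in.log")
  }

  input(type="imrelp" port="2514" ruleset="relp_in")

Messages arriving on that input never touch the main queue, so a suspended omrelp action elsewhere cannot delay them.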

David Lang
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.
