On Wed, 27 Nov 2013, Erik Steffl wrote:

On 11/26/2013 07:06 AM, Pavel Levshin wrote:

There may be multiple flaws, unfortunately, but one of them is
definitely on the omrelp/librelp client side. Look below: in your debug
log, the main queue thread is blocked many times for significant periods
of time, and this happens in omrelp when it tries to establish a
connection.

One suggestion: try to attach an asynchronous queue with one worker
thread to the omrelp action; this way, the behaviour will be more
transparent to analyze, and the queue will appear as a separate line in
the statistics. Obviously, it will not fix your problem.

  Not sure what exactly you mean. Here's my action definition:

  action(
    type="omrelp"
    target="hostname"
    port="5140"
    template="json"
    queue.filename="json"
    queue.maxdiskspace="75161927680" # 70GB (valuable data)
  )

From what I read (plus some discussion on this mailing list), I was under the impression that specifying queue.filename creates a queue for this action.

actually, defining a queue type is what creates a queue; defining a queue filename converts that queue into a disk-assisted queue.
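For example (a sketch only, reusing the action from above and assuming current rsyslog v7 queue parameter names), an asynchronous, disk-assisted action queue with a single worker thread would look like:

  action(
    type="omrelp"
    target="hostname"
    port="5140"
    template="json"
    queue.type="linkedList"          # this is what creates an async queue
    queue.filename="json"            # ...and this makes it disk-assisted
    queue.maxdiskspace="75161927680" # 70GB (valuable data)
    queue.workerthreads="1"
  )

Without queue.type, the action queue stays in the default "direct" (synchronous) mode.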

And now the log of the RELP client's attempts. This is the first message
going to omrelp, and it fails:

6501.183080355:7fe205cce700: tryDoAction 0x2676750, pnElem 1, nElem 1
6501.183089317:7fe205cce700: omrelp: beginTransaction
6501.199774684:7fe205cce700: frame to send: '1 open 85 relp_version=0
6501.199832179:7fe205cce700: ADD sess 0x7fe1fc006c00 unacked 1, sessState 1
6501.199843299:7fe205cce700: sendbuf added to unacked list
6501.199856108:7fe205cce700: relp session read -1 octets, buf 'рTМ♣в⌂'
6501.199866355:7fe205cce700: relpSessWaitRsp waiting for data on fd 18,
timeout 90.0
*blocked here*
6560.213560747:7fe205cce700: relpSessWaitRsp select returns, nfds 1,
errno 11
6560.213581624:7fe205cce700: relp session read 0 octets, buf ''
6560.213590296:7fe205cce700: server closed relp session 0x7fe1fc006c00,
session broken
6560.213598704:7fe205cce700: relpSessWaitState returns 10007
6560.213607411:7fe205cce700: end relpSessConnect, iRet 10007
6560.213659199:7fe205cce700: in destructor: sendbuf 0x7fe1fc07b630
6560.213673384:7fe205cce700: Action 0x2676750 transitioned to state: rtry
6560.213691021:7fe205cce700: action 0x2676750 call returned -2007
6560.213700237:7fe205cce700: tryDoAction: unexpected error code
-2007[nElem 1, Commited UpTo 0], finalizing
6560.213708275:7fe205cce700: action ret RS_RET_SUSPENDED - retry full batch
6560.213716346:7fe205cce700: tryDoAction 0x2676750, pnElem 1, nElem 1
6560.213726514:7fe205cce700: actionDoRetry: enter loop, iRetries=0
6560.215546195:7fe205cce700: frame to send: '1 open 85 relp_version=0
6560.215581986:7fe205cce700: ADD sess 0x7fe1fc001d60 unacked 1, sessState 1
6560.215591846:7fe205cce700: sendbuf added to unacked list
6560.215603620:7fe205cce700: relp session read -1 octets, buf '°TМ♣в⌂'
6560.215618450:7fe205cce700: relpSessWaitRsp waiting for data on fd 18,
timeout 90.0
*blocked here*
6620.213583492:7fe205cce700: relpSessWaitRsp select returns, nfds 1,
errno 11
6620.213604912:7fe205cce700: relp session read 0 octets, buf ''
6620.213613612:7fe205cce700: server closed relp session 0x7fe1fc001d60,
session broken
...
6620.301584149:7fe205cce700: tryDoAction 0x2676750, pnElem 202, nElem 202
6620.301594581:7fe205cce700: actionTryResume: action 0x2676750 state:
susp, next retry (if applicable): 1385086650 [now 1385086620]
6620.301603383:7fe205cce700: omrelp: shutdownImmediate ptr now is 0x26754b8
6620.301611647:7fe205cce700: action 0x2676750 call returned -2123
6620.301619801:7fe205cce700: tryDoAction: unexpected error code
-2123[nElem 202, Commited UpTo 0], finalizing
6620.301634823:7fe205cce700: actionTryResume: action 0x2676750 state:
susp, next retry (if applicable): 1385086650 [now 1385086620]
6620.301645472:7fe205cce700: scriptExec: batch of 202 elements, active
0x7fe1fc0ad6f0, active[0]:0
6620.301653516:7fe205cce700:     ACTION 0x26770d0
[omrelp:action(type="omrelp" ...)]
...
6801.440540687:7fe205cce700: tryDoAction 0x2676750, pnElem 15, nElem 15
6801.440550455:7fe205cce700: Action 0x2676750 transitioned to state: rtry
6801.440558423:7fe205cce700: actionDoRetry: enter loop, iRetries=0
6801.451531113:7fe205cce700: frame to send: '1 open 85 relp_version=0
6801.451558856:7fe205cce700: ADD sess 0x7fe1fc082e90 unacked 1, sessState 1
6801.451577349:7fe205cce700: sendbuf added to unacked list
6801.451595362:7fe205cce700: relp session read -1 octets, buf '°TМ♣в⌂'
6801.451606358:7fe205cce700: relpSessWaitRsp waiting for data on fd 18,
timeout 90.0
*blocked here*
6861.228866968:7fe205cce700: relpSessWaitRsp select returns, nfds 1,
errno 11
6861.228899197:7fe205cce700: relp session read 0 octets, buf ''
6861.228913704:7fe205cce700: server closed relp session 0x7fe1fc082e90,
session broken
6861.228927199:7fe205cce700: relpSessWaitState returns 10007
6861.228940752:7fe205cce700: end relpSessConnect, iRet 10007
6861.229011547:7fe205cce700: in destructor: sendbuf 0x7fe1fc1722b0
6861.229030571:7fe205cce700: actionDoRetry: action->tryResume returned
-2007
6861.229053075:7fe205cce700: actionDoRetry: check for max retries,
iResumeRetryCount 0, iRetries 0
6861.229069635:7fe205cce700: Action 0x2676750 transitioned to state: susp
6861.229082444:7fe205cce700: action suspended, earliest retry=1385086891
6861.229097179:7fe205cce700: action 0x2676750 call returned -2007
6861.229110552:7fe205cce700: tryDoAction: unexpected error code
-2007[nElem 15, Commited UpTo 0], finalizing
...
7101.658771671:7fe205cce700: tryDoAction 0x2676750, pnElem 2, nElem 2
7101.658781284:7fe205cce700: Action 0x2676750 transitioned to state: rtry
7101.658883159:7fe205cce700: actionDoRetry: enter loop, iRetries=0
7101.669676285:7fe205cce700: frame to send: '1 open 85 relp_version=0

The previous RELP session has finally died, and now a new one is initiated.

7101.669794405:7fe205cce700: ADD sess 0x7fe1fc006d70 unacked 1, sessState 1
7101.669804532:7fe205cce700: sendbuf added to unacked list
7101.669817376:7fe205cce700: relp session read -1 octets, buf '°TМ♣в⌂'
7101.669834009:7fe205cce700: relpSessWaitRsp waiting for data on fd 17,
timeout 90.0
7103.444033377:7fe205cce700: relpSessWaitRsp select returns, nfds 1,
errno 11

And here, for the first time, there is a response from the RELP server!
Ten minutes after the first try. And right after errno 11? This does not
make sense to me.

7103.444060932:7fe205cce700: relp session read 102 octets, buf '1 rsp 92
200 OK
7103.444074603:7fe205cce700: relp engine is dispatching frame with
command 'rsp'
7103.444084099:7fe205cce700: in rsp command handler, txnr 1, code 200,
text 'OK'
7103.444093234:7fe205cce700: DEL sess 0x7fe1fc006d70 unacked 0, sessState 2

  Thanks!!! Major breakthrough!!!

Given that both the sender (collector-test) and the receiver (collector-prod) work just fine during this time (collector-prod is receiving and confirming tons of messages), it seems that there might be something wrong with this particular connection.

We have an Amazon Elastic Load Balancer (ELB) in front of collector-prod, so the TCP connection from collector-test goes to the ELB, which in turn connects to collector-prod.

I changed collector-test to send RELP messages directly to collector-prod, and it works now. It has been running since yesterday and there hasn't been a single silence period.

Not sure what's happening, but from other investigations I noticed that ELB seems to pretend everything went fine even when it does not deliver messages anywhere (which is why we switched to RELP). It seems that ELB keeps receiving RELP messages from collector-test but does not pass them on to collector-prod. At some point collector-test reopens the connection (not sure whether it's because ELB closes the connection or because rsyslog simply decides enough is enough and a new connection is needed), and it starts working again.

I guess 5 minutes of no communication is enough for ELB to time out something, since this problem does not happen when the bursts are only 1 minute apart.

I will run another test with ELB added back, but this time sending 1 message/minute to keep the connection alive. That should confirm that it is an ELB-related timeout.

Would it make sense for rsyslog to reopen the connection after timeouts? It seems that it receives several timeouts but keeps using the same connection. Given that we already have a timeout, the cost of reopening the connection is negligible, and it might resolve various tcp-is-stuck issues (like this one).

As you mentioned before, there seems to be some kind of bug in rsyslog that makes outbound RELP problems also stop inbound RELP communication, even though the two are not (logically) related (in our test scenario, inbound RELP goes to omfile, which should not be affected by omrelp).

  Thoughts? Any ideas how to work around the problem?

It looks like the reason the outbound RELP problems stop the inbound RELP is that the main queue gets large enough to trigger blocking (the soft-delay stuff we were talking about a day or so ago).

the normal way that rsyslog works is that all inbound messages go to the main queue, and then from there they get filtered and sent to the outputs.

If one output is blocked, messages cannot get processed from the main queue and so it fills up.

If you define a set of rules as a separate ruleset, and define an input as using that ruleset, it creates a second 'main queue' for that ruleset (you may have to explicitly define a queue for the ruleset, I'm not completely sure). At that point, that input and the outputs of that ruleset are completely independent of the other inputs and outputs.
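For instance (a sketch with hypothetical names, port, and file path, not tested), binding the inbound RELP listener to its own ruleset with its own queue would look something like:

  module(load="imrelp")

  # give inbound RELP its own ruleset and queue, so it keeps flowing
  # even while the main queue is blocked by a stuck omrelp output
  ruleset(name="relp_in" queue.type="linkedList") {
    action(type="omfile" file="/var/log/relp-in.log")
  }

  input(type="imrelp" port="2514" ruleset="relp_in")

Messages arriving on that input never touch the main queue, so a suspended omrelp action elsewhere cannot delay them.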

David Lang
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.
