26.11.2013 1:20, Erik Steffl:

what does the above mean exactly? It does make sense, in each case the burst of messages gets in via /dev/log then is send out via RELP (either to the same machine or a different one).

Any ideas why this doesn't fix itself until the next burst of messages? Any suggestions on what to do or what to investigate next? I guess I could run these with strace and see what exactly they say to each other (both the sender and receiver).


I'm not sure why (and if) this is 'fixed' by the next burst of messages, but this can be somehow because next portion of messages pushes the queue. Perhaps it is unable to retry after suspend without a push.

There may be multiple flaws, unfortunately. But one of them is definitely in omrepl/librelp client side. Look below, in your debug log, main queue thread is blocked many times for significant periods of time, and this happens in omrelp, when it tries to establish a connection.

One suggestion: try to attach asynchronous queue with one thread to action omrelp; this way, it will be more transparent to analyze, and it will appear as a separate line in statistics. Obviously, it will not fix your problem.

Second, you have an action which tries to log to /dev/xconsole. It does not work, and this can affect overall performance. Try to remove this action.


And now the log of RELP client tries. This is the first message going to omrelp, and it fails:

6501.183080355:7fe205cce700: tryDoAction 0x2676750, pnElem 1, nElem 1
6501.183089317:7fe205cce700: omrelp: beginTransaction
6501.199774684:7fe205cce700: frame to send: '1 open 85 relp_version=0
6501.199832179:7fe205cce700: ADD sess 0x7fe1fc006c00 unacked 1, sessState 1
6501.199843299:7fe205cce700: sendbuf added to unacked list
6501.199856108:7fe205cce700: relp session read -1 octets, buf 'рTМ♣в⌂'
6501.199866355:7fe205cce700: relpSessWaitRsp waiting for data on fd 18, timeout 90.0
*blocked here*
6560.213560747:7fe205cce700: relpSessWaitRsp select returns, nfds 1, errno 11
6560.213581624:7fe205cce700: relp session read 0 octets, buf ''
6560.213590296:7fe205cce700: server closed relp session 0x7fe1fc006c00, session broken
6560.213598704:7fe205cce700: relpSessWaitState returns 10007
6560.213607411:7fe205cce700: end relpSessConnect, iRet 10007
6560.213659199:7fe205cce700: in destructor: sendbuf 0x7fe1fc07b630
6560.213673384:7fe205cce700: Action 0x2676750 transitioned to state: rtry
6560.213691021:7fe205cce700: action 0x2676750 call returned -2007
6560.213700237:7fe205cce700: tryDoAction: unexpected error code -2007[nElem 1, Commited UpTo 0], finalizing
6560.213708275:7fe205cce700: action ret RS_RET_SUSPENDED - retry full batch
6560.213716346:7fe205cce700: tryDoAction 0x2676750, pnElem 1, nElem 1
6560.213726514:7fe205cce700: actionDoRetry: enter loop, iRetries=0
6560.215546195:7fe205cce700: frame to send: '1 open 85 relp_version=0
6560.215581986:7fe205cce700: ADD sess 0x7fe1fc001d60 unacked 1, sessState 1
6560.215591846:7fe205cce700: sendbuf added to unacked list
6560.215603620:7fe205cce700: relp session read -1 octets, buf '°TМ♣в⌂'
6560.215618450:7fe205cce700: relpSessWaitRsp waiting for data on fd 18, timeout 90.0
*blocked here*
6620.213583492:7fe205cce700: relpSessWaitRsp select returns, nfds 1, errno 11
6620.213604912:7fe205cce700: relp session read 0 octets, buf ''
6620.213613612:7fe205cce700: server closed relp session 0x7fe1fc001d60, session broken
...
6620.301584149:7fe205cce700: tryDoAction 0x2676750, pnElem 202, nElem 202
6620.301594581:7fe205cce700: actionTryResume: action 0x2676750 state: susp, next retry (if applicable): 1385086650 [now 1385086620]
6620.301603383:7fe205cce700: omrelp: shutdownImmediate ptr now is 0x26754b8
6620.301611647:7fe205cce700: action 0x2676750 call returned -2123
6620.301619801:7fe205cce700: tryDoAction: unexpected error code -2123[nElem 202, Commited UpTo 0], finalizing 6620.301634823:7fe205cce700: actionTryResume: action 0x2676750 state: susp, next retry (if applicable): 1385086650 [now 1385086620] 6620.301645472:7fe205cce700: scriptExec: batch of 202 elements, active 0x7fe1fc0ad6f0, active[0]:0 6620.301653516:7fe205cce700: ACTION 0x26770d0 [omrelp:action(type="omrelp" ...)]
...
6801.440540687:7fe205cce700: tryDoAction 0x2676750, pnElem 15, nElem 15
6801.440550455:7fe205cce700: Action 0x2676750 transitioned to state: rtry
6801.440558423:7fe205cce700: actionDoRetry: enter loop, iRetries=0
6801.451531113:7fe205cce700: frame to send: '1 open 85 relp_version=0
6801.451558856:7fe205cce700: ADD sess 0x7fe1fc082e90 unacked 1, sessState 1
6801.451577349:7fe205cce700: sendbuf added to unacked list
6801.451595362:7fe205cce700: relp session read -1 octets, buf '°TМ♣в⌂'
6801.451606358:7fe205cce700: relpSessWaitRsp waiting for data on fd 18, timeout 90.0
*blocked here*
6861.228866968:7fe205cce700: relpSessWaitRsp select returns, nfds 1, errno 11
6861.228899197:7fe205cce700: relp session read 0 octets, buf ''
6861.228913704:7fe205cce700: server closed relp session 0x7fe1fc082e90, session broken
6861.228927199:7fe205cce700: relpSessWaitState returns 10007
6861.228940752:7fe205cce700: end relpSessConnect, iRet 10007
6861.229011547:7fe205cce700: in destructor: sendbuf 0x7fe1fc1722b0
6861.229030571:7fe205cce700: actionDoRetry: action->tryResume returned -2007
6861.229053075:7fe205cce700: actionDoRetry: check for max retries, iResumeRetryCount 0, iRetries 0
6861.229069635:7fe205cce700: Action 0x2676750 transitioned to state: susp
6861.229082444:7fe205cce700: action suspended, earliest retry=1385086891
6861.229097179:7fe205cce700: action 0x2676750 call returned -2007
6861.229110552:7fe205cce700: tryDoAction: unexpected error code -2007[nElem 15, Commited UpTo 0], finalizing
...
7101.658771671:7fe205cce700: tryDoAction 0x2676750, pnElem 2, nElem 2
7101.658781284:7fe205cce700: Action 0x2676750 transitioned to state: rtry
7101.658883159:7fe205cce700: actionDoRetry: enter loop, iRetries=0
7101.669676285:7fe205cce700: frame to send: '1 open 85 relp_version=0

previous RELP session has died at last, and now, new one is initiated.

7101.669794405:7fe205cce700: ADD sess 0x7fe1fc006d70 unacked 1, sessState 1
7101.669804532:7fe205cce700: sendbuf added to unacked list
7101.669817376:7fe205cce700: relp session read -1 octets, buf '°TМ♣в⌂'
7101.669834009:7fe205cce700: relpSessWaitRsp waiting for data on fd 17, timeout 90.0 7103.444033377:7fe205cce700: relpSessWaitRsp select returns, nfds 1, errno 11

and here, for the first time, there is a response from RELP server! 10 minutes after the first try. And right after errno 11? This does not make sense for me.

7103.444060932:7fe205cce700: relp session read 102 octets, buf '1 rsp 92 200 OK 7103.444074603:7fe205cce700: relp engine is dispatching frame with command 'rsp' 7103.444084099:7fe205cce700: in rsp command handler, txnr 1, code 200, text 'OK'
7103.444093234:7fe205cce700: DEL sess 0x7fe1fc006d70 unacked 0, sessState 2


--
Pavel Levshin

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.

Reply via email to