On 11/27/2013 05:22 PM, David Lang wrote:
On Wed, 27 Nov 2013, Erik Steffl wrote:

On 11/26/2013 07:06 AM, Pavel Levshin wrote:

There may be multiple flaws, unfortunately. But one of them is
definitely in omrepl/librelp client side. Look below, in your debug log,
main queue thread is blocked many times for significant periods of time,
and this happens in omrelp, when it tries to establish a connection.

One suggestion: try to attach asynchronous queue with one thread to
action omrelp; this way, it will be more transparent to analyze, and it
will appear as a separate line in statistics. Obviously, it will not fix
your problem.

  Not sure what exactly you mean. Here's my action definition:

  action(
    type="omrelp"
    target="hostname"
    port="5140"
    template="json"
    queue.filename="json"
    queue.maxdiskspace="75161927680" # 70GB (valuable data)
  )

  From what I read (plus some discussion on this mailing list) I was
under the impression that specifying queue.filename creates a queue for
this action.

actually, defining a queue type is what creates a queue; defining a
queue filename converts that queue to a disk-supported queue.

looking at http://www.rsyslog.com/doc/queues.html it seems that in the
above case (I want a disk-assisted in-memory queue) I should use
queue.type="LinkedList" (or possibly "FixedArray", but not "Direct" or
"Disk").
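For reference, combining Pavel's suggestion (an asynchronous action queue with one worker thread) with the existing action might look like the sketch below. The queue.type and queue.workerthreads values are my assumptions for illustration; the rest comes from the action definition above.

```
action(
  type="omrelp"
  target="hostname"
  port="5140"
  template="json"
  queue.type="LinkedList"          # async in-memory queue (disk-assisted)
  queue.filename="json"            # enables spooling to disk when needed
  queue.maxdiskspace="75161927680" # 70GB (valuable data)
  queue.workerthreads="1"          # one worker, shows up separately in stats
)
```

With its own queue, a stalled omrelp connection drains from this queue instead of blocking the main queue thread directly.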

And now the log of the RELP client's connection attempts. This is the
first message going to omrelp, and it fails:

6501.183080355:7fe205cce700: tryDoAction 0x2676750, pnElem 1, nElem 1
6501.183089317:7fe205cce700: omrelp: beginTransaction
6501.199774684:7fe205cce700: frame to send: '1 open 85 relp_version=0
6501.199832179:7fe205cce700: ADD sess 0x7fe1fc006c00 unacked 1,
sessState 1
6501.199843299:7fe205cce700: sendbuf added to unacked list
6501.199856108:7fe205cce700: relp session read -1 octets, buf 'рTМ♣в⌂'
6501.199866355:7fe205cce700: relpSessWaitRsp waiting for data on fd 18,
timeout 90.0
*blocked here*
6560.213560747:7fe205cce700: relpSessWaitRsp select returns, nfds 1,
errno 11
6560.213581624:7fe205cce700: relp session read 0 octets, buf ''
6560.213590296:7fe205cce700: server closed relp session 0x7fe1fc006c00,
session broken
6560.213598704:7fe205cce700: relpSessWaitState returns 10007
6560.213607411:7fe205cce700: end relpSessConnect, iRet 10007
6560.213659199:7fe205cce700: in destructor: sendbuf 0x7fe1fc07b630
6560.213673384:7fe205cce700: Action 0x2676750 transitioned to state:
rtry
6560.213691021:7fe205cce700: action 0x2676750 call returned -2007
6560.213700237:7fe205cce700: tryDoAction: unexpected error code
-2007[nElem 1, Commited UpTo 0], finalizing
6560.213708275:7fe205cce700: action ret RS_RET_SUSPENDED - retry full
batch
6560.213716346:7fe205cce700: tryDoAction 0x2676750, pnElem 1, nElem 1
6560.213726514:7fe205cce700: actionDoRetry: enter loop, iRetries=0
6560.215546195:7fe205cce700: frame to send: '1 open 85 relp_version=0
6560.215581986:7fe205cce700: ADD sess 0x7fe1fc001d60 unacked 1,
sessState 1
6560.215591846:7fe205cce700: sendbuf added to unacked list
6560.215603620:7fe205cce700: relp session read -1 octets, buf '°TМ♣в⌂'
6560.215618450:7fe205cce700: relpSessWaitRsp waiting for data on fd 18,
timeout 90.0
*blocked here*
6620.213583492:7fe205cce700: relpSessWaitRsp select returns, nfds 1,
errno 11
6620.213604912:7fe205cce700: relp session read 0 octets, buf ''
6620.213613612:7fe205cce700: server closed relp session 0x7fe1fc001d60,
session broken
...
6620.301584149:7fe205cce700: tryDoAction 0x2676750, pnElem 202, nElem
202
6620.301594581:7fe205cce700: actionTryResume: action 0x2676750 state:
susp, next retry (if applicable): 1385086650 [now 1385086620]
6620.301603383:7fe205cce700: omrelp: shutdownImmediate ptr now is
0x26754b8
6620.301611647:7fe205cce700: action 0x2676750 call returned -2123
6620.301619801:7fe205cce700: tryDoAction: unexpected error code
-2123[nElem 202, Commited UpTo 0], finalizing
6620.301634823:7fe205cce700: actionTryResume: action 0x2676750 state:
susp, next retry (if applicable): 1385086650 [now 1385086620]
6620.301645472:7fe205cce700: scriptExec: batch of 202 elements, active
0x7fe1fc0ad6f0, active[0]:0
6620.301653516:7fe205cce700:     ACTION 0x26770d0
[omrelp:action(type="omrelp" ...)]
...
6801.440540687:7fe205cce700: tryDoAction 0x2676750, pnElem 15, nElem 15
6801.440550455:7fe205cce700: Action 0x2676750 transitioned to state:
rtry
6801.440558423:7fe205cce700: actionDoRetry: enter loop, iRetries=0
6801.451531113:7fe205cce700: frame to send: '1 open 85 relp_version=0
6801.451558856:7fe205cce700: ADD sess 0x7fe1fc082e90 unacked 1,
sessState 1
6801.451577349:7fe205cce700: sendbuf added to unacked list
6801.451595362:7fe205cce700: relp session read -1 octets, buf '°TМ♣в⌂'
6801.451606358:7fe205cce700: relpSessWaitRsp waiting for data on fd 18,
timeout 90.0
*blocked here*
6861.228866968:7fe205cce700: relpSessWaitRsp select returns, nfds 1,
errno 11
6861.228899197:7fe205cce700: relp session read 0 octets, buf ''
6861.228913704:7fe205cce700: server closed relp session 0x7fe1fc082e90,
session broken
6861.228927199:7fe205cce700: relpSessWaitState returns 10007
6861.228940752:7fe205cce700: end relpSessConnect, iRet 10007
6861.229011547:7fe205cce700: in destructor: sendbuf 0x7fe1fc1722b0
6861.229030571:7fe205cce700: actionDoRetry: action->tryResume returned
-2007
6861.229053075:7fe205cce700: actionDoRetry: check for max retries,
iResumeRetryCount 0, iRetries 0
6861.229069635:7fe205cce700: Action 0x2676750 transitioned to state:
susp
6861.229082444:7fe205cce700: action suspended, earliest retry=1385086891
6861.229097179:7fe205cce700: action 0x2676750 call returned -2007
6861.229110552:7fe205cce700: tryDoAction: unexpected error code
-2007[nElem 15, Commited UpTo 0], finalizing
...
7101.658771671:7fe205cce700: tryDoAction 0x2676750, pnElem 2, nElem 2
7101.658781284:7fe205cce700: Action 0x2676750 transitioned to state:
rtry
7101.658883159:7fe205cce700: actionDoRetry: enter loop, iRetries=0
7101.669676285:7fe205cce700: frame to send: '1 open 85 relp_version=0

The previous RELP session has finally died, and now a new one is initiated.

7101.669794405:7fe205cce700: ADD sess 0x7fe1fc006d70 unacked 1,
sessState 1
7101.669804532:7fe205cce700: sendbuf added to unacked list
7101.669817376:7fe205cce700: relp session read -1 octets, buf '°TМ♣в⌂'
7101.669834009:7fe205cce700: relpSessWaitRsp waiting for data on fd 17,
timeout 90.0
7103.444033377:7fe205cce700: relpSessWaitRsp select returns, nfds 1,
errno 11

and here, for the first time, there is a response from the RELP server!
10 minutes after the first try. And right after errno 11? This does not
make sense to me.

7103.444060932:7fe205cce700: relp session read 102 octets, buf '1 rsp 92
200 OK
7103.444074603:7fe205cce700: relp engine is dispatching frame with
command 'rsp'
7103.444084099:7fe205cce700: in rsp command handler, txnr 1, code 200,
text 'OK'
7103.444093234:7fe205cce700: DEL sess 0x7fe1fc006d70 unacked 0,
sessState 2

  Thanks!!! Major breakthrough!!!

  Given that both the sender (collector-test) and the receiver
(collector-prod) work just fine during this time (collector-prod is
receiving and confirming tons of messages), it seems that there might
be something wrong with this particular connection.

  We have an Amazon Elastic Load Balancer (ELB) in front of
collector-prod, so the TCP connection from collector-test goes to the
ELB, which connects to collector-prod.

  I changed collector-test to send RELP messages directly to
collector-prod and it works now. It has been running since yesterday
and there wasn't a single silence period.

  Not sure what's happening, but from other investigations I noticed
that ELB seems to pretend everything went fine even when it does not
deliver messages anywhere (which is why we switched to RELP). It seems
that ELB keeps receiving RELP messages from collector-test but does
not pass them on to collector-prod. At some point collector-test
reopens the connection (not sure if it's because ELB closes the
connection or rsyslog simply decides enough is enough and a new
connection is needed) and it starts working again.

  I guess 5 minutes of no communication is enough for ELB to time out
something, since this problem does not happen if the bursts are only 1
minute apart.

  Will run another test, adding ELB back but also sending 1
message/minute to keep the connection alive. That should confirm that
it's an ELB-related timeout.
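For the keepalive test, something as simple as a cron entry on collector-test could generate the heartbeat (a hypothetical sketch; the facility/priority must match whatever the omrelp action actually forwards):

```
# crontab on collector-test: emit one syslog message per minute
# via imuxsock so it flows through the omrelp action to the ELB
* * * * * /usr/bin/logger -p local0.info "ELB keepalive heartbeat"
```

If the silence periods disappear with the heartbeat in place, that strongly points at an ELB idle-connection timeout.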

  Would it make sense for rsyslog to reopen the connection after
timeouts? It seems that it receives several timeouts but keeps using
the same connection. Given that we already hit the timeout, the cost
of reopening the connection is negligible and it might resolve various
tcp-is-stuck issues (like this one).

  As you mentioned before, there seems to be some kind of bug in
rsyslog that makes outbound RELP problems also stop inbound RELP
communication, even though they are not (logically) related (in our
test scenario, inbound RELP goes to omfile, which should not be
affected by omrelp).

  Thoughts? Any ideas how to work around the problem?

It looks like the reason the outbound RELP problems stop the inbound
RELP is that the main queue gets large enough to trigger blocking (the
soft delay stuff we were talking about a day or so ago).

the normal way that rsyslog works is that all inbound messages go to the
main queue, and then from there they get filtered and sent to the outputs.

If one output is blocked, messages cannot get processed from the main
queue and so it fills up.

If you define a set of rules as a different ruleset, and define an input
as using that ruleset, it creates a second 'main queue' for that ruleset
(you may have to define a queue for the ruleset, I'm not completely
sure). At that point, that input and the outputs of that ruleset are
completely independent of the other inputs and outputs.
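As a sketch, a ruleset with its own queue could look like the following. The assumption here (worth verifying against your rsyslog version's documentation) is that queue.* parameters are accepted directly on the ruleset() statement; the size value is purely illustrative.

```
# give the ruleset its own queue so this input/output pair is
# decoupled from the main queue and from other actions
ruleset(name="collector"
        queue.type="LinkedList"
        queue.size="100000") {
  # ... existing filters and actions ...
}
```

With this in place, messages arriving via imrelp are enqueued on the ruleset's queue rather than the main queue, so a blocked omrelp action elsewhere should not stall them.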

Makes sense, but I'm not sure how to implement it. We use imuxsock to omrelp and imrelp to omfile, so these two should be independent.

Here's how the imrelp to omfile path is defined (I left out some variable setup etc. for brevity):

ruleset(name="collector") {
  if prifilt("local0.*") then {
    action(type="mmjsonparse")
    if $parsesuccess == "OK" then {
      action(
        type="omfile"
        DynaFile="jsonFilename"
        Template="jsonFormat"
      )
    } else { ... }
    stop
  }
}

module(load="imrelp" ruleset="collector")
input(type="imrelp" port="5140")

So I already have a ruleset; should I add the queue definition to the action? Or to the input? Or did you mean something else?

  thanks!

        erik

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.
