On Wed, 27 Nov 2013, Erik Steffl wrote:

On 11/27/2013 05:22 PM, David Lang wrote:
On Wed, 27 Nov 2013, Erik Steffl wrote:

On 11/26/2013 07:06 AM, Pavel Levshin wrote:

There may be multiple flaws, unfortunately. But one of them is
definitely in omrelp/librelp client side. Look below, in your debug log,
main queue thread is blocked many times for significant periods of time,
and this happens in omrelp, when it tries to establish a connection.

One suggestion: try to attach asynchronous queue with one thread to
action omrelp; this way, it will be more transparent to analyze, and it
will appear as a separate line in statistics. Obviously, it will not fix
your problem.

  Not sure what exactly you mean. Here's my action definition:

  action(
    type="omrelp"
    target="hostname"
    port="5140"
    template="json"
    queue.filename="json"
    queue.maxdiskspace="75161927680" # 70GB (valuable data)
  )

  From what I read (plus some discussion on this mailing list) I was
under impression that specifying queue.filename creates a queue for
this action.

actually, defining a queue type is what creates a queue. defining a queue
filename converts that queue to a disk-assisted queue.

looking at http://www.rsyslog.com/doc/queues.html it seems that in the above case (want a disk assisted in-memory queue) I should use queue.type="LinkedList"? (or possibly FixedArray but not Direct or Disk).

correct, you need to add a queue.type to create a queue for this, not just queue.filename.
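
A minimal sketch of the action definition quoted above with a queue type added, assuming LinkedList is the in-memory type wanted here. Only the queue.type line is new; every other parameter is copied unchanged from the original definition.

```
action(
  type="omrelp"
  target="hostname"
  port="5140"
  template="json"
  queue.type="LinkedList"            # new: this is what creates the action queue
  queue.filename="json"              # makes that queue disk-assisted
  queue.maxdiskspace="75161927680"   # 70GB (valuable data)
)
```

With queue.type set, the action should run on its own queue worker and show up as a separate line in the statistics, which is what the earlier suggestion was after.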


  Both sender (collector-test) and receiver (collector-prod) work just
fine during this time, and collector-prod is receiving and confirming
tons of messages, so it seems that there might be something wrong with
this particular connection.

  We have an amazon elastic load balancer (ELB) in front of
collector-prod so the TCP connection from collector-test goes to ELB
which connects to collector-prod.

  I changed collector-test to send RELP messages directly to
collector-prod and it works now. It has run since yesterday and there
hasn't been a single silence period.

  Not sure what's happening but from other investigations I noticed
that ELB seems to pretend everything went fine even when it does not
deliver messages anywhere (which is why we switched to RELP). It seems
that ELB keeps receiving RELP messages from collector-test but it does
not pass them to collector-prod. At some point collector-test reopens
connections (not sure if it's because ELB closes connection or rsyslog
simply decided enough is enough and new connection is needed) and it
starts working again.

  I guess 5 minutes of no communication is enough for ELB to timeout
something since this problem does not happen if the bursts are only 1
minute apart.

  Will run another test, adding ELB back but also sending 1
message/minute to keep the connection alive. That should confirm that
it's an ELB-related timeout.

  Would it make sense for rsyslog to reopen the connection after
timeouts? It seems that it receives several timeouts but keeps
using the same connection. Given that we already have a timeout, the
cost of reopening the connection is negligible and it might resolve
various tcp-is-stuck issues (like this one).

  As you mentioned before there seems to be some kind of bug in
rsyslog that makes the outbound RELP problems also stop inbound RELP
communication, even if they are not (logically) related (in our test
scenario inbound RELP goes to omfile, which should not be affected by
omrelp).

  Thoughts? Any ideas how to work around the problem?

It looks like the outbound RELP problems stop the inbound RELP because
the main queue gets large enough to trigger blocking (the soft delay
stuff we were talking about a day or so ago).

the normal way that rsyslog works is that all inbound messages go to the
main queue, and then from there they get filtered and sent to the outputs.

If one output is blocked, messages cannot get processed from the main
queue and so it fills up.

If you define a set of rules as a different ruleset, and define an input
as using that ruleset, it creates a second 'main queue' for that ruleset
(you may have to define a queue for the ruleset, I'm not completely
sure). At that point that input and the outputs of that ruleset are
completely independent of the other inputs and outputs.

makes sense but not sure how to implement it. We use imuxsock to omrelp and imrelp to omfile so these two should be independent.

Here's how the imrelp to omfile is defined (left out some variable setup etc. for brevity):

ruleset(name="collector") {
  if prifilt("local0.*") then {
    action(type="mmjsonparse")
    if $parsesuccess == "OK" then {
      action(
        type="omfile"
        DynaFile="jsonFilename"
        Template="jsonFormat"
      )
    } else { ... }
    stop
  }
}

module(load="imrelp" ruleset="collector")
input(type="imrelp" port="5140")

So I already have a ruleset, should I add the queue definition to action? Or to input? Or did you mean something else?

I think you need to define a queue as part of the ruleset.
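
As a rough sketch, assuming the ruleset() statement accepts the same queue.* parameters as an action and that LinkedList is a reasonable type for it, the ruleset quoted above might become the following. Only the queue.type parameter is new; the rules, including the elided else branch, are as in the original.

```
ruleset(
  name="collector"
  queue.type="LinkedList"   # assumption: gives this ruleset its own queue
) {
  if prifilt("local0.*") then {
    action(type="mmjsonparse")
    if $parsesuccess == "OK" then {
      action(
        type="omfile"
        DynaFile="jsonFilename"
        Template="jsonFormat"
      )
    } else { ... }
    stop
  }
}

module(load="imrelp" ruleset="collector")
input(type="imrelp" port="5140")
```

The idea is that messages arriving via imrelp are then queued and processed by this ruleset's own queue, so a stalled omrelp action on the main queue should no longer hold them up.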

After you do this, you should be able to see these threads separately when you run top (after hitting 'H' to change the display to per-thread) and the pstats output should show a separate queue for this ruleset.
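
If impstats is not already loaded, something like the following should produce the pstats output mentioned above. This is a sketch only: the interval and the log file path are made-up values for illustration, and the module normally has to be loaded near the top of the configuration.

```
module(
  load="impstats"
  interval="60"                           # seconds between reports (assumed value)
  severity="7"
  log.syslog="off"                        # keep the counters out of the normal message flow
  log.file="/var/log/rsyslog-stats.log"   # hypothetical path
)
```

Writing the counters to a separate file rather than through syslog keeps them readable even while the main queue is blocked.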

David Lang