On 11/29/2013 01:48 AM, Rainer Gerhards wrote:
On Thu, Nov 28, 2013 at 2:07 AM, Erik Steffl <[email protected]> wrote:

On 11/26/2013 07:06 AM, Pavel Levshin wrote:


26.11.2013 1:20, Erik Steffl:


   what does the above mean exactly? It does make sense, in each case
the burst of messages gets in via /dev/log then is sent out via RELP
(either to the same machine or a different one).

   Any ideas why this doesn't fix itself until the next burst of
messages? Any suggestions on what to do or what to investigate next? I
guess I could run these with strace and see what exactly they say to
each other (both the sender and receiver).


I'm not sure why (or whether) this is 'fixed' by the next burst of messages,
but it may be because the next batch of messages pushes the queue.
Perhaps it is unable to retry after a suspension without such a push.


   not sure if 'fixed' is the right word but it always gets unstuck right after
the next burst of messages (sometimes the one after that), never at a random time.
This behaviour is the same whether the period is 5 min or 15 min.


While I have been silent, I followed the ML discussion (I did not check the
debug log further, though, as Pavel did excellent work here). To me, it
looks like "normal" suspension code is kicking in. If an action fails,
rsyslog retries once (unless otherwise configured) and if it fails again,
the action is initially suspended for 30 seconds. Then, retries happen and
the suspension period is prolonged if they fail.
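
To make the mechanism described above concrete, here is a rough Python model of it. This is only a sketch and not rsyslog's actual code: the names `ActionState` and `attempt` are hypothetical, and the rule for how much the suspension grows (a multiple of the initial 30 seconds) is an assumption, since the description only says the period is prolonged.

```python
from dataclasses import dataclass

INITIAL_SUSPEND_S = 30  # initial suspension period, as described above


@dataclass
class ActionState:
    """Hypothetical model of one output action (not rsyslog's real structure)."""
    suspended_until: float = 0.0  # no delivery attempt is made before this time
    failed_resumes: int = 0       # consecutive failures that led to a suspension


def attempt(state, now, try_send, retries=1):
    """Try to deliver at time `now`; return True if delivery succeeded.

    `try_send` is a callable returning True on success. `retries=1` models
    "retries once" before the action is suspended.
    """
    if now < state.suspended_until:
        return False  # still suspended: the destination is not contacted

    for _ in range(1 + retries):
        if try_send():
            state.failed_resumes = 0  # action is resumed
            return True

    # All attempts failed: suspend, and make each suspension longer.
    # ASSUMPTION: linear growth; the real growth rule is not given here.
    state.failed_resumes += 1
    state.suspended_until = now + INITIAL_SUSPEND_S * state.failed_resumes
    return False
```

Note that in this model nothing happens until `attempt` is called, so a suspended action is only retried when something triggers a delivery. That would be consistent with Pavel's guess that a retry may need a push from new messages, but it is a property of the sketch, not a confirmed property of rsyslog.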

It looks very much like this is the mechanism at work. However, what I
don't understand is why the suspension period is so long.

Out of all this, I think it would make much sense if rsyslog had the
capability to report when an action is suspended and when it is resumed. I
am right now adding this capability. I would suggest that when this change
is ready, you apply it and we can then see what it reports (much easier
than walking the debug log, and very obvious to users ;)).

As I mentioned, I made two changes to the test scenario and tested each of them separately:

- send the 200-message burst directly from collector-test to collector-prod (RELP), i.e. no load balancer

- keep using the load balancer but, in addition to the 200-message burst every 5 min, also send a few (3) messages every minute

Each of these scenarios works, as in traffic is smooth, with NO silences. This means that something goes wrong only when we use the load balancer and there is no traffic for 5 minutes.
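
The difference between the failing setup and the second scenario comes down to how long the connection sits idle. A small Python sketch of the two traffic patterns, with hypothetical helper names (`traffic_schedule`, `longest_idle_gap`) and assuming both the bursts and the trickle start at second 0:

```python
def traffic_schedule(duration_s, burst_size=200, burst_period_s=300,
                     trickle_size=0, trickle_period_s=60):
    """Return sorted (second, message_count) send events for a test run."""
    events = {}
    # One burst every burst_period_s seconds (200 messages every 5 min).
    for t in range(0, duration_s, burst_period_s):
        events[t] = events.get(t, 0) + burst_size
    # Optional trickle (3 messages every minute in the second scenario).
    if trickle_size:
        for t in range(0, duration_s, trickle_period_s):
            events[t] = events.get(t, 0) + trickle_size
    return sorted(events.items())


def longest_idle_gap(events, duration_s):
    """Longest stretch, in seconds, during which nothing is sent."""
    times = [t for t, _ in events] + [duration_s]
    return max(later - earlier for earlier, later in zip(times, times[1:]))


bursts_only = traffic_schedule(900)
with_trickle = traffic_schedule(900, trickle_size=3)
print(longest_idle_gap(bursts_only, 900))   # idle gap with bursts only
print(longest_idle_gap(with_trickle, 900))  # idle gap with the trickle added
```

With bursts only, the connection is idle for the full 5 minutes between bursts; with the trickle, it is never idle for more than a minute.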

From our previous investigation of the Amazon Elastic Load Balancer (which is what we're using), it often lies about the connection, i.e. it has no connection on the backend but it happily accepts connections and data and pretends everything is fine (that's why we initially switched from plain TCP to RELP).

I'm not entirely sure what the load balancer is doing in this case, but it seems that rsyslog thinks the connection is fine and keeps sending data to the load balancer while the connection is actually broken (maybe the load balancer closed the connection between itself and the destination because of some timeout).

So the situation is not fixed until rsyslog closes the connection and opens a new one.

Is there any way for us to make rsyslog either keep the connection alive or re-open it sooner?
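
For reference, this is what "keep the connection alive" looks like at the OS level, as a Python sketch. It shows generic TCP keepalive socket options, not an rsyslog or RELP setting; the function name `enable_keepalive` and the timing values are made up for illustration, and the `TCP_KEEP*` options are platform-specific (present on Linux), hence the `hasattr` guards.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=3):
    """Turn on TCP keepalive probes for an existing socket.

    idle_s:     seconds of silence before the first probe is sent
    interval_s: seconds between unanswered probes
    probes:     unanswered probes before the connection is declared dead
    (All three values are illustrative, not recommendations.)
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket tuning is not available on every platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock
```

Keepalive probes only exercise the connection between the sender and whatever terminates it, which here would be the load balancer. Whether the load balancer treats the probes as activity, or whether they help at all when the broken leg is between the load balancer and the backend, is not something established in this thread.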

        erik
