[rsyslog] Multiple-server failover questions (and round-robin DNS resolution, too)

Ryan Lynch Tue, 08 Dec 2009 17:19:26 -0800

In the doc example, the server "primary-syslog.example.com" is always
tried first, unless it's down, followed by
"secondary-1-syslog.example.com", and then finally
"secondary-2-syslog.example.com" only if both of the others are down.
Here's the doc in question, BTW:


 * http://wiki.rsyslog.com/index.php/FailoverSyslogServer

1.  Is it possible to configure Rsyslog to pick randomly from a list
of remote syslog destinations instead of following the listed order?
If I have a large population of log generating hosts, I'd like to be
able to distribute the total load amongst two or more "central" log
receivers, with failover if a receiver dies. Assuming that we have a
sound random-selection implementation, and a large enough population
of hosts generating approximately equal amounts of log data, I should
get an approximately even distribution across my receiver hosts. In
the event that a receiver dies, its clients would randomly try another
receiver, keeping the load close to even. Is this possible, with the
existing failover mechanism?

2.  How does Rsyslog handle a round-robin DNS hostname (i.e., a single
name with multiple A records, so it resolves to multiple IP addresses),
if that hostname is a remote log destination? Does it just call
gethostbyname()?

3.  If Rsyslog does just call gethostbyname(), how does it handle
multiple occurrences of the same hostname in the config file? Does it
call gethostbyname() once for each instance of a duplicate name, or
does it perform a single lookup and use the same result for all?

4.  When and how does the failover code check for remote destination
failures? Does it re-check failed servers after they're declared
"failed"? If so, does Rsyslog re-check a failed server every time it
sends a log message, or does it have some kind of polling timeout, or
is the check interval determined by something else entirely? When a
server fails, how long does it take Rsyslog to notice the failure and
start re-directing messages to the next host? When a server recovers,
does Rsyslog automatically notice the recovery, and if so, how long
does it take to start re-directing messages to the primary host?
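
As a rough sanity check on the even-distribution assumption in question 1,
here is a minimal Python sketch. It is not Rsyslog behavior, only a
simulation under stated assumptions: every client picks uniformly at
random, all clients generate equal load, and only the clients of a failed
receiver pick again. The function name simulate_failover and the client
count are made up for illustration.

```python
import random
from collections import Counter

def simulate_failover(num_clients, receivers, failed, seed=0):
    """Return (load before failure, load after failure) as Counters.

    Assumption: each client picks one receiver uniformly at random, and
    when `failed` dies only its clients re-pick among the survivors.
    """
    rng = random.Random(seed)
    # Initial random assignment of every client to a receiver.
    initial = [rng.choice(receivers) for _ in range(num_clients)]
    survivors = [r for r in receivers if r != failed]
    # Clients of the failed receiver pick again; everyone else stays put.
    after = [rng.choice(survivors) if r == failed else r for r in initial]
    return Counter(initial), Counter(after)

receivers = ["log-alfa", "log-bravo", "log-charlie"]
before, after = simulate_failover(3000, receivers, failed="log-bravo")
print("before:", dict(before))
print("after: ", dict(after))
```

With equal per-client load, each receiver should end up with roughly a
third of the clients before the failure and roughly half afterwards.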

In case anybody's wondering, I've included some examples of what I'm
thinking of trying, here. Assume that there are three (3) central log
receiver hosts, each with a unique DNS name ('log-alfa',
'log-bravo', and 'log-charlie') that points to its own IP. Also,
assume that the round-robin DNS name 'log' points to all three hosts'
IP addresses, and that all of my log-generating hosts (the clients)
are configured to pick a random IP when calling gethostbyname() on a
round-robin name.

    *.* @@log
    $ActionExecOnlyWhenPreviousIsSuspended on
    & @@log
    & @@log
    & /var/log/localbuffer
    $ActionExecOnlyWhenPreviousIsSuspended off

But if Rsyslog resolves all three instances of the name 'log' with a
single call to gethostbyname(), this won't work. I could probably hack
around it with some additional round robin entries similar to 'log',
but named 'log-all-1', 'log-all-2', 'log-all-3', etc., to force an
independent lookup attempt for each:

    *.* @@log
    $ActionExecOnlyWhenPreviousIsSuspended on
    & @@log-all-1
    & @@log-all-2
    & @@log-all-3
    & @@log-all-4
    & @@log-all-5
    & @@log-all-6
    ...
    & /var/log/localbuffer
    $ActionExecOnlyWhenPreviousIsSuspended off

But this is pretty messy. First, there are a lot of extra DNS records to
maintain, which is a real pain. Second, I have to use a lot of extra
round-robin names (definitely more than 3) to maximize the probability
that gethostbyname() will return at least one working server before
reaching the end of the list. (gethostbyname() isn't aware of downed
hosts, nor does it enforce balance in its responses, so it might
return the same bad server several times in a row.)
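
To put a rough number on "a lot of extra round-robin names", here is a
small Python sketch. It assumes each lookup is an independent, uniformly
random pick among the receivers' addresses, which is an idealization (as
noted above, the resolver does not enforce balance), and it assumes every
action triggers its own lookup, which is exactly the open question. The
function name is hypothetical.

```python
from fractions import Fraction

def prob_all_lookups_hit_down_host(total, down, lookups):
    """Probability that every one of `lookups` independent, uniform
    picks among `total` receivers lands on one of the `down` ones."""
    if total <= 0 or not 0 <= down <= total or lookups < 0:
        raise ValueError("need total > 0, 0 <= down <= total, lookups >= 0")
    return Fraction(down, total) ** lookups

# Three receivers, six lookup attempts before falling back to the local file.
print(prob_all_lookups_hit_down_host(3, 1, 6))         # 1/729
print(float(prob_all_lookups_hit_down_host(3, 2, 6)))  # about 0.088
```

Under those assumptions, six names are plenty when one of three receivers
is down, but with two down nearly 9% of messages would still fall through
to the local buffer.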

It would be nice to have a random variant of the
'$ActionExecOnlyWhenPreviousIsSuspended' functionality. As a
hypothetical example I just cooked up off the top of my head:

    *.* @@log-alfa
    $ActionExecPickRandom on
    $ActionExecPickRandomRetryWait 1
    $ActionExecPickRandomRetryLimit -1
    & @@log-bravo
    & @@log-charlie
    $ActionExecPickRandom off

where Rsyslog randomly selects one of log-alfa, log-bravo, and
log-charlie, initially, and then makes another random pick at
failover, etc. I can think of all sorts of nifty options and
configurable knobs that would come in handy, here, too.
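
To make the intended behavior of that hypothetical option a bit more
concrete, here is a minimal Python sketch of the selection logic. None of
this exists in Rsyslog. The interpretation of the two knobs is an
assumption: retry_wait is the pause in seconds between attempts, and
retry_limit is the maximum number of retries after the first attempt,
with -1 meaning "keep going until every receiver has been tried once".
The send callback stands in for whatever actually delivers the message.

```python
import random
import time

def send_with_random_failover(message, receivers, send, retry_wait=1.0,
                              retry_limit=-1, rng=random, sleep=time.sleep):
    """Try receivers in random order; return the one that accepted
    the message, or None if every attempted receiver failed."""
    order = list(receivers)
    # A shuffled list is equivalent to repeated random picks without
    # re-picking a receiver that has already failed.
    rng.shuffle(order)
    if retry_limit >= 0:
        order = order[:retry_limit + 1]  # first attempt plus allowed retries
    for attempt, target in enumerate(order):
        if attempt > 0:
            sleep(retry_wait)  # pause before each failover attempt
        if send(target, message):
            return target
    return None

# Illustration only: pretend log-bravo is down.
def fake_send(target, message):
    return target != "log-bravo"

chosen = send_with_random_failover(
    "test message", ["log-alfa", "log-bravo", "log-charlie"],
    fake_send, retry_wait=0)
print("delivered via", chosen)
```

A caller that gets None back would fall through to the local file, much
like the final '& /var/log/localbuffer' action in the earlier examples.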

I'm planning to try out my first two examples, tomorrow, to see what
works. If anybody has any comments, I'd love to hear them.

Ryan B. Lynch
[email protected]
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com
