Re: [networking-discuss] Design review for the VRRP project

Huafeng Lu Fri, 31 Oct 2008 03:25:58 -0700

On 2008-10-31 02:56, James Carlson wrote:
> Huafeng Lu writes:
>> On 2008-10-28 13:10, Huafeng Lu wrote:
>>> I think it is.  SMF potentially solves some interesting issues,
>>> including:
>>>
>>>   - It allows administrators to specify what applications are the
>>>     dependencies for VRRP's control of an address.  As a user, I can
>>>     add a dependency in my http:apache2 instance saying that
>>>     vrrp:apache2 should not come on-line unless Apache is running.  If
>>>     the server fails, the system takes vrrp:apache2 off-line, and we
>>>     get the expected fail-over.
>> Hi, Jim,
>>
>> The dependency can be considered some kind of "health check" mechanism 
>> when VRRP is used to protect a service. I think it's the biggest 
>> advantage of factoring VRRP into SMF. When VRRP is used to handle host 
>> failure (the "standard" goal of VRRP, as specified in the RFC), no 
>> dependency is needed.
> 
> The RFC talks about protecting the forwarding service ... where does
> it talk about host failure?


To protect the forwarding service, VRRP on the backup host listens for
the heartbeat from the peer (master) host. If VRRP doesn't hear a
heartbeat for a period of time, it moves the forwarding service from the
peer host to this host.

In most cases, "not hearing the heartbeat" means the peer host has
failed. Of course it could be an interface failure or a link failure,
but the VRRP protocol is unable to distinguish between them.
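
As a rough sketch of that timeout, assuming the VRRPv2 timer definitions
in RFC 3768 (Skew_Time = (256 - Priority) / 256 and Master_Down_Interval
= 3 * Advertisement_Interval + Skew_Time); the function names are made up
for illustration and are not part of any design:

```python
# Timer arithmetic as defined in RFC 3768 (VRRPv2).
# Function names are illustrative only.

def skew_time(priority: int) -> float:
    """Skew_Time in seconds: (256 - Priority) / 256."""
    return (256 - priority) / 256.0


def master_down_interval(priority: int, advert_interval: float = 1.0) -> float:
    """Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time."""
    return 3 * advert_interval + skew_time(priority)


def backup_should_take_over(now: float, last_advert: float,
                            priority: int, advert_interval: float = 1.0) -> bool:
    """True once no advertisement has been heard for Master_Down_Interval.

    The backup cannot tell *why* the heartbeat stopped (host, interface,
    or link failure); it only sees the silence.
    """
    return (now - last_advert) >= master_down_interval(priority, advert_interval)
```

Note that the decision takes only timestamps as input, which is why the
protocol cannot distinguish the different kinds of failure.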
> 
>> But there's a problem. Take the http service as an example. As you 
>> pointed out, vrrp:apache2 depends on http:apache2, so VRRP should run 
>> after the web server is started. On the other hand, VRRP is used to 
>> protect the VRIP that is shared by the master and backup hosts. 
>> Before VRRP is running, the VRIP may not appear on the host, so the web 
>> server application may not be able to run. (Some application may bind to 
>> INADDR_ANY, but we cannot assume how the applications are designed.)
> 
> I think that issue is intimately tied up with another problem: dealing
> with the "no-receive-but-sometimes-receive" behavior on the non-owning
> system for normal (routing) VRRP.
> 
> It would be nice if both systems behaved normally, meaning that the IP
> address is configured on both systems and the application binds to
> that address without trouble.  On the backup system, the application
> just sits there idle and doesn't see requests because VRRP has
> disabled that logic.
> 
> The reason I suggest doing it that way is that you end up with fewer
> moving parts in the end -- fewer components are involved in the actual
> fail-over process -- and fewer parts means that there's less to go
> wrong.  You don't have to answer the inevitable questions about
> switching, such as "what if I fail over, and then the application is
> unable to bind to the address?"


Understood. Using some method (still being discussed and not yet
decided), the VRIP can be configured on both hosts, but the backup
system doesn't send or respond to ARP, ND or RA messages, so the outside
world is not aware of its existence. Thus, the applications can start on
both hosts, but traffic only goes to the master host.
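
A toy model of that behavior, just to make the two properties explicit.
The class and method names are hypothetical, and the mechanism that
suppresses ARP/ND on the backup is exactly the part that is not decided:

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Toy host model; not tied to any real interface."""
    name: str
    state: str = "BACKUP"              # "MASTER" or "BACKUP"
    addresses: set = field(default_factory=set)

    def can_bind(self, addr: str) -> bool:
        # An application can bind as long as the address is configured,
        # regardless of the VRRP state.
        return addr in self.addresses

    def answers_arp(self, addr: str) -> bool:
        # Only the master makes the address visible to the outside world.
        return addr in self.addresses and self.state == "MASTER"


vrip = "192.0.2.10"                     # documentation address, example only
a = Host("a", state="MASTER", addresses={vrip})
b = Host("b", state="BACKUP", addresses={vrip})
```

Here both `a.can_bind(vrip)` and `b.can_bind(vrip)` are true, but only
`a.answers_arp(vrip)` is. A fail-over only flips the `state` fields; the
applications don't need to rebind.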

But there's still a problem. VRRP must run to set up the VRIP before
applications can bind to the VRIP, so semantically the application
depends on VRRP. Once the application is running, VRRP needs to depend
on it in order to protect it. Thus, VRRP and the application depend on
each other. It seems that such circular dependencies are not easy for
SMF to handle.
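
To illustrate the loop, here is a generic cycle check over a dependency
map. This is only a sketch of the problem; it says nothing about how SMF
itself evaluates dependencies, and the dictionary layout is an
assumption made for the example:

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of names, or None.

    deps maps a service name to the list of services it depends on.
    """
    WHITE, GREY, BLACK = 0, 1, 2        # unvisited / on current path / done
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in deps.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:           # back edge: nxt is on the current path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for name in list(deps):
        if color.get(name, WHITE) == WHITE:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


# The app needs the VRIP (set up by VRRP); VRRP needs the app to be healthy.
mutual = {
    "http:apache2": ["vrrp:apache2"],
    "vrrp:apache2": ["http:apache2"],
}
```

With `mutual`, `find_cycle` reports the loop between the two instances;
if the http:apache2 -> vrrp:apache2 edge is dropped, it returns None.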

We may try:

Method 1: In http:apache2's start method, first create vrrp:apache2
(with no dependencies) to set up the IP, then start the httpd program,
then add the dependency to vrrp:apache2. This probably won't work: all
of this happens inside the start method, and until the start method
exits, http:apache2 is not yet "online". So once we add the dependency,
vrrp:apache2 (logically) shouldn't be considered "online" either, and
shouldn't work normally.

Method 2: Do all necessary setup and cleanup in vrrpadm. In "vrrpadm
create", first create the VRRP instance and the SMF vrrp:apache2
instance (which depends on http:apache2). vrrp:apache2 should be in the
offline state since its dependency (http:apache2) is not ready, but the
VRRP instance can stay in the INIT state so that the VRIP can be set up.
Next, start the http:apache2 service; this can start successfully since
the VRIP is ready. Finally, send a "startup" event to the VRRP instance
to bring vrrp:apache2 on-line. This should probably work, but it looks
quite like a hack.

If a VRRP instance is not factored as an SMF service instance, things
would be easier:

Method 3 (modification of method 1): The VRRP instance is not an SMF
instance. We can change http:apache2's start method to first create the
VRRP instance to set up the VRIP and then start apache; similarly, in
the stop method, destroy the VRRP instance after killing apache. Thus,
if http:apache2 is killed by accident or disabled intentionally using
"svcadm disable", the stop method is called to destroy the VRRP
instance. (When it is killed by accident, SMF will probably call the
start method to bring it online again. The time gap between destroying
and re-creating the VRRP instance should be short enough that the peer
won't become MASTER.)

If we want to use a VRRP instance to protect multiple services that are
using the same IP address, we can add the same "vrrpadm create" command
to each of their start methods, although only one of them will
successfully create the instance (the others will fail with "object
already exists"). Similarly, all stop methods have the same "vrrpadm
destroy" command, so the first failure of any of these services will
cause the fail-over.
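
A small simulation of that shared-instance behavior. The registry class
below is a hypothetical stand-in for "vrrpadm create" / "vrrpadm
destroy"; it is not the real command's interface, only the
create-once / destroy-on-first-stop semantics described above:

```python
class VrrpRegistry:
    """Toy stand-in for the set of VRRP instances on one host."""

    def __init__(self):
        self.instances = set()

    def create(self, name: str) -> bool:
        # Mirrors "object already exists": the second create fails.
        if name in self.instances:
            return False
        self.instances.add(name)
        return True

    def destroy(self, name: str) -> bool:
        if name not in self.instances:
            return False
        self.instances.remove(name)
        return True


def service_start(registry: VrrpRegistry, vr_name: str) -> None:
    # A failed create is ignored: another service already set up the VRIP.
    registry.create(vr_name)
    # ... start the daemon here ...


def service_stop(registry: VrrpRegistry, vr_name: str) -> None:
    # ... stop the daemon here ...
    registry.destroy(vr_name)


reg = VrrpRegistry()
service_start(reg, "vr1")      # first service creates the instance
service_start(reg, "vr1")      # second service reuses it
service_stop(reg, "vr1")       # first stop removes it for everyone
```

After the single `service_stop`, the instance is gone even though the
second service is still running, which is what triggers the fail-over.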

This solution looks cleaner. Thus, VRRP instances can be divided into
two types:

(1) To handle host failure. They're persistent; their configuration is
stored in the configuration file. They will be created after system boot
when svc:/network/vrrp:default is started.

(2) To protect certain service(s). These should be created temporarily
by the services. "Temporarily", because they're associated with the
services, so they should be created and destroyed by the services. Their
configuration is not stored in the configuration file and they won't be
started by vrrp:default.

> Plus, it breaks a feedback loop: it looks like the health of the
> service affects whether VRRP can fail over by changing its state, and
> VRRP failing over forces the service to change state.  It'll be hard
> to show that a system like that doesn't have obscure failure modes.
> 
> If the architecture is tied to VNICs (and there's no plan to separate
> that), then that gives us a potential handle for controlling how the
> system responds.  There appear to be at least two separate switches
> here: allow low-level traffic in and out (for ARP, ND, RDISC), and
> allow application-level traffic in and out.
> 
> Just a thought.  I think there may be other ways to implement this as
> well.
> 
>> Note, in most cases, people will use a real IP address that is on the 
>> master host as the VRIP, so the application may run on the master. But 
>> it's also possible that the VRIP is not actually on either host.
> 
> If it's not on either host, then what does VRRP do?

Sorry, I should've stated it more clearly.
By "not on either host", I mean the VRIP is not a "real" IP address that
is already configured on a host. When VRRP is in control, the VRIP will
appear on one of the hosts.

>> Further, the application should run normally on both hosts (whether 
>> backup or master), and when the backup becomes master, the new-master 
>> can take the role naturally. But the VRIP can be on only one of the 
>> hosts at a given time. So even if VRRP runs before the application starts, 
>> it won't work. A possible solution is to register external scripts to 
>> start the application when VRRP enters the MASTER state, and to stop the 
>> application when it leaves. But this looks too cumbersome.
>>
>> So it looks like VRRP is not suitable to protect such local services.
> 
> I think that not protecting local services means that the "high
> availability" story has a pretty big hole.  It's able to protect
> against the relatively unlikely event of a previously-working but
> now-failed Ethernet port, but not against the more likely failure of
> software.
> 
> Maybe I don't understand enough about the marketing requirements for
> VRRP, but I thought that part of the point of VRRP was to protect the
> service and not just the interfaces.

My understanding is that VRRP was originally designed for routers (the
forwarding service), so whether it can be used for other purposes
really depends on the implementation. Yesterday I thought the VRIP could
only appear on one host (according to the current design), so VRRP would
be unable to protect a service; but if we can have the VRIP configured
on both hosts (although the address on the backup host is not seen by
the outside world, as you suggested), VRRP can be used to protect
services. Please correct me if I'm wrong.

Thanks.
--
Huafeng
_______________________________________________
networking-discuss mailing list
[email protected]
