On Sep 25, 2005, at 8:30 AM, Neil wrote:
> Yep, the same behavior when the master dies. The solution that the
> person in #pf told me is use routing but I don't know how to
> implement. He told me that it's an issue in pf's NAT.
Bullshit.
Ok, here is the layman's description of the problem and the practical
solution(s) to it. I'd love to be able to explain why interfaces
recovering from INIT don't reclaim MASTER faster than they do (approx
30 seconds in my tests), but I don't understand the code-level
logistics of everything. Hint: This is only a problem when using a
single CARP host per segment with preemption enabled.
PROBLEM:
With a simple CARP design using a single CARP host on each segment
and preemption enabled, failover occurs as expected in the case of
any system offline condition (server crashes, admin reboots, etc).
If a single interface goes from MASTER to INIT state (cable gets
pulled, cable goes bad, card goes bad, etc), the 2nd interface on
that system will go into BACKUP mode as expected. Traffic will route
across the new MASTER, and will continue to do so while the failed
system is in an INIT/BACKUP state.
However, if the failed interface returns from INIT to an available
mode (we plug the cable in), we notice that the 2nd interface
reclaims MASTER almost immediately, but the restored interface does
not. It becomes a BACKUP host, which leaves us with a routing
impossibility:

  BACKUP      MASTER
  carp0       carp0
    |           |
  host1       host2
    |           |
  carp1       carp1
  MASTER      BACKUP

Any internal clients will attempt to send traffic through the "new
gateway" (host1), although neither system has any way of routing the
traffic properly (not without some hokey static routes bypassing the
CARP hosts). NOTE: I have found that the original MASTER does
indeed return to the correct state, approximately 30 seconds later.
This is reproducible, but YMMV.
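
To make the stuck state concrete, here is a minimal Python sketch of the
check. It is a simplified model, not anything CARP itself runs: the
function name forwarding_hosts and the state dictionaries are
hypothetical, and it assumes a host is a usable gateway only when it
holds MASTER on both segments.

```python
# Toy model (hypothetical names): each host maps its CARP interfaces
# to the state reported by ifconfig (MASTER, BACKUP or INIT).

def forwarding_hosts(states):
    """Return the hosts that are MASTER on every CARP interface."""
    return [
        host
        for host, ifaces in states.items()
        if all(state == "MASTER" for state in ifaces.values())
    ]

# Normal operation: host1 owns both segments.
healthy = {
    "host1": {"carp0": "MASTER", "carp1": "MASTER"},
    "host2": {"carp0": "BACKUP", "carp1": "BACKUP"},
}

# State right after the cable is plugged back in (see diagram above).
split = {
    "host1": {"carp0": "BACKUP", "carp1": "MASTER"},
    "host2": {"carp0": "MASTER", "carp1": "BACKUP"},
}

print(forwarding_hosts(healthy))  # ['host1']
print(forwarding_hosts(split))    # []
```

An empty list is the "routing impossibility": no single host owns both
segments, so nothing can pass traffic end to end until the restored
interface reclaims MASTER.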
SOLUTION:
1) If you really are concerned about a partial system failure
(unplugged cable, bad card, etc), then scrap the single CARP host per
segment design and use arpbalance with multiple CARP hosts. The same
partial-failure test using 2 CARP hosts on each segment with
arpbalance resulted in a perfect failover and recovery with no packet
loss.
2) This is not tested, but I suspect that you should be able to use
the new interface grouping features in OpenBSD 3.8 to simply assign multiple
physical interfaces to the same group. Even if one fails, the other
*should* maintain the MASTER state and avoid any partial failure
consequences. I'd love to hear from other users or developers that
have tried the grouping feature in this sort of scenario.
--
Jason Dixon
DixonGroup Consulting
http://www.dixongroup.net