On Sep 25, 2005, at 9:39 PM, Jason Dixon wrote:
On Sep 25, 2005, at 8:30 AM, Neil wrote:
Yep, the same behavior when the master dies. The solution the
person in #pf gave me is to use routing, but I don't know how to
implement it. He told me that it's an issue in pf's NAT.
Bullshit.
Ok, here is the layman's description of the problem and the
practical solution(s) to it. I'd love to be able to explain why
interfaces recovering from INIT don't reclaim MASTER faster than
they do (approx 30 seconds in my tests), but I don't understand the
code-level logistics of everything. Hint: This is only a problem
using single CARP hosts with preemption.
PROBLEM:
With a simple CARP design using a single CARP host on each segment
and preemption enabled, failover occurs as expected in the case of
any system offline condition (server crashes, admin reboots, etc).
If a single interface goes from MASTER to INIT state (cable gets
pulled, cable goes bad, card goes bad, etc), the 2nd interface on
that system will go into BACKUP mode as expected. Traffic will
route across the new MASTER, and will continue to do so while the
failed system is in an INIT/BACKUP state.
However, if the failed interface returns from INIT to an available
mode (we plug the cable in), we notice that the 2nd interface
reclaims MASTER almost immediately, but the restored interface does
not. It becomes a BACKUP host, which leaves us with a routing
impossibility:
I agree, a routing impossibility. Last week I built a lab to test and
build a new HA firewall. In my testing I did not see the 30 second
delay people are reporting. Both carp interfaces on the primary
would take over as MASTER within seconds of bringing the 'failed'
physical interface back online.
I started a large file download over http with everything running
through the primary firewall. I then pulled a cable and watched the
download of the file. It slowed slightly, but went right back to its
previous speed. (Like your scp demo at NYCBSDCON.) I actually
disconnected and reconnected the cable a bunch of times and the
download never stopped.
I did notice one strange thing. I have 3 physical interfaces and two
carp interfaces on each firewall. I noticed that if I was pinging
the external/carp0 address and failed things over, say by doing
'ifconfig rl0 down' the ping would continue with zero packet loss.
If I do the same thing on the internal/carp1, I see a small amount
of packet loss. I don't really care about that since most clients and
people are not going to notice. I've already tested and know that
downloads and other such things continue to work without a problem.
I found it strange that carp0 would have no packet loss while
carp1 would. I did not investigate the packet loss further, so I
don't know if maybe it was the hub/switch combo I'm using on the
inside vs external.
BACKUP        MASTER
carp0         carp0
  |             |
host1         host2
  |             |
carp1         carp1
MASTER        BACKUP
Any internal clients will attempt to send traffic through the "new
gateway" (host1), although neither system has any way of routing
the traffic properly (not without some hokey static routes
bypassing the CARP hosts). NOTE: I have found that the original
MASTER does indeed return to the correct state, approximately 30
seconds later. This is reproducible, but YMMV.
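
The split state in the diagram above can be sketched as a small model.
This is only an illustration of the description in this post, not of
how CARP itself is implemented: the Host class, the forwarding_host()
helper, and the rule that a host can pass traffic only when it is
MASTER on both segments are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical model of one firewall with two CARP interfaces."""
    name: str
    external: str  # state of carp0: "MASTER", "BACKUP" or "INIT"
    internal: str  # state of carp1: "MASTER", "BACKUP" or "INIT"


def forwarding_host(hosts):
    """Return the name of a host that is MASTER on both segments.

    Assumption for this sketch: traffic only flows end to end when one
    host holds MASTER on the internal and the external segment at the
    same time. Returns None when no host qualifies.
    """
    for host in hosts:
        if host.external == "MASTER" and host.internal == "MASTER":
            return host.name
    return None


# Normal operation: host1 is MASTER on both segments.
healthy = [
    Host("host1", external="MASTER", internal="MASTER"),
    Host("host2", external="BACKUP", internal="BACKUP"),
]

# The state from the diagram: MASTER is split across the two hosts.
split = [
    Host("host1", external="BACKUP", internal="MASTER"),
    Host("host2", external="MASTER", internal="BACKUP"),
]

print(forwarding_host(healthy))
print(forwarding_host(split))
```

In the healthy case the helper returns host1. In the split case it
returns None, which is the "routing impossibility" described above.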
SOLUTION:
1) If you really are concerned about a partial system failure
(unplugged cable, bad card, etc), then scrap the design with a single
CARP host per segment and use arpbalance with multiple CARP hosts. The
same partial-failure test using 2 CARP hosts on each segment with
arpbalance resulted in a perfect failover and recovery with no
packet loss.
2) This is not tested, but I suspect that you should be able to use
the new interface grouping features in 3.8 to simply assign
multiple physical interfaces to the same group. Even if one fails,
the other *should* maintain the MASTER state and avoid any partial
failure consequences. I'd love to hear from other users or
developers that have tried the grouping feature in this sort of
scenario.
Can you share where one might read more about the interface grouping
features of 3.8?
I'm using a snapshot from September 10th in my lab.
-Chad