Except in those instances (now becoming less rare than hardware failure) where the software controlling the failover process is itself the cause of the outage.

Owen

On Jun 23, 2011, at 5:44 AM, -Hammer- wrote:

> Agreed. At an enterprise level, there is no need to risk extended downtime to
> save a buck or two. Redundant hardware is always a good way to keep Murphy
> out of the equation. And as far as hardware failures go, it's not that
> common. Nowadays it's the bugs in overly complicated code on your gear that
> get you first. I miss IOS 11.3.....
>
> -Hammer-
>
> On 06/23/2011 01:07 AM, Bret Palsson wrote:
>> That's fine if you are running a website. When it comes to
>> telecommunications, a 15 minute outage is pretty huge. Especially with
>> certain types of customers: emergency services for example.
>>
>> -Bret
>>
>> On Jun 23, 2011, at 12:02 AM, Hank Nussbacher wrote:
>>
>>> At 20:42 22/06/2011 -0700, Jason Roysdon wrote:
>>>
>>> Let me be a bit of a heretic here. How often does your router fail? Or
>>> your firewall? In the 25 years I have gone into customers I have found
>>> when they did a cross setup as proposed below by Bret and Jason, only one
>>> person truly knew the complete setup and if something broke only he was
>>> able to fix it. There is never complete printed documentation: routing
>>> design, IPs on all interfaces, subnetting schematic, etc. And if there was
>>> at one point, after 2 years it was outdated and never updated and only the
>>> *1* guy knew the changes in his head.
>>>
>>> In that kind of situation, when something stopped working they always had
>>> to call in the "guru" to fix it. On the other hand, a simple design of
>>> only *one* path (pick either left or right side of each of the ASCII arts),
>>> made it possible that even junior network engineers as well as technicians
>>> called in on emergency with 4 hours notice, were able to fix the situation
>>> much more quickly than the "cross" design. And the MTBF on a single path
>>> solution, IMHO, is around 3-4 years. And if you need redundancy, keep a
>>> spare box on a shelf, completely loaded with the latest config so that it
>>> can be hot-swapped in within 15 minutes of failure.
>>>
>>> This 1-path design is not for everyone. The vendors always recommend the
>>> "cross" design since they sell 2x the amount of boxes but I have found that
>>> life works fine with just a 1-path design as well.
>>>
>>> -Hank
>>>
>>>> I second the static routes, especially from a simplicity standpoint. Add
>>>> in a pair of layer two switches to simplify further:
>>>>
>>>> +--------+    +--------+
>>>> | Peer A |    | Peer A |<-Many carriers. Using 1 carrier
>>>> +---+----+    +----+---+  for this scenario.
>>>>     |eBGP          | eBGP
>>>>     |              |
>>>> +---+----+iBGP+----+---+
>>>> | Router +    + Router |<- Routers. Not directly connected
>>>> +-+------+    +------+-+
>>>>   |                  |
>>>> +-+------+    +------+-+
>>>> |L2Switch|----|L2Switch|<- Layer 2 switches, can be stacked
>>>> +--------+    +--------+
>>>>   |                  |
>>>> +-+------+    +------+-+
>>>> |Act. FW |----|Pas. FW |<-Firewalls Active/Passive.
>>>> +--------+    +--------+
>>>>
>>>> You can lose all of the left leg, or all of the right leg, and still be
>>>> up. If you want to complicate things, you can add crossing links
>>>> between it all, but again, beyond BGP and VRRP, this is a very simple
>>>> design you can easily troubleshoot at 3AM. It's also much easier to
>>>> document the troubleshooting steps (so you can go on vacation and
>>>> someone else can solve without calling you) and test upgrades.
>>>>
>>>> You can nearly evenly split the traffic by having a VRRP VIP on each
>>>> edge router, with the other router backing up the first. The firewalls
>>>> can have two static routes, one to each VIP, and this will roughly
>>>> load-balance the traffic out on a packet basis. As you peer with the
>>>> same ISP, this will work just fine. If they have an outage, your edge
>>>> routers will learn, and even if the circuit drops it'll know, and
>>>> basically the VIP will just redirect traffic to the other router.
>>>>
>>>> Now all your firewalls have to do is maintain stateful session
>>>> information, not OSPF.
>>>>
>>>> If you had two different ISPs (especially if they are not roughly evenly
>>>> connected), then not having intelligence of the BGP paths in your
>>>> firewalls can cause an extra hop when it hits the router with the longer
>>>> path, which will redirect it to the router with the shorter path.
>>>>
>>>> Speaking from a Cisco/HSRP point of view, you could be more intelligent
>>>> (re:more complicated, and complication means harder troubleshooting and
>>>> more documentation needed) during problem periods by having the VIP move
>>>> routers automatically based on the WAN link dropping and/or a route
>>>> beyond it being lost (others can comment on whether VRRP supports this).
>>>> This would save one hop to the "broken" router when the BGP path or WAN
>>>> is down.
>>>>
>>>> Jason Roysdon
>>>>
>>>> On 06/22/2011 06:07 PM, Bret Palsson wrote:
>>>>
>>>>> On Wed, Jun 22, 2011 at 5:33 PM, PC<[email protected]> wrote:
>>>>>
>>>>>> Who makes the firewall?
>>>>>>
>>>>> Juniper SSG. We use NSRP and replicate all the RTOs. We have hitless on the
>>>>> Firewalls, have for years. We're now peering with our own carriers vs. using
>>>>> our datacenter's mix.
>>>>>
>>>>> A static route from the junipers to the VIP (VRRP) is probably the way to
>>>>> go. I think.
>>>>>
>>>>>> To make this work and be "hitless", your firewall vendor must support
>>>>>> stateful replication of routing protocol data (including OSPF). For
>>>>>> example, Cisco didn't support this in their ASA product until version 8.4 of
>>>>>> code.
>>>>>>
>>>>>> Otherwise, a failover requires OSPF to re-converge -- and quite frankly,
>>>>>> will likely cause some state of confusion on the upstream OSPF peers, loss
>>>>>> of adjacency, and a loss of routing until this occurs. It's like someone
>>>>>> just swapped a router with the same IP to the upstream device -- assuming
>>>>>> your active/standby vendor's implementation only presents itself as one
>>>>>> device.
>>>>>>
>>>>>> However, once this is successful your current failover topology should work
>>>>>> fine -- even if it takes some time to failover.
>>>>>>
>>>>>> In my opinion though, unless the firewall is serving as "transit" to
>>>>>> downstream routers or other layer 3 elements, and you need to run OSPF to it
>>>>>> (And through it) as a result, it's often just easier to static default route
>>>>>> out from the firewall(s) and redistribute a static route on the upstream
>>>>>> routers for the subnets behind the firewalls. It also helps ensure
>>>>>> symmetrical traffic flows, which is important for stateful firewalls and can
>>>>>> become moderately confusing when your firewalls start having many
>>>>>> interfaces.
>>>>>>
>>>>>> On Wed, Jun 22, 2011 at 4:27 PM, Bret Palsson<[email protected]> wrote:
>>>>>>
>>>>>>> Here is my current setup in ASCII art. (Please view in a fixed width
>>>>>>> font.) Below the art I'll write out the setup.
>>>>>>>
>>>>>>> +--------+    +--------+
>>>>>>> | Peer A |    | Peer A |<-Many carriers. Using 1 carrier
>>>>>>> +---+----+    +----+---+  for this scenario.
>>>>>>>     |eBGP          | eBGP
>>>>>>>     |              |
>>>>>>> +---+----+iBGP+----+---+
>>>>>>> | Router +----+ Router |<-Netiron CERs Routers.
>>>>>>> +-+------+    +------+-+
>>>>>>>   |A  `.P      A.'   |P<-A/P indicates Active/Passive
>>>>>>>   |      `.  .'      |   link.
>>>>>>>   |        ::        |
>>>>>>> +-+------+'  `+------+-+
>>>>>>> |Act. FW |    |Pas. FW |<-Firewalls Active/Passive.
>>>>>>> +--------+    +--------+
>>>>>>>
>>>>>>> To keep this scenario simple, I'm multihoming to one carrier.
>>>>>>> I have two Netiron CERs. Each has an eBGP connection to the same peer.
>>>>>>> The CERs have an iBGP connection to each other.
>>>>>>> That works all fine and dandy. Feel free to comment, however, if you think
>>>>>>> there is a better way to do this.
>>>>>>>
>>>>>>> Here comes the tricky part. I have two firewalls in an Active/Passive
>>>>>>> setup. When one fails the other is configured exactly the same
>>>>>>> and picks up where the other left off. (Yes, all the sessions etc. are
>>>>>>> actively mirrored between the devices)
>>>>>>>
>>>>>>> I am using OSPFv2 between the CERs and the Firewalls. Failover works just
>>>>>>> fine, however when I fail an OSPF link that has the active default route,
>>>>>>> ingress traffic still routes fine and dandy, but egress traffic doesn't.
>>>>>>> Both Netiron's OSPF are set up to advertise they are the default route.
>>>>>>>
>>>>>>> What I'm wondering is, if OSPF is the right solution for this. How do
>>>>>>> others solve this problem?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Bret
>>>>>>>
>>>>>>> Note: Since lately ipv6 has been a hot topic, I'll state that after we get
>>>>>>> the BGP all figured out and working properly, ipv6 is our next project. :)
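
For a rough sense of scale, the single-path figures quoted in the thread (an MTBF of around 3-4 years, a shelf spare swapped in within 15 minutes, or a technician called in on 4 hours notice) can be turned into expected downtime per year. The sketch below is only illustrative: it assumes the textbook steady-state formula availability = MTBF / (MTBF + MTTR), takes 3.5 years as a midpoint of the quoted MTBF, and treats the 15-minute and 4-hour figures as the whole repair time. The function names are hypothetical, and it says nothing about outages caused by failover software, which nobody in the thread put a number on.

```python
HOURS_PER_YEAR = 24 * 365  # ignoring leap years for a rough estimate


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the path is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(mtbf_hours, mttr_hours):
    """Expected minutes of outage per year for a single path."""
    unavailable = 1.0 - availability(mtbf_hours, mttr_hours)
    return unavailable * HOURS_PER_YEAR * 60


if __name__ == "__main__":
    # Assumption: 3.5 years as the midpoint of the "3-4 years" MTBF estimate.
    mtbf = 3.5 * HOURS_PER_YEAR
    scenarios = [
        ("shelf spare swapped in 15 minutes", 0.25),
        ("technician arrives after 4 hours", 4.0),
    ]
    for label, mttr in scenarios:
        minutes = downtime_minutes_per_year(mtbf, mttr)
        print(f"{label}: {availability(mtbf, mttr):.6%} available, "
              f"about {minutes:.1f} min/year down")
```

Note that an annual average hides the point raised about telecommunications customers: the outage arrives as one 15-minute (or longer) event every few years, not as a few minutes spread evenly across each year.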

