On Wed, Aug 15, 2007 at 12:06:36PM +0800, Chengchen Hu wrote:
> I find that link recovery is sometimes very slow when a failure occurs 
> between different ASes. The outage may last hours. In such cases, it seems 
> that the automatic recovery of BGP-like protocols fails and the repair is 
> taken over manually. 
> 
> We should still remember the Taiwan earthquake in Dec. 2006, which damaged 
> almost all the submarine cables. The network condition was quite terrible in 
> the following few days. One might need minutes to load a US web page from 
> Asia. However, two main cables luckily escaped damage. Furthermore, we 
> actually have more routing paths, e.g., between Asia and Europe over the 
> trans-Russia networks of Rostelecom and TransTeleCom. With these redundant 
> paths, the condition should not have been that horrible.

Please see the presentation I made at AMSIX in May (original version by Todd at 
Renesys): http://www.thedogsbollocks.co.uk/tech/0705quakes/AMSIXMay07-Quakes.ppt

BGP failover worked fine; much of the instability occurred after the cable 
cuts, as operators found their networks congested and tried to manually move 
to new, uncongested routes.

(Check slide 4.) The simple fact is that with something like 7 of 9 cables 
down, the redundancy was useless. Even if operators had maintained N+1 
redundancy, which is unlikely for many operators, that would imply 50% of 
capacity was actually in use with 50% spare. However, we see that around 78% 
of capacity was lost. There was simply too much traffic and not enough 
capacity. IP backbones fail pretty badly when faced with extreme congestion.
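
As a rough back-of-the-envelope check of those figures, here is a minimal 
sketch. It assumes all 9 cables carried equal capacity and that demand stayed 
at its pre-quake level, both of which are simplifications; the function names 
are purely illustrative.

```python
def surviving_fraction(total_cables: int, cables_down: int) -> float:
    """Fraction of capacity left, assuming every cable has equal capacity."""
    if not 0 <= cables_down <= total_cables or total_cables <= 0:
        raise ValueError("need 0 <= cables_down <= total_cables and total_cables > 0")
    return (total_cables - cables_down) / total_cables


def load_ratio(utilization_before: float, total_cables: int, cables_down: int) -> float:
    """Offered load divided by surviving capacity (above 1.0 means congestion)."""
    remaining = surviving_fraction(total_cables, cables_down)
    if remaining == 0:
        raise ValueError("no capacity left; the ratio is undefined")
    return utilization_before / remaining


# 7 of 9 cables down, with 50% of capacity in use beforehand (the N+1 case above).
lost = 1 - surviving_fraction(9, 7)
ratio = load_ratio(0.5, 9, 7)
print(f"capacity lost: {lost:.0%}")
print(f"offered load vs surviving capacity: {ratio:.2f}x")
```

Under those assumptions, even an operator running at only 50% utilization 
would be offering a bit more than twice the traffic the surviving cables could 
carry, so no amount of rerouting closes the gap.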


> And here is what I'd like to discuss with you, especially the network 
> operators:
> 1. Why do BGP-like protocols sometimes fail to recover the path? Is it 
> mainly because of the policies set by the ISPs and network operators?

No, BGP was fine. This was a congestion issue, ultimately caused by a lack of 
resiliency in the cable routes in and out of the region.

> 2. What actions will a network operator take when such failures occur? Is 
> it something like: 1) find alternative path(s); 2) negotiate with other 
> ISPs if needed; 3) modify the policy and reroute the traffic? Which actions 
> may be time consuming?

Yes, and as the data shows, this only made a bad situation worse. Any routes 
that may have had capacity were soon overwhelmed.

> 3. There may be more than one alternative path. What is the criterion for 
> the network operator to finally select one or some of them?

Pick one that works? But in this case no such option was available. 

> 4. What information is required for a network operator to find the new route?  

In the case of a BGP change, presumably the operator checks that the new path 
appears to function without excessive latency (a traceroute would be a basic 
way to check). 

In terms of a real fix, it can't be done with BGP; you would need to find 
unused Layer 1 capacity and plug in a new cable. Slides 28-31 show that this 
occurred, with Asian networks picking up westward paths to Europe, but it took 
some manual intervention, time, and money.


I think the real question, given the facts around this, is whether South East 
Asia will look to protect against a future failure by providing new routes 
that circumvent single points of failure such as the Luzon Strait near Taiwan. 
But that costs a lot of money .. so the future's not hopeful!

Steve
