Contegix Network Incident Report

Contegix Notifications Mon, 06 Jul 2009 00:47:34 -0700

Contegix Customer:

Please do not reply to this email.  If you have any questions, please submit a 
support request to [email protected].

At approximately 11:39 AM on July 2nd, our NOC engineers began to receive several monitor alarms alerting us to a potential network issue. We found that our core switches were dropping packets for both internal and external traffic.

We began to investigate and found abnormal traffic lights on one of our intrusion prevention systems. At that time, we believed this to be the cause and physically bypassed the units. We quickly determined that this was not the root cause, as the problem persisted. We then began to troubleshoot our core switching.

At approximately 11:59 AM, we determined there was a multicast packet storm on our network. Due to the high number of packets, the CPUs in both core switches reached maximum capacity, which caused packet loss. After further debugging, we found that the storm was from a routing protocol (VRRP-E) multicast IP and originated from a specific customer core switch port. The customer connected to this port had experienced a switch malfunction a few minutes prior to the network issue, and we determined this could be the cause. At approximately 12:05 PM, we disabled the customer port and the CPUs on our core switches began to stabilize.

Network availability to internal and external destinations was restored, but we found that we still could not reach a few external destinations. Also, traffic was increasing on our network but had not returned to normal utilization. After further troubleshooting, we found that we could not route out through Level(3)’s network. Based on our observations and data, we could not determine the reason for the Level(3) issues. At approximately 12:19 PM, we disabled BGP with Level(3). Once this was disabled, our network returned to normal and traffic flowed through to outbound routes correctly.

While the issue started when a customer replaced a switch, we do not believe this is the direct cause. We do suspect that it triggered a bug in our core switch software despite all engineered precautions. We are working closely with the hardware manufacturer to determine the exact cause. We will forward any new information on this issue and its long-term resolution. In the interim, we have placed a moratorium on adding new customer switching equipment connected to our core switches. In addition, we restored our BGP session with Level(3) once it was determined to be safe.

We apologize for any inconvenience this may have created for you or your customers. Our reliable network is one of our great assets, and we place a great deal of emphasis on making sure it is working optimally. As mentioned before, we are working closely with the switch manufacturer to identify and fix this bug to make sure this does not occur again.



Sincerely,
Contegix Support

---
Contegix
900 Walnut Street
Suite 700
Saint Louis, MO  63102
Phone: 314.622.6200 ext. 3
Toll Free: 877.4.CONTEGIX ext. 3
Fax: 314.621.4422
E-mail: [email protected]
Beyond Managed Hosting(r) for Your Enterprise
Favorite Linux-Friendly Hosting Company - Linux Journal
http://www.contegix.com/linuxjournal
