Interesting... I ran into a similar network meltdown last year during a 6509 to 6509-E migration. It was a fairly small switching network (servers only, no connected end users), complicated by blade server ESMs and by interim 4948 switches used during the migration.

The network ran fine for a day, then went down. The 6509-E switches logged duplicate HSRP address messages, which after some research pointed to a loop. Breaking a couple of redundant links by unplugging them brought things back to operational. Good thing the site diagram had those redundant links clearly marked. The sort of checks that confirm a loop like this are sketched below.
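These are roughly the things I'd look at on the distribution switches (from memory, so treat it as a sketch; exact command syntax and message IDs vary a bit by platform and IOS version):

! Duplicate HSRP address messages (%STANDBY-3-DUPADDR) mean the router is
! hearing its own hellos reflected back -- the classic signature of a
! forwarding loop.
show standby brief          ! both peers claiming Active is another giveaway
show spanning-tree detail   ! look for a rapidly climbing topology change count
show interfaces counters    ! looped ports show runaway broadcast/multicast counts
show mac-address-table      ! the same MAC flapping between ports confirms a loop

Pulling one of the redundant links breaks the loop immediately, as it did for us; the longer-term fix is making sure spanning tree is actually in a position to block it.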
As far as I can tell, either a blade server chassis did something to cause the outage (one of them was behaving erratically prior to it), or the network diameter grew beyond the recommended STP maximum of seven switches once the interim 4948s were daisy-chained together. I haven't had much experience with meltdowns, but I believe a meltdown usually occurs at a specific point in time, triggered by one or more immediate factors, rather than being something that gradually builds up over a long period (24 hours or more). That leads me to believe the blade server ESM had a part in it, but that's an unproven hypothesis. While looking up STP loops I did find that famous large-scale meltdown (at a hospital, I believe), which made for good reading.

Vijay Ramcharan

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of matthew zeier
Sent: March 19, 2008 11:49
To: [email protected]
Subject: [c-nsp] Cisco 3020 blade switches hung, HLFM errors, network meltdown?

I have an HP blade system with two WS-CBS3020-HPQ switches. The console logged the following during a period in which the entire network was unreachable:

(6444)msecs, more than (2000)msecs (719/326),process = HLFM address learning process.
-Traceback= 4794B0 479A4C 4799B0 2E9E64 4F788C 32D6C4 11B980 11BEF0 11D684 326BC4 322F90 323240 A86D34 A7D2FC
18w0d: %SYS-3-CPUHOG: Task is running for (2152)msecs, more than (2000)msecs (143/1),process = HLFM address learning process.

With their uplinks to the network disabled, the switches were still unreachable/unusable, even through Fa0/0. I had to reboot each one before I could telnet back in. Disconnecting them from the network brought the network back; reconnecting them melted it again.

It felt like a broadcast storm or even a spanning-tree loop, but I'd be surprised if it was the latter and the upstream switches, two 6500s, didn't know how to deal with it (heck, they deal with HP 2510s that default to not running spanning-tree).

From the log entries I could glean from the console buffer, it looks like the native VLAN on one of the port-channel members was inadvertently changed, and the port was marked as incompatible with the other bundle member; roughly what I mean is sketched at the end of this message. Still, I'm somewhat surprised that that hung the blade switches to the point that everything else became unusable. Any insights?

[I'm stuck in this place where the WS-CBS3020-HPQ's aren't registered with CCO and my reseller says I have to talk to HP for support...]
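If the native VLAN theory is right, the misconfiguration would look something like the following (interface and VLAN numbers are invented for illustration; I don't have the actual config in front of me):

! When one member's trunk settings differ from the rest of the bundle, IOS
! logs something like %EC-5-CANNOT_BUNDLE2 and suspends that port. The fix
! is simply to keep every member of the channel identical:
interface range GigabitEthernet0/17 - 18
 switchport mode trunk
 switchport trunk native vlan 10   ! must match on every bundle member
 channel-group 1 mode active       ! LACP; "mode on" bundles without negotiation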
_______________________________________________
cisco-nsp mailing list [email protected]
https://puck.nether.net/mailman/listinfo/cisco-nsp
archive at http://puck.nether.net/pipermail/cisco-nsp/