I have no VPLS in my network... Yet. 16T is a 16 port TenG blade, yes.
--
Sent from my mobile device

On 2013-03-15, at 1:03 PM, "Aaron" <[email protected]> wrote:

> Another commonality the tac pointed out to me amongst my me's that crashed
> is that they are all running the l2vpn vpls address family.
>
> What's 16T? ...16 Ten gig ?
>
> Aaron
>
>
> -----Original Message-----
> From: Jason Lixfeld [mailto:[email protected]]
> Sent: Friday, March 15, 2013 10:01 AM
> To: Aaron
> Cc: [email protected]
> Subject: Re: [c-nsp] whoa - asr9k wierd message AND 13 me3600's all rebooted
> at once!!
>
> Interesting. I just checked my archives and I have had two instances where
> LCs have rebooted due to that same error. XR versions spanned 4.2.0 -
> 4.2.3. You are running older code than I am. Both instances of my LCs
> f**king off were on two separate ASR9Ks and actually the first time was a
> 2/20 (on 4.2.0), the second time was a 16T (on 4.2.3) on Jan. 1 (Happy New
> Year to me! :|)
>
> SRs 622594207 and 624325505. Cards were RMAd both times.
>
> 15.3(1)S has been out since November and at the time of the LC crash on
> January 1, I only had 1 ME3600 deployed running 15.3(1)S. It has been up
> for 100 days, so it lasted beyond the LC crash.
>
> At this point, I'm more interested in the "theory" TAC has about the
> 15.3(1)S bug that they think might have triggered the reboots. If you can
> pass me the SR or drop me a note when you find out one way or the other, I'd
> be grateful. Also, if 15.3(1)S1 fixes that bug, that would be good
> information as well.
>
> On 2013-03-15, at 10:06 AM, "Aaron" <[email protected]> wrote:
>
>> 2 tac cases opened...one with ios team for me3600's and one opened
>> with ios xr team....
>>
>> Ios Cisco tac is still investigating (they want more crashinfo's and
>> running configs from me).... but thus far I have been told that my
>> 2/20 linecard in my asr9010 reloaded due to a double bit error (double
>> ecc (I believe is error correcting code)). Syslogs and cli output below.
>>
>> Ios xr cisco tac team says that he recommends replacing linecard
>> if/when it happens a second time
>>
>> Ios Tac eng said that when a bit changes in memory, it's correctable,
>> but when two bits change then it's uncorrectable and a reload on that
>> linecard occurs. Ios Tac eng said that the linecard in the asr9k
>> seems to have crashed prior to the me3600's reloading. This seems to
>> be seen also in that the syslog messages regarding the bgp down
>> messages with those me3600's started happening a few minutes after
>> 14:22:38 (when the asr9k linecard crashed)....i think bgp keepalives
>> default to 60 seconds and a bgp neighbor session doesn't time out
>> until 180 seconds ( I think 3*keepalives)
>>
>> Here is the cli output for that card ... Last Reset :
>> pfm_dev_sm_perform_recovery_action, Card reset requested by: Process ID:
>> 155724 (prm_server) : Thu Mar 14 19:24:00 2013
>>
>> Did you see that process id number ? 155724.....you will also see
>> that pid in the syslog messages.
>>
>> That's when the asr9k linecard reloaded and seems to have caused (13)
>> of my me3600's to reboot! These 13 me3600's are as follows....
>>
>> All run 15.3(1)S. they are scattered throughout my network...sparsely
>> located here and there....no real physical commonality among them.
>> All of these 13 me3600's run Mp-iBGP with dual RouteReflectors....one
>> of the RR's is on that asr9010. This mpibgp is for mpls l3vpn's. the
>> pe-ce on the me3600's is directly connected routing...that's it. The
>> pe-ce in my core to connect to my legacy ip net is ospf from dual pe-ce
>> feeds for redundancy.
>> The pe-ce dual links are between dual asr9k/7609-s pairs.....the
>> asr9k's are in fact the dual rr's also. One of them is that asr9010
>> that had a linecard crash. Speculation I heard from ios tac
>> yesterday regarding the
>> me3600 crash was maybe related to a cef route change bug in 15.3(1)S.
>> seems that perhaps when the asr9010 linecard crashed, the several
>> hundred routes learned via that pe-ce connection to the legacy 7609
>> propagated over the l3vpn and into the me3600's, thus causing them to
>> do cef/fib convergence and possibly converge over to the other
>> asr9k/7609 location....BUT this is only speculation about that being the
>> cause of the me3600 reloads for now....
>> more on that to come later hopefully from ios tac when I give them
>> more crashinfo's and running configs...
>>
>> Bear in mind, I have (4) more me3600's config'd same way as the
>> crashed ones and they DID NOT reboot....those (4) run 15.2.2S or
>> 15.2.4.S1
>>
>> Syslog messages...
>>
>> 2013-03-14 14:22:38 Local7.Emerg 9k 16328: LC/0/1/CPU0:Mar 14
>> 14:24:00.733 : pfm_node_lc[267]: %PLATFORM-NP-0-HW_DOUBLE_ECC_ERROR :
>> Set|prm_server[155724]|Network Processor Unit(0x1007001)|NP DOUBLE ECC
>> ERROR, NP=1, memId=18, subMemId=0x1
>> 2013-03-14 14:22:38 Local7.Emerg 9k 16329: LC/0/1/CPU0:Mar 14
>> 14:24:00.736 : pfm_node_lc[267]: %PLATFORM-PFM-0-CARD_RESET_REQ :
>> pfm_dev_sm_perform_recovery_action, Card reset requested by: Process ID:
>> 155724 (prm_server), Fault Sev: 0, Target node: 0/1/CPU0, CompId:
>> 0x1f, Device Handle: 0x1007001, CondID: 1001, Fault Reason: NP DOUBLE
>> ECC ERROR, NP=1, memId=18, subMemId=0x1
>> 2013-03-14 14:22:38 Local7.Critical 9k 16330: LC/0/1/CPU0:Mar 14
>> 14:24:00.737 : sysmgr[89]: %OS-SYSMGR-2-REBOOT : reboot required, process
>> (pfm_node_lc) reason (pfm_dev_sm_perform_recovery_action, Card reset
>> requested by: Process ID: 155724 (prm_server), Fault Sev: 0, Target node:
>> 0/1/CPU0, CompId: 0x1f, Device Handle: 0x1007001, CondID: 1001, Fault
>> Reason: NP DOUBLE ECC ERROR, NP=1, memId=18, subMemId=0x1)
>> 2013-03-14 14:22:38 Local7.Error 9k 16331: LC/0/1/CPU0:Mar 14
>> 14:24:00.741 : sysmgr[89]: %OS-LIBSYSMGR-3-PARSE : parse_args: parse error:
>> unmatched "
>> 2013-03-14 14:22:38 Local7.Error 9k 16333: LC/0/1/CPU0:Mar 14
>> 14:24:00.742 : sysmgr[89]: %OS-SYSMGR-3-ERROR :
>> sysmgr_shutdown_cleanup_handler: shutdown script execution timed-out!
>> Node will reset
>> 2013-03-14 14:22:38 Local7.Error 9k 16335: LC/0/1/CPU0:Mar 14
>> 14:24:00.743 : sysmgr[89]: %OS-SYSMGR-3-ERROR :
>> sysmgr_shutdown_cleanup_handler: shutdown triggered by (pfm_node_lc)
>> did not complete in 45 seconds, shutting down
>>
>>
>> RP/0/RSP0/CPU0:9k#admin sh plat summ location 0/1/CPU0
>> Fri Mar 15 08:17:12.824 CDT
>> -------------------------------------------------------------------------------
>>  Platform Node    : 0/1/CPU0 (slot 1)
>>  PID              : A9K-2T20GE-L
>>  Card Type        : 2-Port 10GE, 20-Port GE Low Queue LC, Req. XFPs and SFPs
>>  VID/SN           : V03 / FOC15078GST
>>  Oper State       : IOS XR RUN
>>  Last Reset       : pfm_dev_sm_perform_recovery_action, Card reset
>>                     requested by: Process ID: 155724 (prm_server)
>>                   : Thu Mar 14 19:24:00 2013
>>  Configuration    : Power is enabled
>>                     Bootup enabled.
>>                     Monitoring enabled
>>  Rommon Ver       : Version 1.03(20100212:011148)
>>  IOS SW Ver       : 4.1.2
>>  Main Power       : Power state Enabled. Estimate power 350 Watts of
>>                     power required.
>>  Faults           : N/A
>> -------------------------------------------------------------------------------
>>
>> RP/0/RSP0/CPU0:9k#sh instal summ
>> Fri Mar 15 08:17:44.055 CDT
>> Active Packages:
>>     disk0:asr9k-mini-p-4.1.2
>>     disk0:asr9k-doc-p-4.1.2
>>     disk0:asr9k-k9sec-p-4.1.2
>>     disk0:asr9k-mpls-p-4.1.2
>>     disk0:asr9k-mgbl-p-4.1.2
>>     disk0:asr9k-mcast-p-4.1.2
>>
>>
>> aaron
>>
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Jason Lixfeld
>> Sent: Thursday, March 14, 2013 5:09 PM
>> To: [email protected] NSP
>> Subject: Re: [c-nsp] whoa - asr9k wierd message AND 13 me3600's all
>> rebooted at once!!
>>
>> What XR version are you running?
>> Trident or Typhoon cards?
>> ME3600s all rebooted at the exact moment the LC crashed?
>> ME3600 crashes with errors/crashinfo?
>> OSPF is your IGP or IGP is something else and OSPF was inside a VRF
>> facing the CE?
>> Is BFD for IGP and/or BFD for BGP enabled?
>> BGP is straight BGP or MP-BGP to the ME3600s?
>> LDP between ASR and ME3600s?
>>
>> I don't have an answer for you, but there are some common elements on
>> my network based on the description you have provided here about your
>> network, so I'm asking probing questions to determine any other
>> similarities.
>>
>> --
>>
>> Sent from my mobile device
>>
>>
>> On 2013-03-14, at 5:35 PM, "Aaron" <[email protected]> wrote:
>>
>>> Y'all know anything about this?
>>>
>>> Something bad just happened in my network
>>>
>>> I have an asr9010 that just showed a 2/20 module fail and come back
>>> up. the pe-ce link on that card also showed ospf neighbor state
>>> bounce at that moment. AND that asr9010 is a route reflector for
>>> several of my pe's throughout my network.. Of those pe's (13)
>>> ME3600's running 15.3(1)S ALL REBOOTED!!!
>>>
>>> ..i have another me3600 running 15.3(1)S that is not running bgp that
>>> did not reboot
>>>
>>> ..i have several other me3600's running pre 15.3 (so 15.2.something)
>>> that are running similar config as the rebooted me's, which did NOT
>>> reboot
>>>
>>> Aaron
>>>
>>> _______________________________________________
>>> cisco-nsp mailing list [email protected]
>>> https://puck.nether.net/mailman/listinfo/cisco-nsp
>>> archive at http://puck.nether.net/pipermail/cisco-nsp/
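For what it's worth, the timing guess quoted in the thread (60-second keepalives, hold time of 3x that) can be sanity-checked with a few lines of Python. This is only a sketch under stated assumptions: the timers are at the 60/180 values Aaron mentions, the path to the RR is lost at the moment of the LC crash and never recovers, and the crash time is the syslog-server timestamp from the quoted logs. The function name hold_expiry_window is made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Tuple

# Assumed timer values, taken from the thread ("i think ... 60 seconds ... 180 seconds").
KEEPALIVE = timedelta(seconds=60)
HOLD_TIME = 3 * KEEPALIVE  # 180 seconds


def hold_expiry_window(failure: datetime) -> Tuple[datetime, datetime]:
    """Return the earliest and latest time a peer's hold timer could expire
    if the path to the neighbor is lost at `failure` and does not come back."""
    # The last keepalive was received at some point in the 60 s before the
    # failure, and the hold timer restarts from that arrival.
    earliest = failure - KEEPALIVE + HOLD_TIME
    latest = failure + HOLD_TIME
    return earliest, latest


if __name__ == "__main__":
    # Syslog-server timestamp of the DOUBLE_ECC_ERROR messages in the thread.
    crash = datetime(2013, 3, 14, 14, 22, 38)
    earliest, latest = hold_expiry_window(crash)
    print("BGP down expected between", earliest.time(), "and", latest.time())
```

Under those assumptions the sessions would drop two to three minutes after the crash, which is consistent with "a few minutes after 14:22:38". It says nothing about why the ME3600s reloaded.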
