Re: [c-nsp] Sporadic loss of LDP neighbor ...
On 12.12.2011 09:27, Mark Tinka wrote:
> On Monday, December 12, 2011 03:38:56 PM Garry wrote:
>> Dec 11 22:59:31: %LDP-5-NBRCHG: LDP Neighbor [BB1]:0 is DOWN (Received error notification from peer: Holddown time expired)
>> Dec 11 22:59:52: %LDP-5-NBRCHG: LDP Neighbor [BB3]:0 is DOWN (Discovery Hello Hold Timer expired)
>> Dec 11 23:00:00: %LDP-5-NBRCHG: LDP Neighbor [BB3] is UP
>> Dec 11 23:00:27: %LDP-5-NBRCHG: LDP Neighbor [BB1]:0 is UP
>
> Are you seeing high CPU utilization on the affected routers, even if transient?

I've seen high CPU before, but never in time to discern whether the CPU was the cause or the effect ... just this afternoon I was able to catch one of the outages in time to cross-check multiple places, mainly the logs and the CPU history, which clearly showed that the egg was there before the hen - or rather, 100% CPU for ~2 minutes, followed by the LDP (and other) outages.

The problem is, I can't yet pinpoint the cause of the CPU load. I guess I will have to set up a cron job to pull "show proc cpu sorted 5min" output every couple of minutes and check which process is causing the load ... hopefully ...

(Even as I prepare to drop the 7200s from the essential places, I also see 10% CPU at the same time on the ASR routers, which is pretty high compared to the ~1% they usually run at ... so I want to solve the cause of the problem, not the effect ...)
Tnx, garry

___
cisco-nsp mailing list
cisco-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/cisco-nsp
archive at http://puck.nether.net/pipermail/cisco-nsp/
Re: [c-nsp] Sporadic loss of LDP neighbor ...
On Friday, December 16, 2011 10:31:21 PM Garry wrote:
> I've seen high CPU before, but never in time to discern whether the CPU was the cause or the effect ... 100% CPU for ~2 minutes, followed by the LDP (and other) outages ... guess I will have to set up a cron job to pull "show proc cpu sorted 5min" output every couple of minutes and check which process is the cause ... so I want to solve the cause of the problem, not the effect ...

Yes, the LDP session failure hints toward a high-CPU state moments before the LDP sessions go down. Other (routing) protocols are also likely to be affected.

In our case, we've seen this before for two reasons:

1. Removal and insertion of the compact flash card causes high CPU utilization.
2. Heavy QoS on the 7200.

As you say, figuring out what's killing your CPU will likely be the solution to taming this issue.

Mark.
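An on-box alternative to the cron-job approach Garry mentions is an Embedded Event Manager applet that snapshots the process list only when the CPU actually spikes. A rough sketch (the applet name, 80% threshold, poll interval, and output file are all illustrative, and the OID is cpmCPUTotal5sec from CISCO-PROCESS-MIB - verify the instance index on the target platform):

```
! Sketch only: append a timestamped, sorted process list to a file
! whenever the polled 5-second CPU value reaches 80%.
event manager applet HIGH-CPU-CAPTURE
 event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.3.1 get-type exact entry-op ge entry-val 80 poll-interval 10
 action 1.0 cli command "enable"
 action 2.0 cli command "show clock | append disk0:cpu-history.txt"
 action 3.0 cli command "show process cpu sorted 5min | append disk0:cpu-history.txt"
```

This way the capture is taken inside the 2-minute spike window rather than on a fixed external schedule, which matters when the events are only seconds long.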
Re: [c-nsp] Sporadic loss of LDP neighbor ...
Hi

In this case I'd be careful with these features. Even though the advantage is that you won't lose the label mappings from the particular neighbour if the direct link fails (since the LDP hellos would be exchanged between the loopbacks and can be routed around the failed link, same as the TCP session would), in this particular case this can mask the underlying issues that have not yet been identified.

The 7200 definitely supports both of these commands.

adam

-----Original Message-----
From: cisco-nsp-boun...@puck.nether.net [mailto:cisco-nsp-boun...@puck.nether.net] On Behalf Of Pshem Kowalczyk
Sent: Monday, December 12, 2011 9:37 PM
To: Garry
Cc: cisco-nsp@puck.nether.net
Subject: Re: [c-nsp] Sporadic loss of LDP neighbor ...

Hi,

On 13 December 2011 06:52, Garry g...@gmx.de wrote:
> On 12.12.2011 09:16, Robert Raszuk wrote:
>> Garry,
>> Do you see the same with mpls ldp targeted-sessions enabled (even for normal LDP p2p peers)? At least this is something I would try first ...
>
> Neither the 7200s nor the ASR support this command ...
I'm not sure if those commands are supported (and have no access to a 7200 or ASR1k right now to verify), but try this:

  mpls ldp neighbor 200.200.200.200 targeted

(use the loopback address of the peer; this works for individual peers), or

  mpls ldp session protection

(this enables targeted hellos for all LDP neighbours - it has to be enabled on both peers).

To see if it's working, use "show mpls ldp neighbor detail"; you should see something like this:

    Peer LDP Ident: 10.16.16.16:0; Local LDP Ident 10.15.15.15:0
        TCP connection: 10.16.16.16.11013 - 10.15.15.15.646
        State: Oper; Msgs sent/rcvd: 53/51; Downstream; Last TIB rev sent 74
        Up time: 00:11:32; UID: 1; Peer Id 0
        LDP discovery sources:
          Targeted Hello 10.15.15.15 -> 10.16.16.16, active, passive;   <- this is the targeted session
            holdtime: infinite, hello interval: 10000 ms
        Addresses bound to peer LDP Ident:
          10.0.0.2        10.16.16.16     10.101.101.101  11.0.0.1
        Peer holdtime: 180000 ms; KA interval: 60000 ms; Peer state: estab
        Clients: Dir Adj Client
        LDP Session Protection enabled, state: Protecting
            duration: infinite

The "active, passive" means that the router is both sending and accepting targeted hellos.

kind regards
Pshem
Re: [c-nsp] Sporadic loss of LDP neighbor ...
Garry,

Do you see the same with mpls ldp targeted-sessions enabled (even for normal LDP p2p peers)? At least this is something I would try first ...

Thx,
R.

> Hi *,
>
> I've been fighting this problem for quite a while, need some ideas from the collective intelligence ... One of our backbone locations has multiple routers that have worked fine for quite a while ... during the last couple of months, we've been experiencing sporadic failures in the LAN which I've not been able to pin-point any logical reason for ...
>
> Basic setup is this ... currently, three 7200 routers (2x NPE300 VXR [BB1+2], 1x NPE150 [BB3] for a couple of L2TP wireless links). We added an ASR1002F [Core1] as the new primary router for the location about a year ago (running a 300M link to our core uplink and a 1G dark-fiber link to another backbone location). All of our backbone runs with MPLS enabled (multiple VRFs for MPLS-VPNs).
>
> Everything was fine until something like 2-3 months ago (I don't have an exact date, otherwise it might be easier to correlate with other changes in the configs or infrastructure). Then it started with sporadic losses of the LAN interconnections, like this (log excerpt from BB2):
>
> Dec 11 22:59:31: %LDP-5-NBRCHG: LDP Neighbor [BB1]:0 is DOWN (Received error notification from peer: Holddown time expired)
> Dec 11 22:59:52: %LDP-5-NBRCHG: LDP Neighbor [BB3]:0 is DOWN (Discovery Hello Hold Timer expired)
> Dec 11 23:00:00: %LDP-5-NBRCHG: LDP Neighbor [BB3] is UP
> Dec 11 23:00:27: %LDP-5-NBRCHG: LDP Neighbor [BB1]:0 is UP
>
> These interruptions (at least the timestamps between down and up) sometimes only last 3-4 seconds; the BB1 one above, at almost a minute, is just about the longest I've seen to date. Of course this disrupts routing to a certain degree ... sometimes even badly enough to take down iBGP/eBGP multihop connections.
>
> Now, at two other backbone locations we have more or less the identical setup, without any of these problems. I've already compared interface configs, but everything seems identical (apart from IP addresses, of course). The problem here is that it's impossible to analyze the cause, as the problems occur without any predictable interval, and they're too short to react to in time ... I've tried activating some debugs on the router, but couldn't get any helpful information out of it (at least nothing I could identify).
>
> We recently added an ASR1001 to the site, which (together with the 1002F) will be used to replace two of the 7200 routers, and have already moved about half of the existing VLANs of the site (~20 of the 40+) to the ASRs. That didn't change much, though the interval of the interruptions went down to maybe once every 2 or 3 days (from 1-2 per day).
>
> One thing I did notice is that it's mostly the BB1 router that is involved, with BB2 also losing its LDP connection at the same time 1-2 times out of three, and BB3 usually not showing any problems reaching either of the Core routers. BB1 and BB2 will also lose connectivity to each other most of the time, albeit not always. In attempting to locate the cause, we already moved BB1 to the same switch as Core12, with no results. Needless to say, there are no disruptions on Layer 2, at least not as far as can be seen in the logs.
>
> If these problems had manifested themselves when we installed the first ASR, I'd say it's something in the IOS versions that might be incompatible, but everything ran fine for something like 9 months, so that shouldn't be it. I've tried going through config diffs from 4-6 months ago and now, but couldn't find any changes that should break MPLS on the LAN layer.
>
> Anybody have any idea what might be causing this, or what I should check into to get to the cause of this problem?
>
> Here are some excerpts from the router configs:
>
> BB1:
>
> interface GigabitEthernet3/0
>  mtu 1500
>  no ip redirects
>  ip route-cache flow
>  negotiation auto
>  mpls label protocol ldp
>  tag-switching mtu 1520
>  tag-switching ip
>
> BB2: identical settings
>
> Core1:
>
> interface GigabitEthernet0/0/0
>  no ip redirects
>  ip flow ingress
>  negotiation auto
>  mpls ip
>  mpls label protocol ldp
>  mpls mtu 1520
>
> Thanks, Garry
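As an aside, the BB1/BB2 excerpt still uses the legacy tag-switching keywords, while Core1 uses the newer mpls syntax. The two forms should be functionally equivalent, and on a reasonably recent 12.x image the 7200 interface could be expressed in the newer form, roughly (a sketch only, using the interface name from the excerpt):

```
interface GigabitEthernet3/0
 mtu 1500
 no ip redirects
 ip route-cache flow
 negotiation auto
 mpls label protocol ldp
 mpls mtu 1520
 mpls ip
```

Normalizing the syntax on all boxes also makes the config diffs between the working and the failing sites easier to compare.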
Re: [c-nsp] Sporadic loss of LDP neighbor ...
On Monday, December 12, 2011 03:38:56 PM Garry wrote:
> Dec 11 22:59:31: %LDP-5-NBRCHG: LDP Neighbor [BB1]:0 is DOWN (Received error notification from peer: Holddown time expired)
> Dec 11 22:59:52: %LDP-5-NBRCHG: LDP Neighbor [BB3]:0 is DOWN (Discovery Hello Hold Timer expired)
> Dec 11 23:00:00: %LDP-5-NBRCHG: LDP Neighbor [BB3] is UP
> Dec 11 23:00:27: %LDP-5-NBRCHG: LDP Neighbor [BB1]:0 is UP

Are you seeing high CPU utilization on the affected routers, even if transient?

Cheers,
Mark.
Re: [c-nsp] Sporadic loss of LDP neighbor ...
Hi,

So the section of the core you have issues with is like a triangle between the three 7200s, right?

- Are the 7200s connected with GigE back-to-back, or via a switch?
- And how is the ASR1002F connected to this setup, please? Is it connected to BB1 and BB2 to replace BB3?

You said you ran a debug - have you been lucky enough to capture the failure?

From the log output you posted:

Reading the first line, it appears that BB2 received an error notification from BB1 saying that BB1 is terminating its TCP connection to BB2 because it didn't get any LDP keepalives (carried over the TCP session) during the holddown time. So either BB2 stopped sending keepalives for the LDP session to BB1, or BB1 just stopped receiving them from BB2 - to figure out which is the case, a debug would be helpful. LDP keepalives on Cisco are by default sent from LDP router-id to LDP router-id every 60 sec, with a holdtime of 180 sec.

The second line indicates that BB2 has terminated the TCP connection to BB3 because BB2 didn't get any LDP hello messages from BB3 during the Hello Hold Timer. So either BB3 stopped sending hellos on the link to BB2, or BB2 just stopped receiving them from BB3 - once again, a debug would be helpful. LDP UDP hellos are by default sent from the interface IP address to 224.0.0.2, with a hello interval of 5 sec and a holdtime of 15 sec. This can be changed using the session protection or targeted hellos features, in which case the LDP UDP hellos are sent from LDP router-id to LDP router-id, with a default hello interval of 10 sec and an infinite holdtime.

adam

-----Original Message-----
From: cisco-nsp-boun...@puck.nether.net [mailto:cisco-nsp-boun...@puck.nether.net] On Behalf Of Garry
Sent: Monday, December 12, 2011 8:39 AM
To: cisco-nsp@puck.nether.net
Subject: [c-nsp] Sporadic loss of LDP neighbor ...

> Hi *,
>
> I've been fighting this problem for quite a while, need some ideas from the collective intelligence ... One of our backbone locations has multiple routers that have worked fine for quite a while ...
> during the last couple of months, we've been experiencing sporadic failures in the LAN which I've not been able to pin-point any logical reason for ...
>
> [snip - rest of the original posting, quoted in full earlier in the thread]
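For reference, the default timers Adam describes map onto these global IOS knobs. The values shown are the usual defaults, so the lines are illustrative rather than a recommended change; note that the link-hello and targeted-hello timers are configured by separate commands:

```
mpls ldp discovery hello interval 5
mpls ldp discovery hello holdtime 15
mpls ldp discovery targeted-hello interval 10
mpls ldp discovery targeted-hello holdtime 90
mpls ldp holdtime 180
```

Raising the discovery holdtime can ride out a CPU spike of a few seconds, but as noted above, it treats the symptom and can mask the underlying problem.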
Re: [c-nsp] Sporadic loss of LDP neighbor ...
On 12.12.2011 09:16, Robert Raszuk wrote:
> Garry,
> Do you see the same with mpls ldp targeted-sessions enabled (even for normal LDP p2p peers)? At least this is something I would try first ...

Neither the 7200s nor the ASR support this command ...
Re: [c-nsp] Sporadic loss of LDP neighbor ...
Hi,

On 13 December 2011 06:52, Garry g...@gmx.de wrote:
> On 12.12.2011 09:16, Robert Raszuk wrote:
>> Garry,
>> Do you see the same with mpls ldp targeted-sessions enabled (even for normal LDP p2p peers)? At least this is something I would try first ...
>
> Neither the 7200s nor the ASR support this command ...

I'm not sure if those commands are supported (and have no access to a 7200 or ASR1k right now to verify), but try this:

  mpls ldp neighbor 200.200.200.200 targeted

(use the loopback address of the peer; this works for individual peers), or

  mpls ldp session protection

(this enables targeted hellos for all LDP neighbours - it has to be enabled on both peers).

To see if it's working, use "show mpls ldp neighbor detail"; you should see something like this:

    Peer LDP Ident: 10.16.16.16:0; Local LDP Ident 10.15.15.15:0
        TCP connection: 10.16.16.16.11013 - 10.15.15.15.646
        State: Oper; Msgs sent/rcvd: 53/51; Downstream; Last TIB rev sent 74
        Up time: 00:11:32; UID: 1; Peer Id 0
        LDP discovery sources:
          Targeted Hello 10.15.15.15 -> 10.16.16.16, active, passive;   <- this is the targeted session
            holdtime: infinite, hello interval: 10000 ms
        Addresses bound to peer LDP Ident:
          10.0.0.2        10.16.16.16     10.101.101.101  11.0.0.1
        Peer holdtime: 180000 ms; KA interval: 60000 ms; Peer state: estab
        Clients: Dir Adj Client
        LDP Session Protection enabled, state: Protecting
            duration: infinite

The "active, passive" means that the router is both sending and accepting targeted hellos.

kind regards
Pshem