Re: [c-nsp] Sporadic loss of LDP neighbor ...

2011-12-16 Thread Garry

On 12.12.2011 09:27, Mark Tinka wrote:
 On Monday, December 12, 2011 03:38:56 PM Garry wrote:
 
 Dec 11 22:59:31: %LDP-5-NBRCHG: LDP Neighbor [BB1]:0 is DOWN
 (Received error notification from peer: Holddown time expired)
 Dec 11 22:59:52: %LDP-5-NBRCHG: LDP Neighbor [BB3]:0 is DOWN
 (Discovery Hello Hold Timer expired)
 Dec 11 23:00:00: %LDP-5-NBRCHG: LDP Neighbor [BB3] is UP
 Dec 11 23:00:27: %LDP-5-NBRCHG: LDP Neighbor [BB1]:0 is UP
 
 Are you seeing high CPU utilization on the affected routers, even
 if transient?

I've seen high CPU before, but never in time to discern whether the
CPU was cause or effect ... just this afternoon I was able to catch
one of the outages in time to cross-check multiple places, mainly the
logs and CPU history, which clearly showed that the egg was there
before the hen - or rather, 100% CPU for ~2 min followed by the LDP
(and other) outages ... problem is, I can't yet pinpoint the cause of
the CPU load - I guess I will have to set up a cron job to pull "show
proc cpu sorted 5min" output every couple of minutes and check which
process is causing the load ... hopefully ... (even as I prepare to
drop the 7200s from the essential places, I also see 10% CPU at the
same time on the ASR routers, which is pretty high compared to the ~1%
they usually run at ... so I want to solve the cause of the problem,
not the effect ...)
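A rough sketch of the post-processing such a cron job could feed - this is a hypothetical helper, not tested against real 7200 output, and the column layout of "show processes cpu sorted" varies slightly between IOS releases, so the regex may need adjusting:

```python
# Hypothetical sketch: given one captured "show processes cpu sorted 5min"
# output, report the process with the highest 5-minute CPU share.
# The sample below mimics the usual IOS layout; verify against your release.
import re

def top_process_5min(show_output):
    """Return (process_name, five_min_pct) for the busiest process, or None."""
    best = None
    for line in show_output.splitlines():
        # Typical process line:
        #  PID Runtime(ms)  Invoked  uSecs   5Sec   1Min   5Min TTY Process
        #   23     1234567   890123    138  62.2%  60.7%  58.9%   0 IP Input
        m = re.match(
            r"\s*\d+\s+\d+\s+\d+\s+\d+\s+"
            r"[\d.]+%\s+[\d.]+%\s+([\d.]+)%\s+\d+\s+(.*\S)",
            line,
        )
        if m:
            pct = float(m.group(1))
            if best is None or pct > best[1]:
                best = (m.group(2), pct)
    return best

sample = """\
CPU utilization for five seconds: 99%/12%; one minute: 97%; five minutes: 95%
 PID Runtime(ms)  Invoked  uSecs   5Sec   1Min   5Min TTY Process
  23     1234567   890123    138  62.2%  60.7%  58.9%   0 IP Input
  89       56789    12345    460   4.1%   3.9%   3.5%   0 LDP Main Process
"""

print(top_process_5min(sample))  # ('IP Input', 58.9)
```

Running this over the files the cron job collects should make the recurring offender stand out across spikes.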

Tnx, garry
___
cisco-nsp mailing list  cisco-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/cisco-nsp
archive at http://puck.nether.net/pipermail/cisco-nsp/


Re: [c-nsp] Sporadic loss of LDP neighbor ...

2011-12-16 Thread Mark Tinka
On Friday, December 16, 2011 10:31:21 PM Garry wrote:

 I've seen high CPU before, but never in time to discern
 whether the CPU was cause or effect ... just this
 afternoon I was able to catch one of the outages in time
 to cross-check multiple places, mainly the logs and cpu
 history, which clearly showed that the egg was there
 before the hen - or rather, 100% cpu for ~2min followed
 by the LDP (and other) outages ... problem is, I can't
 yet pin-point the cause of the CPU load - guess I will
 have to set up a cron job to pull show proc cpu sort
 5min outputs every couple minutes and check which
 process is the cause for the cpu load ... hopefully ...
 (even as I prepare to drop the 7200's from the essential
 places, I even see 10% cpu at the same time on the ASR
 routers, which is pretty high compared to the ~1% they
 usually have ... so I want to solve the problem cause,
 not the effect ...)

Yes, the LDP session failure hints toward a high CPU state
moments before the LDP sessions go down. Other (routing)
protocols are also likely to be affected.

In our case, we've seen this before for two reasons:

1. Removal and insertion of compact flash card
   causes high CPU utilization.

2. Heavy QoS on the 7200.


As you say, figuring out what's killing your CPU will likely 
be the solution to taming this issue.
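One way to catch the hog in the act without babysitting the router is an EEM applet that snapshots the process table whenever CPU crosses a threshold. The sketch below is untested; the SNMP OID shown is cpmCPUTotal1minRev from CISCO-PROCESS-MIB, and both the OID index and the EEM syntax should be verified against the IOS release running on the 7200s:

```
event manager applet CPU-SPIKE
 event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.7.1 get-type exact entry-op ge entry-val 90 poll-interval 30
 action 1.0 cli command "enable"
 action 2.0 cli command "show processes cpu sorted 5sec | append flash:cpu-spike.txt"
 action 3.0 syslog msg "CPU spike - process listing appended to flash:cpu-spike.txt"
```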

Mark.



Re: [c-nsp] Sporadic loss of LDP neighbor ...

2011-12-13 Thread Vitkovsky, Adam
Hi,

In this case I'd be careful with these features.

The advantage is that you won't lose the label mappings from the
particular neighbor if the direct link fails, as the LDP hellos would
be exchanged between the loopbacks and can be routed around the failed
link, same as the TCP session would. But in this particular case this
can mask the underlying issues that have not yet been identified.

The 7200 definitely supports both of these commands.

adam
 

-Original Message-
From: cisco-nsp-boun...@puck.nether.net 
[mailto:cisco-nsp-boun...@puck.nether.net] On Behalf Of Pshem Kowalczyk
Sent: Monday, December 12, 2011 9:37 PM
To: Garry
Cc: cisco-nsp@puck.nether.net
Subject: Re: [c-nsp] Sporadic loss of LDP neighbor ...

Hi,

On 13 December 2011 06:52, Garry g...@gmx.de wrote:
 On 12.12.2011 09:16, Robert Raszuk wrote:
 Garry,

 Do you see the same with mpls ldp targeted-sessions enabled (even for
 normal LDP p2p peers) ? At least this is something I would try first ...

 Neither the 7200s nor the ASR support this command ...

I'm not sure if those commands are supported (and have no access to
7200 or ASR1k right now to verify), but try this:

mpls ldp neighbor 200.200.200.200 targeted
(use the loopback address of the peer, this works for individual peers)

or

mpls ldp session protection

(this enables targeted hellos for all ldp neighbours - it has to be
enabled on both peers)

To see if it's working use show mpls ldp nei detail, you should see
something like this:

Peer LDP Ident: 10.16.16.16:0; Local LDP Ident 10.15.15.15:0
TCP connection: 10.16.16.16.11013 -> 10.15.15.15.646
State: Oper; Msgs sent/rcvd: 53/51; Downstream; Last TIB rev sent 74
Up time: 00:11:32; UID: 1; Peer Id 0;
LDP discovery sources:
  Targeted Hello 10.15.15.15 -> 10.16.16.16, active, passive;
- this is the targeted session
    holdtime: infinite, hello interval: 10000 ms
Addresses bound to peer LDP Ident:
  10.0.0.2 10.16.16.16 10.101.101.101 11.0.0.1
Peer holdtime: 180000 ms; KA interval: 60000 ms; Peer state: estab
Clients: Dir Adj Client
LDP Session Protection enabled, state: Protecting
duration: infinite

The "active, passive" means that the router is both sending and
accepting targeted hellos.

kind regards
Pshem



Re: [c-nsp] Sporadic loss of LDP neighbor ...

2011-12-12 Thread Robert Raszuk

Garry,

Do you see the same with mpls ldp targeted-sessions enabled (even for 
normal LDP p2p peers) ? At least this is something I would try first ...


Thx,
R.


Hi *,

I've been fighting this problem for quite a while, need some ideas from
the collective intelligence ...

One of our backbone locations has multiple routers that have worked
fine for quite a while ... during the last couple of months, we've been
experiencing sporadic failures in the LAN for which I've not been able
to pin-point any logical reason ...

Basic setup is this ... currently, three 7200 routers (2x NPE300 VXR
[BB1 & 2], 1x NPE150 [BB3] for a couple of L2TP wireless links). We
added an ASR1002-F [Core1] as the new primary router for the location
about a year ago (running a 300M link to our core uplink and a 1G dark
fiber link to another backbone location). All of our backbone runs with
MPLS enabled (multiple VRFs for MPLS-VPNs). Everything was fine until
something like 2-3 months ago (I don't have an exact date, otherwise it
might be easier to correlate with other changes in the configs or
infrastructure). Then the sporadic losses of the LAN interconnections
started, like this: (log excerpt from BB2)

Dec 11 22:59:31: %LDP-5-NBRCHG: LDP Neighbor [BB1]:0 is DOWN (Received
error notification from peer: Holddown time expired)
Dec 11 22:59:52: %LDP-5-NBRCHG: LDP Neighbor [BB3]:0 is DOWN (Discovery
Hello Hold Timer expired)
Dec 11 23:00:00: %LDP-5-NBRCHG: LDP Neighbor [BB3] is UP
Dec 11 23:00:27: %LDP-5-NBRCHG: LDP Neighbor [BB1]:0 is UP

These interruptions (at least judging by the timestamps between down
and up) sometimes last only 3-4 seconds; the BB1 outage above, at
almost a minute, is about the longest I've seen to date. Of course this
disrupts routing to a certain degree ... sometimes even badly enough to
take down iBGP/eBGP multihop connections.

Now, at two other backbone locations we have a more or less identical
setup, without any of these problems. I've already compared interface
configs, but everything seems identical (apart from IP addresses, of
course). The problem here is that it's impossible to analyze any of
the causes: for one, the problems occur without any predictable
interval, and they're too short to react to the loss of connection in
time ... I've tried activating some debugs on the router, but couldn't
get any helpful information out of it (at least nothing I could
identify).

We've recently added an ASR1001 to the site, which (together with the
1002F) will be used to replace two 7200 routers, and have already moved
about half of the existing VLANs of the site (~20 of the 40+) to the
ASRs. That didn't change much, though the interval of the interruptions
went to maybe once every 2 or 3 days (from 1-2 per day). One thing I
did notice is that it's mostly the BB1 router that is involved, with
BB2 also losing its LDP connection at the same time in maybe two out of
three cases, and BB3 usually not showing any problems reaching either
of the Core routers. BB1 and BB2 will also lose connectivity to each
other most of the time, albeit not always. In attempting to locate the
cause, we already moved BB1 to the same switch as Core1/2, with no
results. Needless to say, there are no disruptions on Layer 2, at least
not as far as can be seen in the logs.

If these problems had manifested themselves when we installed the
first ASR, I'd say it's something in the IOS versions that might be
incompatible, but everything ran fine for something like 9 months, so
that shouldn't be it. I've gone through config diffs between 4-6
months ago and now, but couldn't find any changes that should break
MPLS on the LAN layer.

Anybody have any idea at what might be causing this, or what I should
check into to get to the cause of this problem?

Here's some excerpts from the router configs:

BB1:
interface GigabitEthernet3/0
  mtu 1500
  no ip redirects
  ip route-cache flow
  negotiation auto
  mpls label protocol ldp
  tag-switching mtu 1520
  tag-switching ip

BB2: identical settings

Core1:
interface GigabitEthernet0/0/0
  no ip redirects
  ip flow ingress
  negotiation auto
  mpls ip
  mpls label protocol ldp
  mpls mtu 1520

Thanks, Garry






Re: [c-nsp] Sporadic loss of LDP neighbor ...

2011-12-12 Thread Mark Tinka
On Monday, December 12, 2011 03:38:56 PM Garry wrote:

 Dec 11 22:59:31: %LDP-5-NBRCHG: LDP Neighbor [BB1]:0 is DOWN
 (Received error notification from peer: Holddown time expired)
 Dec 11 22:59:52: %LDP-5-NBRCHG: LDP Neighbor [BB3]:0 is DOWN
 (Discovery Hello Hold Timer expired)
 Dec 11 23:00:00: %LDP-5-NBRCHG: LDP Neighbor [BB3] is UP
 Dec 11 23:00:27: %LDP-5-NBRCHG: LDP Neighbor [BB1]:0 is UP

Are you seeing high CPU utilization on the affected routers, 
even if transient?

Cheers,

Mark.



Re: [c-nsp] Sporadic loss of LDP neighbor ...

2011-12-12 Thread Vitkovsky, Adam
Hi,

So the section of the core you have issues with is like a triangle
between the three 7200s, right? Are the 7200s connected with GigE
back-to-back or via a switch? And how is the ASR1002-F connected to
this setup - is it connected to BB1 and BB2 to replace BB3?

You said you ran a debug - were you lucky enough to capture the failure?

From the log output you posted:

Reading the first line, it appears that BB2 received an error
notification from BB1 saying that BB1 is terminating its TCP connection
to BB2 because it didn't get any keepalives during the holddown time.
(So either BB2 stopped sending keepalives for the LDP session to BB1,
or BB1 just stopped receiving them from BB2 - to figure out which is
the case, a debug would be helpful.)
LDP keepalives on Cisco are by default sent from LDP router-id to LDP
router-id every 60 sec with a holdtime of 180 sec.

The second line indicates that BB2 has terminated the TCP connection to
BB3 because BB2 didn't get any LDP hello messages from BB3 during the
Hello Hold Timer.
(So either BB3 stopped sending hellos on the link to BB2, or BB2 just
stopped receiving them from BB3 - once again, a debug would be helpful.)
LDP UDP hellos are by default sent from the interface IP address to
224.0.0.2 with a hello interval of 5 sec and a holdtime of 15 sec.
This can be changed using the session-protection or targeted-hellos
features, in which case the LDP UDP hellos are sent from LDP router-id
to LDP router-id with a default hello interval of 10 sec and an
infinite holdtime.
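For reference, these are the global knobs involved - shown here with their usual default values as a sketch only; command availability and exact syntax should be verified on the 7200/ASR images in question:

```
! Link hello timers (defaults: interval 5 s, holdtime 15 s):
mpls ldp discovery hello interval 5
mpls ldp discovery hello holdtime 15
!
! Targeted-hello timers used by session protection / targeted sessions:
mpls ldp discovery targeted-hello interval 10
mpls ldp discovery targeted-hello holdtime 90
!
! Keep sessions up across link flaps (enable on both peers):
mpls ldp session protection
```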





adam

-Original Message-
From: cisco-nsp-boun...@puck.nether.net 
[mailto:cisco-nsp-boun...@puck.nether.net] On Behalf Of Garry
Sent: Monday, December 12, 2011 8:39 AM
To: cisco-nsp@puck.nether.net
Subject: [c-nsp] Sporadic loss of LDP neighbor ...


Re: [c-nsp] Sporadic loss of LDP neighbor ...

2011-12-12 Thread Garry
On 12.12.2011 09:16, Robert Raszuk wrote:
 Garry,
 
 Do you see the same with mpls ldp targeted-sessions enabled (even for
 normal LDP p2p peers) ? At least this is something I would try first ...

Neither the 7200s nor the ASR support this command ...



Re: [c-nsp] Sporadic loss of LDP neighbor ...

2011-12-12 Thread Pshem Kowalczyk
Hi,

On 13 December 2011 06:52, Garry g...@gmx.de wrote:
 On 12.12.2011 09:16, Robert Raszuk wrote:
 Garry,

 Do you see the same with mpls ldp targeted-sessions enabled (even for
 normal LDP p2p peers) ? At least this is something I would try first ...

 Neither the 7200s nor the ASR support this command ...

I'm not sure if those commands are supported (and have no access to
7200 or ASR1k right now to verify), but try this:

mpls ldp neighbor 200.200.200.200 targeted
(use the loopback address of the peer, this works for individual peers)

or

mpls ldp session protection

(this enables targeted hellos for all ldp neighbours - it has to be
enabled on both peers)

To see if it's working use show mpls ldp nei detail, you should see
something like this:

Peer LDP Ident: 10.16.16.16:0; Local LDP Ident 10.15.15.15:0
TCP connection: 10.16.16.16.11013 -> 10.15.15.15.646
State: Oper; Msgs sent/rcvd: 53/51; Downstream; Last TIB rev sent 74
Up time: 00:11:32; UID: 1; Peer Id 0;
LDP discovery sources:
  Targeted Hello 10.15.15.15 -> 10.16.16.16, active, passive;
- this is the targeted session
    holdtime: infinite, hello interval: 10000 ms
Addresses bound to peer LDP Ident:
  10.0.0.2 10.16.16.16 10.101.101.101 11.0.0.1
Peer holdtime: 180000 ms; KA interval: 60000 ms; Peer state: estab
Clients: Dir Adj Client
LDP Session Protection enabled, state: Protecting
duration: infinite

The "active, passive" means that the router is both sending and
accepting targeted hellos.
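If you end up checking this on several boxes, a small script can grep the pasted output for the healthy state. This is a hypothetical helper keyed to the strings shown above; the exact wording of the output may differ between IOS releases:

```python
# Hypothetical check: given "show mpls ldp neighbor detail" output, verify
# that a targeted adjacency is both sending and accepting hellos
# ("active, passive") and that session protection is actually protecting.
def targeted_session_ok(detail_output):
    has_targeted = False
    protected = False
    for line in detail_output.splitlines():
        line = line.strip()
        if line.startswith("Targeted Hello") and "active" in line and "passive" in line:
            has_targeted = True
        if line.startswith("LDP Session Protection enabled"):
            protected = True
    return has_targeted and protected

sample = """\
Peer LDP Ident: 10.16.16.16:0; Local LDP Ident 10.15.15.15:0
LDP discovery sources:
  Targeted Hello 10.15.15.15 -> 10.16.16.16, active, passive;
LDP Session Protection enabled, state: Protecting
"""

print(targeted_session_ok(sample))  # True
```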

kind regards
Pshem