Thank you very much for this detailed troubleshooting procedure.
 
One of the commands did show errors:
 
sh control errors fab count
 
SLOT 6 :
CellDrop (lane0..3)  765    765    765    765    
 
         CRC   CRC   CRC   CRC   CRC        LOS   LOS   LOS   LOS   LOS  
Counter XBAR0 XBAR1 XBAR2 XBAR3 XBAR4      XBAR0 XBAR1 XBAR2 XBAR3 XBAR4 
Lane0   33601 0     0     0     0          0     0     0     0     0     
Lane1   15058 0     0     0     0          0     0     0     0     0     
Lane2   4509  0     0     0     0          0     0     0     0     0     
Lane3   1619  0     0     0     0          0     0     0     0     0     

So this once again points to something wrong with CSC0. I will replace it to 
see if the problem goes away.
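
For anyone who wants to read that table with a script rather than by eye, here is a minimal Python sketch. It assumes the column layout shown above (five CRC columns followed by five LOS columns on each lane row). The function names parse_fabric_counters and suspect_xbars are hypothetical and not part of any Cisco tooling.

```python
import re

# Sample lane rows copied from the output above.
SAMPLE = """\
Counter XBAR0 XBAR1 XBAR2 XBAR3 XBAR4      XBAR0 XBAR1 XBAR2 XBAR3 XBAR4
Lane0   33601 0     0     0     0          0     0     0     0     0
Lane1   15058 0     0     0     0          0     0     0     0     0
Lane2   4509  0     0     0     0          0     0     0     0     0
Lane3   1619  0     0     0     0          0     0     0     0     0
"""

def parse_fabric_counters(text):
    """Sum CRC and LOS counters per XBAR column across all lane rows."""
    totals = {"CRC": [0] * 5, "LOS": [0] * 5}
    for line in text.splitlines():
        fields = line.split()
        # Assumption: a lane row is "LaneN" followed by exactly 10 integers.
        if len(fields) != 11 or not re.fullmatch(r"Lane\d+", fields[0]):
            continue
        values = [int(v) for v in fields[1:]]
        for i in range(5):
            totals["CRC"][i] += values[i]
            totals["LOS"][i] += values[5 + i]
    return {kind: {f"XBAR{i}": n for i, n in enumerate(vals)}
            for kind, vals in totals.items()}

def suspect_xbars(totals):
    """Return (kind, xbar) pairs whose summed counter is non-zero."""
    return [(kind, xbar) for kind, per_xbar in totals.items()
            for xbar, n in per_xbar.items() if n]

totals = parse_fabric_counters(SAMPLE)
print(suspect_xbars(totals))
```

With the counters above, only the CRC column for XBAR0 is non-zero, which is the observation that leads me to suspect CSC0.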
 
 
 
Regards,
 
Antonio Soares, CCIE #18473 (R&S)
[email protected]
 

________________________________

From: Aaron [mailto:[email protected]] 
Sent: Monday, 16 November 2009 16:19
To: Antonio Soares
Cc: Leonardo Gama Souza; [email protected]
Subject: Re: [c-nsp] FABRIC-3-ERR_HANDLE


It is normal to have a CSC in standby mode. If something goes wrong with the 
other CSC, it takes over.




Step 1 - Gather data before making any changes

                        term length 0    - so you don’t have to hit Enter to page through output

                        show log

                        show tech

                        show monitor event-trace fab

                        show monitor event-trace agent-ctrl

                        show monitor event-trace board_mgr

                        show monitor event-trace lci

                        execute-on all show controllers fia (x5 times or so)

                        show controllers errors fabric counters (x5 times or so)

                        show controllers errors (x5 times or so)

                        show controllers xbar (x5 times or so)

                        show controllers sca (x5 times or so)

                        show controllers clock

                        show controllers fab-clk
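
The collection list above can be sketched as a small Python helper that expands the "x5 times or so" entries into a flat command sequence. This is only an illustration: the names ONCE, REPEATED, TAIL and build_plan are hypothetical, the repeat count of 5 is taken from the note above, and how the commands are actually sent to the router (console, SSH, a script) is left out.

```python
# Commands run once at the start of the capture.
ONCE = [
    "term length 0",
    "show log",
    "show tech",
    "show monitor event-trace fab",
    "show monitor event-trace agent-ctrl",
    "show monitor event-trace board_mgr",
    "show monitor event-trace lci",
]

# Commands repeated so that incrementing counters become visible.
REPEATED = [
    "execute-on all show controllers fia",
    "show controllers errors fabric counters",
    "show controllers errors",
    "show controllers xbar",
    "show controllers sca",
]

# Commands run once at the end.
TAIL = ["show controllers clock", "show controllers fab-clk"]

def build_plan(repeats=5):
    """Return the full ordered list of commands for Step 1."""
    plan = list(ONCE)
    for cmd in REPEATED:
        plan.extend([cmd] * repeats)
    plan.extend(TAIL)
    return plan

for cmd in build_plan():
    print(cmd)
```

Repeating each counter command back to back keeps the samples for one command together, which makes it easier to see whether a counter is incrementing.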

 

 

Step 2 - Determine if the issue is with a single or multiple slots, including 
the RP slots

Step 3 - Check the location of the primary clock scheduler, whether both CSCs are 
active (from show controllers clock), and the number of SFCs. If there is only 1 CSC, 
troubleshoot the missing CSC first. Ensure that you will have 4 active fabric cards 
before OIRing a card, since line cards may go out of service due to lack of fabric BW.

Step 4 - CRC- and LOS errors in control path from CSC to SFC cards

Explanation 

From show controllers xbar: on a 120XX chassis look at the Interrupt status field; 
on 124XX and 128XX, look at the Control LOS status and Control CRC error fields. 
If 0, then go to step 5.

Check to see which card is primary from show controllers clock and if both are 
present.

If incrementing and the error is on all fabric cards, then OIR the primary CSC

If incrementing and the error is on only 1 fabric card, then OIR that fabric card

If show controllers xbar does not show more errors, then the issue was seating; 
otherwise RMA the card
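
The Step 4 decision can be sketched in Python as follows. This is a rough illustration under stated assumptions: the inputs are two samples of a per-fabric-card control-path error count (CRC plus LOS) that you have already extracted from show controllers xbar by hand, and the card names and the function name step4_action are hypothetical. Cases the procedure does not describe are reported as such rather than guessed.

```python
def step4_action(before, after):
    """Decide the Step 4 action from two samples of per-card error counts.

    before/after: dicts mapping a fabric card name to its control-path
    error count at the first and second sample.
    """
    # No errors at all: the procedure says to move on.
    if all(n == 0 for n in after.values()):
        return "go to step 5"
    # A card is "incrementing" if its count grew between the samples.
    incrementing = sorted(c for c, n in after.items() if n > before.get(c, 0))
    if not incrementing:
        return "errors present but not incrementing (not covered by the procedure)"
    if len(incrementing) == len(after):
        return "OIR primary CSC"
    if len(incrementing) == 1:
        return "OIR fabric card " + incrementing[0]
    return "errors on some fabric cards (not covered by the procedure)"

first = {"CSC0": 10, "CSC1": 0, "SFC0": 0, "SFC1": 4, "SFC2": 0}
second = {"CSC0": 10, "CSC1": 0, "SFC0": 0, "SFC1": 9, "SFC2": 0}
print(step4_action(first, second))
```

After the OIR, the same comparison on fresh samples tells you whether the issue was seating or whether the card should be RMAed.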

 

Step 5 – CSC Clocking and Synchronization problems 

Explanation

From show controllers clock and show controllers errors (CLKSTS field)

Check to see which card is primary from show controllers clock.

If all the cards are using the primary clock (default is CSC_0), then go to step 6

Cards not using the same clock must be in IOS RUN, RP ACTV or RP STBY; if not, go 
to step 6

If multiple cards are not using the primary, OIR the primary CSC; if the problem 
persists, RMA the primary CSC

If a single card is not using the primary, OIR the suspect card; if the problem 
persists, RMA the suspect card
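
The Step 5 rules can be sketched as a small Python function. It is a minimal sketch, assuming you have already read each card's state and the clock it is using from the show command output; the input format, the slot numbers and the function name step5_action are hypothetical.

```python
# States in which a card's clock selection is considered meaningful (from Step 5).
RUNNING_STATES = {"IOS RUN", "RP ACTV", "RP STBY"}

def step5_action(cards, primary="CSC_0"):
    """cards: dict mapping slot number to (state, clock_in_use)."""
    # Only cards in a running state that are off the primary clock count.
    off_primary = [slot for slot, (state, clock) in sorted(cards.items())
                   if clock != primary and state in RUNNING_STATES]
    if not off_primary:
        return "go to step 6"
    if len(off_primary) > 1:
        return "OIR primary CSC; RMA it if the problem persists"
    return f"OIR card in slot {off_primary[0]}; RMA it if the problem persists"

cards = {
    0: ("IOS RUN", "CSC_0"),
    3: ("IOS RUN", "CSC_1"),
    6: ("IOS RUN", "CSC_0"),
    9: ("RP ACTV", "CSC_0"),
}
print(step5_action(cards))
```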

 

Step 6 – ToFab FIA Halt

Explanation

Errors are observed in a syslog message or in the output of execute-on all show 
controllers fia.

If the RP has failed over and line cards are also halted, then suspect the 
chassis or backplane. If only a line card is halted, the router tries to recover 
several times; if it cannot recover, the RP resets the line card and runs 
additional tests. If the line card fails, then RMA the line card
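
The Step 6 reasoning can be written down as a short Python sketch. The inputs are observations you make yourself (whether the RP failed over, which line cards are halted, whether a halted card recovered or passed the additional tests); the parameter names and the function name step6_action are hypothetical, and the "no halt" branch is not spelled out in the procedure.

```python
def step6_action(rp_failed_over, halted_line_cards, recovered_or_passed=False):
    """Map Step 6 observations to the action described in the procedure."""
    if not halted_line_cards:
        # Not described in Step 6; nothing to act on here.
        return "no line card halted"
    if rp_failed_over:
        return "suspect chassis or backplane"
    if recovered_or_passed:
        return "line card recovered; keep monitoring"
    slots = ", ".join(str(s) for s in sorted(halted_line_cards))
    return "RMA line card in slot " + slots

print(step6_action(rp_failed_over=False, halted_line_cards=[6]))
```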

 

Step 7 - CRC and LOS errors between fabric cards and line cards/RPs            

Explanation from LC/RP to Fabric

Explanation from Fabric to LC/RP

Errors are observed from show controllers errors (not useful on 120XX) and show 
controllers errors fabric counters. The DAT_LOS (124XX and 128XX) and DAT_CRC 
(128XX only) fields identify the cards. On a 120XX, the cause of errors from LC/RP 
to fabric can only be determined by removing 1 card at a time to see if the errors 
stop. Since the possibility is high that an in-use line card is the problem, start 
with the backbone-facing cards one at a time, then customer-facing cards one at 
a time, then cards not in use one at a time.
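
The removal order described for a 120XX can be sketched as a simple sort in Python. The role labels ("backbone", "customer", "unused") and the function name removal_order are hypothetical; you would fill in the slot list from your own chassis inventory.

```python
# Priority taken from the paragraph above: backbone-facing first,
# then customer-facing, then cards not in use.
ROLE_ORDER = {"backbone": 0, "customer": 1, "unused": 2}

def removal_order(cards):
    """cards: list of (slot, role); returns slots in the order to pull them."""
    ordered = sorted(cards, key=lambda c: (ROLE_ORDER[c[1]], c[0]))
    return [slot for slot, _role in ordered]

inventory = [(3, "customer"), (1, "unused"), (5, "backbone"), (2, "backbone")]
print(removal_order(inventory))
```

Pull one card at a time in that order and check the counters after each removal before moving on to the next.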

If multiple cards show DAT_CRC and DAT_LOS errors, then the cause is most likely a 
fabric card, determined from the bitmap. Reseat the suspect card to see if errors 
continue. If so, RMA the card.

show controllers errors fabric counters shows errors from the fabric. The bitmask 
will determine which one is suspect. Reset the suspect card to see if errors 
continue. If so, RMA the card.




_______________________________________________
cisco-nsp mailing list  [email protected]
https://puck.nether.net/mailman/listinfo/cisco-nsp
archive at http://puck.nether.net/pipermail/cisco-nsp/
