Re: [j-nsp] what happens if HDD on routing-engine fails during the router operation?

Morgan McLean Wed, 26 Jun 2013 12:33:58 -0700

Interestingly enough, last night an EX switch of mine had something go wrong
with its onboard flash, and the thing ate it pretty hard.


It came back up with errors like this, and then crashed again shortly afterward.

Jun 26 00:48:14  tor-205-a.sv.<snipped> fpc0 Route TCAM rows need not be
redirected on device 0.

Jun 26 00:48:14  tor-205-a.sv.<snipped> fpc0 Route TCAM rows need not be
redirected on device 1.

Jun 26 00:48:15  tor-205-a.sv.<snipped> fpc0 PFEM: Enabling traffic for dev
0

Jun 26 00:48:15  tor-205-a.sv.<snipped> chassisd[985]:
LIBJSNMP_SA_PARTIAL_SEND_FRAG: Attempted to send 68 bytes, actually sent 4
bytes

Jun 26 00:48:15  tor-205-a.sv.<snipped> chassisd[985]:
LIBJSNMP_SA_PARTIAL_SEND_REM: Queuing message remainder, 64 bytes

Jun 26 00:48:15  tor-205-a.sv.<snipped> fpc0 PFEM: Enabling traffic for dev
1

Jun 26 00:48:17  tor-205-a.sv.<snipped> /kernel: RT_PFE: RT msg op 1
(PREFIX ADD) failed, err 5 (Invalid)

Jun 26 00:48:17  tor-205-a.sv.<snipped> chassisd[985]:
LIBJSNMP_SA_PARTIAL_SEND_FRAG: Attempted to send 68 bytes, actually sent 52
bytes

Jun 26 00:48:17  tor-205-a.sv.<snipped> chassisd[985]:
LIBJSNMP_SA_PARTIAL_SEND_REM: Queuing message remainder, 16 bytes

Jun 26 00:48:19  tor-205-a.sv.<snipped> chassisd[985]:
LIBJSNMP_SA_PARTIAL_SEND_FRAG: Attempted to send 68 bytes, actually sent 56
bytes

Jun 26 00:48:20  tor-205-a.sv.<snipped> chassisd[985]:
LIBJSNMP_SA_PARTIAL_SEND_REM: Queuing message remainder, 12 bytes

Jun 26 00:48:21  tor-205-a.sv.<snipped> lldpd[1009]:
LIBESPTASK_SNMP_CONN_RETRY: snmp_epi_reg_refresh: reattempting connection
to SNMP agent (register MIBs): Resource temporarily unavailable

Jun 26 00:48:22  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0):
READ(10). CDB: 28 0 0 19 6a 0 0 0 20 0

Jun 26 00:48:22  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0):
CAM Status: SCSI Status Error

Jun 26 00:48:22  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0):
SCSI Status: Check Condition

Jun 26 00:48:22  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0):
MEDIUM ERROR asc:11,0

Jun 26 00:48:22  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0):
Unrecovered read error

Jun 26 00:48:22  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0):
Retrying Command (per Sense Data)

Jun 26 00:48:23  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0):
READ(10). CDB: 28 0 0 19 6b 0 0 0 80 0

Jun 26 00:48:23  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0):
CAM Status: SCSI Status Error

Jun 26 00:48:23  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0):
SCSI Status: Check Condition

Jun 26 00:48:23  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0):
ILLEGAL REQUEST asc:20,0

Jun 26 00:48:23  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0):
Invalid command operation code

Jun 26 00:48:23  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0):
Unretryable error

Jun 26 00:48:23  tor-205-a.sv.<snipped> /kernel:
g_vfs_done():da0s3e[READ(offset=67502080, length=65536)]error = 22

Jun 26 00:48:23  tor-205-a.sv.<snipped> /kernel: vnode_pager_getpages: I/O
read error

Jun 26 00:48:23  tor-205-a.sv.<snipped> /kernel: vm_fault: pager read
error, pid 1047 (cp)

Jun 26 00:48:25  tor-205-a.sv.<snipped> fpc0 pfe_pme_max 24

Jun 26 00:48:25  tor-205-a.sv.<snipped> fpc0 PFEMAN: Sent Resync request to
Master

Jun 26 00:48:25  tor-205-a.sv.<snipped> fpc0
MRVL-L2:mrvl_brg_port_stg_entry_set(),293:l2ifl not found for ifl 4!

Jun 26 00:48:25  tor-205-a.sv.<snipped> fpc0
MRVL-L2:mrvl_brg_port_stg_create(),539:Port-STG-Set failed(Invalid
Params:-2)

Jun 26 00:48:25  tor-205-a.sv.<snipped> fpc0
RT-HAL,rt_entry_add_msg_proc,2790: l2_halp_vectors->l2_entry_create failed

Jun 26 00:48:25  tor-205-a.sv.<snipped> fpc0
RT-HAL,rt_entry_add_msg_proc,2883: proto MSTI,len 48 prefix 00004:00254 nh
82

Jun 26 00:48:25  tor-205-a.sv.<snipped> fpc0 RT-HAL,rt_msg_handler,597:
route process failed
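
For anyone wanting to spot this sort of thing across a batch of switches, here
is a minimal sketch (not a Juniper tool) that counts the SCSI sense-key errors
per device in a saved messages file. It assumes one message per line, as in
the original syslog file rather than the mail-wrapped excerpt above. The
function name, the regular expression and the list of sense keys are my own
choices, taken only from the lines shown here.

```python
import re
from collections import Counter

# Matches FreeBSD CAM kernel messages in the form seen in the excerpt, e.g.
# "... /kernel: (da0:umass-sim0:0:0:0): MEDIUM ERROR asc:11,0"
CAM_LINE = re.compile(r"/kernel: \((?P<dev>[a-z]+\d+):[^)]*\):\s*(?P<msg>.+)$")

# Sense keys that appear in the excerpt; this list is an assumption and
# is not exhaustive.
SENSE_KEYS = ("MEDIUM ERROR", "ILLEGAL REQUEST")


def count_sense_errors(lines):
    """Return a Counter keyed by (device, sense_key) for matching log lines."""
    counts = Counter()
    for line in lines:
        match = CAM_LINE.search(line.strip())
        if not match:
            continue  # not a CAM message for a disk device
        msg = match.group("msg")
        for key in SENSE_KEYS:
            if msg.startswith(key):
                counts[(match.group("dev"), key)] += 1
    return counts
```

Feeding it the lines of a messages file (for example `count_sense_errors(open(path))`,
where `path` is wherever you saved the log) gives a quick per-device tally.
Any non-zero count for the boot media is probably worth a closer look.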


On Wed, Jun 26, 2013 at 5:16 AM, Martin T <[email protected]> wrote:

> I did not try "set chassis redundancy failover on-disk-failure", as
> that should be for a GRES configuration, and I have a single RE in both
> the M10i and the M20.
>
>
> regards,
> Martin
>
> 2013/6/26, Per Granath <[email protected]>:
> > Note that these are two different configurations:
> >
> > set chassis routing-engine on-disk-failure disk-failure-action reboot
> > set chassis redundancy failover on-disk-failure
> >
> > Did you try both?
> >
> >
> > -----Original Message-----
> > From: Martin T [mailto:[email protected]]
> > Sent: Wednesday, June 26, 2013 11:58 AM
> > To: Per Granath
> > Cc: [email protected]; [email protected]
> > Subject: Re: [j-nsp] what happens if HDD on routing-engine fails during
> > the router operation?
> >
> > Hi,
> >
> > I did now :) However, it had no effect. On the other hand, unmounting
> > /var is not nearly the same as the HDD being completely removed from,
> > or failing on, a working routing-engine.
> >
> >
> > Example with M20:
> >
> > root@M20> show configuration chassis
> > routing-engine {
> >     on-disk-failure disk-failure-action reboot;
> > }
> >
> > root@M20> show system processes brief
> > last pid:  1475;  load averages:  0.00,  0.12,  0.15  up 0+00:11:35  07:08:28
> > 105 processes: 3 running, 86 sleeping, 16 waiting
> >
> > Mem: 136M Active, 115M Inact, 32M Wired, 132M Cache, 69M Buf, 1580M Free
> > Swap: 2048M Total, 2048M Free
> >
> >
> >
> >
> > root@M20> start shell csh
> > root@M20% mount
> > /dev/ad0s1a on / (ufs, local, noatime)
> > devfs on /dev (devfs, local)
> > devfs on /dev/ (devfs, local, noatime, noexec, read-only)
> > /dev/md0 on /packages/mnt/jbase (cd9660, local, noatime, read-only)
> > /dev/md1 on /packages/mnt/jkernel-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md2 on /packages/mnt/jpfe-M40-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md3 on /packages/mnt/jdocs-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md4 on /packages/mnt/jroute-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md5 on /packages/mnt/jcrypto-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md6 on /packages/mnt/jpfe-common-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md7 on /tmp (ufs, local, noatime, soft-updates)
> > /dev/md8 on /mfs (ufs, local, noatime, soft-updates)
> > /dev/ad0s1e on /config (ufs, local, noatime)
> > procfs on /proc (procfs, local, noatime)
> > /dev/ad1s1f on /var (ufs, local, noatime)
> > root@M20% umount -f /var
> > root@M20% mount
> > /dev/ad0s1a on / (ufs, local, noatime)
> > devfs on /dev (devfs, local)
> > devfs on /dev/ (devfs, local, noatime, noexec, read-only)
> > /dev/md0 on /packages/mnt/jbase (cd9660, local, noatime, read-only)
> > /dev/md1 on /packages/mnt/jkernel-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md2 on /packages/mnt/jpfe-M40-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md3 on /packages/mnt/jdocs-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md4 on /packages/mnt/jroute-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md5 on /packages/mnt/jcrypto-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md6 on /packages/mnt/jpfe-common-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md7 on /tmp (ufs, local, noatime, soft-updates)
> > /dev/md8 on /mfs (ufs, local, noatime, soft-updates)
> > /dev/ad0s1e on /config (ufs, local, noatime)
> > procfs on /proc (procfs, local, noatime)
> > root@M20% exit
> > exit
> >
> > root@M20> ?
> > No valid completions
> > root@M20>
> > error: unknown command: .noop-command
> >
> >
> > root@M20> Jun 26 07:09:49 init: can't chdir to /var/tmp/: No such file or directory
> > Jun 26 07:09:54 init: can't chdir to /var/tmp/: No such file or directory
> > Jun 26 07:09:59 init: can't chdir to /var/tmp/: No such file or directory
> > Jun 26 07:10:04 init: can't chdir to /var/tmp/: No such file or directory
> > Jun 26 07:10:04 init: can't chdir to /var/tmp/: No such file or directory
> >
> >
> >
> > Example with M10i:
> >
> > root@M10i> show configuration chassis
> > routing-engine {
> >     on-disk-failure disk-failure-action reboot;
> > }
> >
> > root@M10i> show system processes brief
> > last pid:  1473;  load averages:  3.97,  1.22,  0.47  up 0+00:02:46  08:17:13
> > 111 processes: 5 running, 89 sleeping, 17 waiting
> >
> > Mem: 181M Active, 54M Inact, 33M Wired, 216M Cache, 69M Buf, 1012M Free
> > Swap: 2048M Total, 2048M Free
> >
> >
> >
> >
> > root@M10i> start shell csh
> > root@M10i% mount
> > /dev/ad0s1a on / (ufs, local, noatime)
> > devfs on /dev (devfs, local, multilabel)
> > /dev/md0 on /packages/mnt/jbase (cd9660, local, noatime, read-only, verified)
> > /dev/md1 on /packages/mnt/jkernel-10.4R12.4 (cd9660, local, noatime, read-only, verified)
> > /dev/md2 on /packages/mnt/jpfe-M7i-10.4R12.4 (cd9660, local, noatime, read-only)
> > /dev/md3 on /packages/mnt/jdocs-10.4R12.4 (cd9660, local, noatime, read-only, verified)
> > /dev/md4 on /packages/mnt/jroute-10.4R12.4 (cd9660, local, noatime, read-only, verified)
> > /dev/md5 on /packages/mnt/jcrypto-10.4R12.4 (cd9660, local, noatime, read-only, verified)
> > /dev/md6 on /packages/mnt/jpfe-common-10.4R12.4 (cd9660, local, noatime, read-only)
> > /dev/md7 on /packages/mnt/jruntime-10.4R12.4 (cd9660, local, noatime, read-only, verified)
> > /dev/md8 on /tmp (ufs, asynchronous, local, noatime)
> > /dev/md9 on /mfs (ufs, asynchronous, local, noatime)
> > /dev/ad0s1e on /config (ufs, local, noatime)
> > procfs on /proc (procfs, local, noatime)
> > /dev/ad1s1f on /var (ufs, local, noatime)
> > root@M10i% umount -f /var
> > root@M10i% mount
> > /dev/ad0s1a on / (ufs, local, noatime)
> > devfs on /dev (devfs, local, multilabel)
> > /dev/md0 on /packages/mnt/jbase (cd9660, local, noatime, read-only, verified)
> > /dev/md1 on /packages/mnt/jkernel-10.4R12.4 (cd9660, local, noatime, read-only, verified)
> > /dev/md2 on /packages/mnt/jpfe-M7i-10.4R12.4 (cd9660, local, noatime, read-only)
> > /dev/md3 on /packages/mnt/jdocs-10.4R12.4 (cd9660, local, noatime, read-only, verified)
> > /dev/md4 on /packages/mnt/jroute-10.4R12.4 (cd9660, local, noatime, read-only, verified)
> > /dev/md5 on /packages/mnt/jcrypto-10.4R12.4 (cd9660, local, noatime, read-only, verified)
> > /dev/md6 on /packages/mnt/jpfe-common-10.4R12.4 (cd9660, local, noatime, read-only)
> > /dev/md7 on /packages/mnt/jruntime-10.4R12.4 (cd9660, local, noatime, read-only, verified)
> > /dev/md8 on /tmp (ufs, asynchronous, local, noatime)
> > /dev/md9 on /mfs (ufs, asynchronous, local, noatime)
> > /dev/ad0s1e on /config (ufs, local, noatime)
> > procfs on /proc (procfs, local, noatime)
> > root@M10i% Jun 26 08:18:04 init: can't chdir to /var/tmp/: No such file or directory
> > exit
> > exit
> >
> > root@M10i> Jun 26 08:18:09 init: can't chdir to /var/tmp/: No such file or directory
> > ?
> > No valid completions
> > root@M10i> Jun 26 08:18:15 init: can't chdir to /var/tmp/: No such file or directory
> > Jun 26 08:18:20 init: can't chdir to /var/tmp/: No such file or directory
> > Jun 26 08:18:20 init: can't chdir to /var/tmp/: No such file or directory
> >
> >
> > Another important consequence of an HDD failure is that swap space is
> > lost. This is probably rather critical on, for example, the RE-333-256.
> > In addition, it looks like the RE-850 has no problem booting without
> > the HDD, while the RE-600 and RE-333 do not boot without it.
> >
> >
> > Still, what exactly makes the RE reload when the HDD is lost?
> >
> >
> > regards,
> > Martin
> >
> > 2013/6/26, Per Granath <[email protected]>:
> >> Did you try it with this configuration?
> >>
> >> chassis {
> >>     redundancy {
> >>         failover {
> >>             on-loss-of-keepalives;
> >>             on-disk-failure;
> >>         }
> >>     }
> >> }
> >>
> >>
> >>
> >> _______________________________________________
> >> juniper-nsp mailing list [email protected]
> >> https://puck.nether.net/mailman/listinfo/juniper-nsp
> >>
> >
>
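
The quoted test compares `mount` output before and after `umount -f /var`. If
you capture that output from several REs, a rough sketch like the one below
can confirm which mount points are still present. It assumes one filesystem
per line in the `device on mountpoint (options)` form shown in the quoted
output; the function name and regular expression are my own and are not part
of Junos.

```python
import re

# One line of FreeBSD-style `mount` output: "<device> on <mountpoint> (<options>)"
MOUNT_LINE = re.compile(r"^(?P<dev>\S+) on (?P<mnt>\S+) \((?P<opts>[^)]*)\)$")


def mounted_points(mount_output):
    """Return {mount_point: device} parsed from captured `mount` output."""
    result = {}
    for line in mount_output.splitlines():
        match = MOUNT_LINE.match(line.strip())
        if match:
            result[match.group("mnt")] = match.group("dev")
    return result
```

Checking `"/var" in mounted_points(captured_text)` (where `captured_text` is
the saved output) then tells you whether /var was still mounted at capture
time. It says nothing about whether the underlying disk is healthy.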



-- 
Thanks,
Morgan
