On Wed, 28 Nov 2018 at 21:50, Andrew Elwell <andrew.elw...@gmail.com> wrote:
> I'm running into a problem (robinhood server locking up) scanning one
> of our filesystems (lustre 2.10.5, robinhood 3.1.4).

Following up on this to the list, as it may be of interest to others:
I've narrowed down what causes the crash. I started with a couple of
'Ignore' directives, then used partial scans to find where it crashed.

We had made a copy of another system, including its special character
devices, onto our Lustre filesystem before upgrading last year:

bravo:/group/pawsey0001/cle6_prep/chaos-smw/opt/cray/hss-images/7.1.UP00PS02
# ls -l lib/udev/devices/
total 12
crw-------+ 1 root pawsey0001   5,   1 Jan 24  2012 console
lrwxrwxrwx  1 root pawsey0001       11 Jan  8  2014 core -> /proc/kcore
lrwxrwxrwx  1 root pawsey0001       13 Jan  8  2014 fd -> /proc/self/fd
crw-rw----+ 1 root pawsey0001  10, 200 Jan 24  2012 fwmonitor
crw-rw----+ 1 root pawsey0001   1,  11 Jan 24  2012 kmsg
crw-rw----+ 1 root pawsey0001   6,   0 Jan 24  2012 lp0
drwxr-xr-x+ 2 root pawsey0001     4096 Jan  8  2014 net
crw-rw-rw-+ 1 root pawsey0001   1,   3 Jan 24  2012 null
crw-rw----+ 1 root pawsey0001 108,   0 Jan 24  2012 ppp
crw-rw-rw-+ 1 root pawsey0001   5,   2 Jan 24  2012 ptmx
drwxr-xr-x+ 2 root pawsey0001     4096 Jan 24  2012 pts
crw-rw----+ 1 root pawsey0001  36,   0 Jan 24  2012 route
drwxr-xr-x+ 2 root pawsey0001     4096 Jan 24  2012 shm
crw-rw----+ 1 root pawsey0001  10, 200 Jan 24  2012 skip
lrwxrwxrwx  1 root pawsey0001       15 Jan  8  2014 stderr -> /proc/self/fd/2
lrwxrwxrwx  1 root pawsey0001       15 Jan  8  2014 stdin -> /proc/self/fd/0
lrwxrwxrwx  1 root pawsey0001       15 Jan  8  2014 stdout -> /proc/self/fd/1
crw-rw-rw-+ 1 root pawsey0001   5,   0 Jan 24  2012 tty
crw--w----+ 1 root pawsey0001   4,   1 Jan 24  2012 tty1
crw-rw----+ 1 root pawsey0001   4,  64 Jan 24  2012 ttyS0
crw-rw----+ 1 root pawsey0001   4,  65 Jan 24  2012 ttyS1
crw-rw----+ 1 root pawsey0001   4,  66 Jan 24  2012 ttyS2
crw-rw----+ 1 root pawsey0001   4,  67 Jan 24  2012 ttyS3
crw-rw----+ 1 root pawsey0001   4,  68 Jan 24  2012 ttyS4
crw-rw----+ 1 root pawsey0001   4,  69 Jan 24  2012 ttyS5
crw-rw----+ 1 root pawsey0001   4,  70 Jan 24  2012 ttyS6
crw-rw----+ 1 root pawsey0001   4,  71 Jan 24  2012 ttyS7
crw-rw----+ 1 root pawsey0001  10, 130 Jan 24  2012 watchdog
crw-rw-rw-+ 1 root pawsey0001   1,   5 Jan 24  2012 zero
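Device nodes like these can be located without opening any of them. The
following is only a sketch, not something robinhood or lfs provides:
find_device_nodes is a hypothetical helper name, and it relies on
os.lstat, which reads the inode metadata and never opens the node itself.

```python
import os
import stat


def find_device_nodes(root):
    """Yield (path, kind, major, minor) for every device node under root.

    Only os.lstat is used, so no device node is ever opened.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        # os.walk reports every non-directory entry in filenames,
        # which is where device nodes show up.
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if stat.S_ISCHR(st.st_mode):
                kind = "char"
            elif stat.S_ISBLK(st.st_mode):
                kind = "block"
            else:
                continue
            yield path, kind, os.major(st.st_rdev), os.minor(st.st_rdev)
```

Printing each tuple for the directory above should list the same
major,minor pairs that ls shows, and the resulting paths could then be
used as candidates for 'Ignore' rules.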


To reproducibly kill the box, all I need to do is:
bravo:/group/pawsey0001/cle6_prep/chaos-smw/opt/cray/hss-images/7.1.UP00PS02
# lfs getstripe ./lib/udev/devices/watchdog
error: can't get lov name.: Inappropriate ioctl for device (25)
./lib/udev/devices/watchdog has no stripe info
bravo:/group/pawsey0001/cle6_prep/chaos-smw/opt/cray/hss-images/7.1.UP00PS02 #


And lo, on the console:

bravo login: [  458.710722] Kernel panic - not syncing: 02: An NMI occurred. Depending on your system the reason for the NMI is logged in any one of the following resources:
[  458.710722] 1. Integrated Management Log (IML)
[  458.710722] 2. OA Syslog
[  458.710722] 3. OA Forward Progress Log
[  458.710722] 4. iLO Event Log
[  458.710729] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           OE N  4.4.162-94.72-default #1
[  458.710730] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 01/22/2018
[  458.710736]  0000000000000000 ffffffff8132cdc0 ffffffffa0409000 ffff883f7fa09e28
[  458.710739]  ffffffff81193c21 ffffffff00000008 ffff883f7fa09e38 ffff883f7fa09dd8
[  458.710742]  0000000000000000 ffffc9000ced6072 0000000000000000 000000000000001f
[  458.710742] Call Trace:
[  458.710765]  [<ffffffff81019b09>] dump_trace+0x59/0x340
[  458.710773]  [<ffffffff81019eda>] show_stack_log_lvl+0xea/0x170
[  458.710779]  [<ffffffff8101acb1>] show_stack+0x21/0x40
[  458.710788]  [<ffffffff8132cdc0>] dump_stack+0x5c/0x7c
[  458.710796]  [<ffffffff81193c21>] panic+0xd2/0x232
[  458.710807]  [<ffffffffa04074d4>] hpwdt_pretimeout+0xa4/0xa4 [hpwdt]
[  458.710828]  [<ffffffff8101b009>] nmi_handle+0x69/0x110
[  458.710835]  [<ffffffff8101b476>] io_check_error+0x16/0xb0
[  458.710840]  [<ffffffff8101b5a4>] default_do_nmi+0x94/0x110
[  458.710846]  [<ffffffff8101b6fc>] do_nmi+0xdc/0x150
[  458.710854]  [<ffffffff816218f1>] end_repeat_nmi+0xa6/0xae
[  458.715742] DWARF2 unwinder stuck at end_repeat_nmi+0xa6/0xae
[  458.715742]
[  458.715743] Leftover inexact backtrace:
[  458.715743]
[  458.715750]  <NMI>  [<ffffffff813adfe1>] ? intel_idle+0xc1/0x130
[  458.715753]  [<ffffffff813adfe1>] ? intel_idle+0xc1/0x130
[  458.715755]  [<ffffffff813adfe1>] ? intel_idle+0xc1/0x130
[  458.715762]  <<EOE>>  [<ffffffff814db39c>] ? cpuidle_enter_state+0x9c/0x260
[  458.715767]  [<ffffffff810c6ddf>] ? cpu_startup_entry+0x29f/0x390
[  458.715775]  [<ffffffff81f8f0c7>] ? start_kernel+0x4c8/0x4d3
[  458.715778]  [<ffffffff81f8ea03>] ? set_init_arg+0x50/0x50
[  458.715783]  [<ffffffff81f8e120>] ? early_idt_handler_array+0x120/0x120
[  458.715787]  [<ffffffff81f8e719>] ? x86_64_start_kernel+0x147/0x156
[  458.760426] Kernel Offset: disabled
[  458.774530] ERST: [Firmware Warn]: Firmware does not respond in time.
[  459.770516] ---[ end Kernel panic - not syncing: 02: An NMI occurred. Depending on your system the reason for the NMI is logged in any one of the following resources:
[  459.770516] 1. Integrated Management Log (IML)
[  459.770516] 2. OA Syslog
[  459.770516] 3. OA Forward Progress Log
[  459.770516] 4. iLO Event Log


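So robinhood is off the hook :-) The NMI comes in via hpwdt_pretimeout,
so presumably opening the copied watchdog node (char 10,130) armed the
HP hardware watchdog, which then fired.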
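
Since the ioctl error above shows that lfs getstripe reached the driver
behind the device node, a wrapper could refuse to run it on anything
that is not a regular file or directory. This is a minimal sketch under
that assumption; is_stripe_candidate and getstripe_if_safe are
hypothetical names, and the only lfs invocation used is the
"lfs getstripe PATH" form shown earlier.

```python
import os
import stat
import subprocess


def is_stripe_candidate(path):
    """True only for regular files and directories (checked without opening)."""
    mode = os.lstat(path).st_mode
    return stat.S_ISREG(mode) or stat.S_ISDIR(mode)


def getstripe_if_safe(path):
    """Run `lfs getstripe PATH` unless PATH is a special file or symlink.

    Returns the command's stdout, or None when the path was skipped.
    """
    if not is_stripe_candidate(path):
        return None  # device node, FIFO, socket or symlink: leave it alone
    done = subprocess.run(["lfs", "getstripe", path],
                          capture_output=True, text=True, check=False)
    return done.stdout
```

Symlinks are skipped as well, which is deliberately conservative: their
targets (such as core -> /proc/kcore in the listing) may be special
files too.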

Andrew


_______________________________________________
robinhood-support mailing list
robinhood-support@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/robinhood-support
