On Wed, 28 Nov 2018 at 21:50, Andrew Elwell <andrew.elw...@gmail.com> wrote:
> I'm running into a problem (robinhood server locking up) scanning one
> of our filesystems (lustre 2.10.5, robinhood 3.1.4).

So, following up on this back to the list as it may be of interest to
others: I've narrowed down what causes the crash. This involved a
couple of 'Ignore' directives to begin with, and then partial scans to
see where it crashed.

We had made a copy of another system, including special character
devices, onto our lustre filesystem before upgrading last year:

bravo:/group/pawsey0001/cle6_prep/chaos-smw/opt/cray/hss-images/7.1.UP00PS02 # ls -l lib/udev/devices/
total 12
crw-------+ 1 root pawsey0001   5,   1 Jan 24  2012 console
lrwxrwxrwx  1 root pawsey0001       11 Jan  8  2014 core -> /proc/kcore
lrwxrwxrwx  1 root pawsey0001       13 Jan  8  2014 fd -> /proc/self/fd
crw-rw----+ 1 root pawsey0001  10, 200 Jan 24  2012 fwmonitor
crw-rw----+ 1 root pawsey0001   1,  11 Jan 24  2012 kmsg
crw-rw----+ 1 root pawsey0001   6,   0 Jan 24  2012 lp0
drwxr-xr-x+ 2 root pawsey0001     4096 Jan  8  2014 net
crw-rw-rw-+ 1 root pawsey0001   1,   3 Jan 24  2012 null
crw-rw----+ 1 root pawsey0001 108,   0 Jan 24  2012 ppp
crw-rw-rw-+ 1 root pawsey0001   5,   2 Jan 24  2012 ptmx
drwxr-xr-x+ 2 root pawsey0001     4096 Jan 24  2012 pts
crw-rw----+ 1 root pawsey0001  36,   0 Jan 24  2012 route
drwxr-xr-x+ 2 root pawsey0001     4096 Jan 24  2012 shm
crw-rw----+ 1 root pawsey0001  10, 200 Jan 24  2012 skip
lrwxrwxrwx  1 root pawsey0001       15 Jan  8  2014 stderr -> /proc/self/fd/2
lrwxrwxrwx  1 root pawsey0001       15 Jan  8  2014 stdin -> /proc/self/fd/0
lrwxrwxrwx  1 root pawsey0001       15 Jan  8  2014 stdout -> /proc/self/fd/1
crw-rw-rw-+ 1 root pawsey0001   5,   0 Jan 24  2012 tty
crw--w----+ 1 root pawsey0001   4,   1 Jan 24  2012 tty1
crw-rw----+ 1 root pawsey0001   4,  64 Jan 24  2012 ttyS0
crw-rw----+ 1 root pawsey0001   4,  65 Jan 24  2012 ttyS1
crw-rw----+ 1 root pawsey0001   4,  66 Jan 24  2012 ttyS2
crw-rw----+ 1 root pawsey0001   4,  67 Jan 24  2012 ttyS3
crw-rw----+ 1 root pawsey0001   4,  68 Jan 24  2012 ttyS4
crw-rw----+ 1 root pawsey0001   4,  69 Jan 24  2012 ttyS5
crw-rw----+ 1 root pawsey0001   4,  70 Jan 24  2012 ttyS6
crw-rw----+ 1 root pawsey0001   4,  71 Jan 24  2012 ttyS7
crw-rw----+ 1 root pawsey0001  10, 130 Jan 24  2012 watchdog
crw-rw-rw-+ 1 root pawsey0001   1,   5 Jan 24  2012 zero

and to reproducibly kill the box, all I need to do is:

bravo:/group/pawsey0001/cle6_prep/chaos-smw/opt/cray/hss-images/7.1.UP00PS02 # lfs getstripe ./lib/udev/devices/watchdog
error: can't get lov name.: Inappropriate ioctl for device (25)
./lib/udev/devices/watchdog has no stripe info
bravo:/group/pawsey0001/cle6_prep/chaos-smw/opt/cray/hss-images/7.1.UP00PS02 #

and lo:

bravo login: [ 458.710722] Kernel panic - not syncing: 02: An NMI occurred. Depending on your system the reason for the NMI is logged in any one of the following resources:
[ 458.710722] 1. Integrated Management Log (IML)
[ 458.710722] 2. OA Syslog
[ 458.710722] 3. OA Forward Progress Log
[ 458.710722] 4. iLO Event Log
[ 458.710729] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G OE N 4.4.162-94.72-default #1
[ 458.710730] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 01/22/2018
[ 458.710736] 0000000000000000 ffffffff8132cdc0 ffffffffa0409000 ffff883f7fa09e28
[ 458.710739] ffffffff81193c21 ffffffff00000008 ffff883f7fa09e38 ffff883f7fa09dd8
[ 458.710742] 0000000000000000 ffffc9000ced6072 0000000000000000 000000000000001f
[ 458.710742] Call Trace:
[ 458.710765] [<ffffffff81019b09>] dump_trace+0x59/0x340
[ 458.710773] [<ffffffff81019eda>] show_stack_log_lvl+0xea/0x170
[ 458.710779] [<ffffffff8101acb1>] show_stack+0x21/0x40
[ 458.710788] [<ffffffff8132cdc0>] dump_stack+0x5c/0x7c
[ 458.710796] [<ffffffff81193c21>] panic+0xd2/0x232
[ 458.710807] [<ffffffffa04074d4>] hpwdt_pretimeout+0xa4/0xa4 [hpwdt]
[ 458.710828] [<ffffffff8101b009>] nmi_handle+0x69/0x110
[ 458.710835] [<ffffffff8101b476>] io_check_error+0x16/0xb0
[ 458.710840] [<ffffffff8101b5a4>] default_do_nmi+0x94/0x110
[ 458.710846] [<ffffffff8101b6fc>] do_nmi+0xdc/0x150
[ 458.710854] [<ffffffff816218f1>] end_repeat_nmi+0xa6/0xae
[ 458.715742] DWARF2 unwinder stuck at end_repeat_nmi+0xa6/0xae
[ 458.715742]
[ 458.715743] Leftover inexact backtrace:
[ 458.715743]
[ 458.715750] <NMI> [<ffffffff813adfe1>] ? intel_idle+0xc1/0x130
[ 458.715753] [<ffffffff813adfe1>] ? intel_idle+0xc1/0x130
[ 458.715755] [<ffffffff813adfe1>] ? intel_idle+0xc1/0x130
[ 458.715762] <<EOE>> [<ffffffff814db39c>] ? cpuidle_enter_state+0x9c/0x260
[ 458.715767] [<ffffffff810c6ddf>] ? cpu_startup_entry+0x29f/0x390
[ 458.715775] [<ffffffff81f8f0c7>] ? start_kernel+0x4c8/0x4d3
[ 458.715778] [<ffffffff81f8ea03>] ? set_init_arg+0x50/0x50
[ 458.715783] [<ffffffff81f8e120>] ? early_idt_handler_array+0x120/0x120
[ 458.715787] [<ffffffff81f8e719>] ? x86_64_start_kernel+0x147/0x156
[ 458.760426] Kernel Offset: disabled
[ 458.774530] ERST: [Firmware Warn]: Firmware does not respond in time.
[ 459.770516] ---[ end Kernel panic - not syncing: 02: An NMI occurred. Depending on your system the reason for the NMI is logged in any one of the following resources:
[ 459.770516] 1. Integrated Management Log (IML)
[ 459.770516] 2. OA Syslog
[ 459.770516] 3. OA Forward Progress Log
[ 459.770516] 4. iLO Event Log

so, robinhood is off the hook :-)

Andrew

_______________________________________________
robinhood-support mailing list
robinhood-support@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/robinhood-support