On Tue, Apr 21, 2020 at 2:14 PM Bart Brashers via openindiana-discuss <
openindiana-discuss@openindiana.org> wrote:

> Hi everyone, I'm new to this listserv. I run Air Quality and
> Meteorological models on an HPC cluster, which is all CentOS except for one
> storage server running OpenIndiana (SunOS 5.11 oi_151a9 November 2013). I
> know, I know, it's a ridiculously old installation, please don't bug me
> about that.
>
> I would like to figure out what happened to that server this past weekend.
> My goal is to figure out if there's something I can do to avoid having the
> problem described below happen again.
>
> OpenIndiana is running on a Supermicro box, with a SAS attached JBOD,
> about 85 spinning disks in two ZFS pools, one of SAS disks the other of
> SATA disks. Periodically, when load gets too high, it becomes unresponsive
> for 5 - 30 minutes, but if we're patient enough it comes back. The load (as
> reported by /usr/bin/top) immediately after such an event is ~200, which
> rapidly falls back to a more normal range of ~0.5.
>
> Two days ago on Sunday evening, it went off into la-la land again, but
> after a few hours hadn't come back. The IPMI interface was also not
> responding, so I couldn't reboot it remotely. I went in to the office on
> Monday morning and shut down the server, then pulled the power cords for 20
> seconds. The complete removal of power often helps in situations like this,
> I've found.
>
> The server then entered an endless loop: it would try to boot, time out
> about 6 times (taking ~5 minutes per timeout) with the following message,
> then kernel panic and reboot.
>
> Warning: /pci@0,0/pci8086,3c04@2/pci100,3020@0 (mpt_sas10):
>        Disconnected command timeout for Target 156
> ...repeat...
> panic[cpu0]/thread=ffffff01e80cbc40: I/O to pool 'pool0' appears to be
> hung.
>
> Great! This OS already has so many names for disks, and here's another
> one: which disk is Target 156? Sometimes it was Target 75, sometimes
> Target 150. Or is that a SAS expander? I couldn't log in to check; it
> never got that far before the kernel panic and reboot.
>
> I was able to boot into single-user mode (append -s to the grub line
> containing "kernel") and poked around until I found two disks that were
> reporting errors. fmdump -eV was useful, though so verbose it took a while
> to figure out what to read. The best/clearest method was echo | format,
> which is not a command I would have guessed based on decades of experience
> with Linux ;-). I pulled two bad disks, and rebooted... and it went back
> into the endless panic-reboot loop.
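>
> Concretely, the useful commands were something like:
>
>   fmdump -eV | less   # fault management error log (very verbose)
>   echo | format       # lists all disks; a failing disk tends to stand
>                       # out (or hang the listing) here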
>
> I eventually found this page:
> https://docs.oracle.com/cd/E23824_01/html/821-1448/gbbwc.html#scrolltoc
> and followed this procedure:
>
>
>   *   When the booting gets to the grub stage, press e to edit
>   *   Scroll to the line containing "kernel" and press e again to edit
>   *   At the end of the line, add the text -m milestone=none and press
> enter
>   *   Press b to boot
>   *   Login as root
>   *   (The root filesystem [mounted at /] was already read-write, not
> read-only, for me)
>   *   Rename /etc/zfs/zpool.cache to something else
>   *   Reboot (svcadm milestone all didn't work for me)
>   *   Login as root
>   *   Type zpool import and verify that all pools show up as importable
>   *   Type zpool import -a and suddenly everything was back to normal!
>
> (Yes, I typed that all out so someone searching could find the
> step-by-step recipe when it happens to them, the link above is not great
> for beginners.)
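>
> In shell form, the recovery (after booting with -m milestone=none and
> logging in as root) was roughly:
>
>   mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bad   # any new name works
>   reboot
>   # ...log back in as root, then:
>   zpool import        # no arguments: lists the pools available for import
>   zpool import -a     # imports them all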
>
Much appreciated.

>
> Any suggestions on what to look for, and where to look (which logs) would
> be greatly appreciated. Suggestions about upgrading or migrating to new
> hardware are not necessary, I already know. It's all about money - and with
> the GDP outlook for 2020 due to COVID-19, it's looking like I'll have to
> keep this server limping along a while longer.
>
It sounds like a hardware problem. Even if you can't upgrade, you may have
to replace something. My guess is that one of the HBAs the 85 HDDs are
(presumably) connected to is going bad. Also, just to be sure, check the
SMART status of the boot disk. Boot disk problems often show up as a range
of seemingly unrelated, odd system behaviors.
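For example (just a sketch; your device names will differ, and on an
install that old smartmontools may need to be installed or built
separately):

  iostat -En                      # per-device cumulative error counters,
                                  # no extra packages needed
  smartctl -a /dev/rdsk/c0t0d0s0  # SMART data; substitute your boot
                                  # disk's device name

iostat -En also prints serial numbers, which helps match a suspect device
to a physical drive bay.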

A possible workaround is described for Linux here
<https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/>.
I'm not sure what the OI equivalent is, but I do know it has kinda
mitigated (though not completely solved) my Raspberry Pi 3B+ locking up
too. At least now it's able to recover itself (which I believe is what you
want) instead of requiring me to do a power cycle. Per the comments at
that link, those symptoms tend to be related to impending disk failure
(see the boot disk comment above).
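For reference, if memory serves, the tweak there amounts to lowering the
kernel's dirty-page writeback thresholds, something like this in
/etc/sysctl.conf (then run sysctl -p to apply):

  vm.dirty_background_ratio = 5
  vm.dirty_ratio = 10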

>
> Thanks,
>
> Bart
>