On Jun 9, 2023, at 9:12 AM, Jon Tibble via illumos-discuss <[email protected]> wrote:
>
> Hi,
>
> I've got a stand-alone SmartOS/Triton hypervisor that's been having some
> unscheduled, seemingly random, reboots. This feels more core though which is
> why I'm raising it here rather than the SmartOS ML.
>
> This is a custom build of SmartOS 20220310T004022Z. The two changes we apply
> are to SEGPDEFSIZE (to 8G) and PORT_DEFAULT_PORTS (to 0x08000). This image is
> running without any issue on multiple other machines.
>
> There is no evidence of anything in either syslog, messages, auth.log nor the
> BMC logs. There is nothing in /var/crash. fmadm faulty is clean both before
> and after.
> There is however some weird logging in last where it appears as if the system
> went down, or was told to go down, hours before it happened.
>
> Entries like:
> reboot system boot Thu Jun 8 00:22
> reboot system down Wed Jun 7 15:49
> and
> reboot system boot Sat May 27 23:58
> reboot system down Fri May 26 13:30
>
> Both these reboots actually occurred a minute or so before the system boot
> timestamp so the reboots only took a minute and there was no end-user impact
> at that time of night. The system is just over a year old and the BMC logs
> show good time and a scheduled shutdown and boot logged the correct time so
> I've no reason to think it's a BIOS battery or hardware clock issue.
>
> I wondered if this could happen if someone scheduled a reboot with a long
> timeout on a shutdown command (and did it for the GZ rather than a zone by
> accident) but I've been unable to replicate the last entries in my
> experiments with both /usr/sbin/shutdown nor /usr/ucb/shutdown. One of the
> shutdowns came close as it didn't leave any log evidence but didn't replicate
> those weird last entries.
>
> Has anyone seen this before?

What exactly is the HW on this machine?

My home server, which runs OmniOS, has spontaneously rebooted as you describe while in the middle of a "zpool scrub", or of a full backup, of my data pool (2x14TB mirrored HDD). It no longer does that, because I removed what turned out to be a faulty drive.

My problem might also be a faulty cable, or a faulty SATA port on my motherboard. I'm also not discounting the possibility of bad airflow around the drives causing overheating. Removing one drive could have eliminated any one of those causes.

Grep for "ahci" in your /var/adm/messages near the times of the reboots. I saw these occasionally around my spontaneous reboots:

May 21 03:52:52 hdc ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 7 satapkt 0xfffffeb4e1022420 timed out

It's a long shot, and I still haven't found a 100% satisfactory answer myself.

Dan
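
For anyone who wants to narrow the "grep for ahci" check to the hours just before a reboot, here is a minimal sketch in Python. It is only an illustration, not something from the thread: the function name ahci_lines_near, the window size, and the reboot time used in the example are all hypothetical, and it assumes the usual syslog timestamp layout shown in the log line above (which omits the year, so the year is borrowed from the reboot time).

```python
from datetime import datetime, timedelta


def ahci_lines_near(lines, reboot_time, window_hours=24):
    """Return log lines mentioning 'ahci' logged within window_hours before reboot_time.

    Hypothetical helper: assumes each line starts with a syslog-style
    timestamp such as 'May 21 03:52:52'.
    """
    hits = []
    for line in lines:
        if "ahci" not in line.lower():
            continue
        try:
            # Syslog timestamps omit the year; borrow it from reboot_time (assumption).
            stamp = datetime.strptime(
                f"{reboot_time.year} {line[:15]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            # Skip lines that do not start with a parseable timestamp.
            continue
        if timedelta(0) <= reboot_time - stamp <= timedelta(hours=window_hours):
            hits.append(line.rstrip("\n"))
    return hits


# Example with the warning quoted above and a made-up reboot time.
sample = [
    "May 21 03:52:52 hdc ahci: [ID 517647 kern.warning] WARNING: ahci0: "
    "watchdog port 7 satapkt 0xfffffeb4e1022420 timed out\n",
]
for hit in ahci_lines_near(sample, datetime(2023, 5, 21, 4, 0), window_hours=1):
    print(hit)
```

To run it against a real system, pass the lines of /var/adm/messages (for example from open("/var/adm/messages")) and the boot time reported by last.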
