On Jun 9, 2023, at 9:12 AM, Jon Tibble via illumos-discuss 
<[email protected]> wrote:
> 
> Hi,
> 
> I've got a stand-alone SmartOS/Triton hypervisor that's been having some 
> unscheduled, seemingly random, reboots.  This feels more core though which is 
> why I'm raising it here rather than the SmartOS ML.
> 
> This is a custom build of SmartOS 20220310T004022Z.  The two changes we apply 
> are to SEGPDEFSIZE (to 8G) and PORT_DEFAULT_PORTS (to 0x08000). This image is 
> running without any issue on multiple other machines.
> 
> There is no evidence of anything in syslog, messages, auth.log, or the 
> BMC logs.  There is nothing in /var/crash.  fmadm faulty is clean both before 
> and after.
> There is however some weird logging in last where it appears as if the system 
> went down, or was told to go down, hours before it happened.
> 
> Entries like:
> reboot  system boot             Thu Jun  8 00:22
> reboot  system down             Wed Jun  7 15:49
> and
> reboot  system boot             Sat May 27 23:58
> reboot  system down             Fri May 26 13:30
> 
> Both these reboots actually occurred a minute or so before the system boot 
> timestamp so the reboots only took a minute and there was no end-user impact 
> at that time of night.  The system is just over a year old and the BMC logs 
> show good time and a scheduled shutdown and boot logged the correct time so 
> I've no reason to think it's a BIOS battery or hardware clock issue.
> 
> I wondered if this could happen if someone scheduled a reboot with a long 
> timeout on a shutdown command (and did it for the GZ rather than a zone by 
> accident) but I've been unable to replicate the last entries in my 
> experiments with either /usr/sbin/shutdown or /usr/ucb/shutdown.  One of the 
> shutdowns came close as it didn't leave any log evidence but didn't replicate 
> those weird last entries.
> 
> Has anyone seen this before?

What exactly is the HW on this machine?  My home server, which runs OmniOS, has 
spontaneously rebooted as you describe while in the middle of a "zpool scrub", 
or a full backup, of my data pool (2x14TB mirrored HDD). It has stopped doing 
that since I removed what turns out to be a faulty drive. My problem might 
also be a faulty cable, or a faulty SATA port on my motherboard.  I'm also not 
discounting the possibility of bad airflow around the drives causing 
overheating. Removing one drive would have eliminated any one of those 
causes, so I can't yet tell them apart.

Grep for "ahci" in your /var/adm/messages near the times of rebooting.  I saw 
these occasionally around my spontaneous reboots:

May 21 03:52:52 hdc ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 7 satapkt 0xfffffeb4e1022420 timed out

It's a long shot, and I still haven't found a 100% satisfactory answer yet 
myself.
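
If plain grep turns up too much to eyeball, the check above can be narrowed 
to a window before each reboot. What follows is only a rough sketch, not a 
tested tool: the function name ahci_lines_near and the 24-hour default window 
are made up for illustration, it assumes the usual "Mon DD HH:MM:SS" syslog 
prefix with English month names, and it borrows the year from the reboot time 
because the log lines don't carry one.

```python
import re
from datetime import datetime, timedelta

# Syslog-style prefix, e.g. "May 21 03:52:52 hdc ahci: ..."
STAMP = re.compile(r"^([A-Z][a-z]{2})\s+(\d{1,2}) (\d{2}:\d{2}:\d{2}) ")


def ahci_lines_near(lines, reboot_time, window_hours=24):
    """Return lines mentioning ahci logged up to window_hours before reboot_time.

    Hypothetical helper: the name and default window are illustrative only.
    """
    hits = []
    for line in lines:
        m = STAMP.match(line)
        if not m or "ahci" not in line.lower():
            continue
        # Assumption: the log line belongs to the same year as the reboot.
        stamp = datetime.strptime(
            f"{reboot_time.year} {m.group(1)} {m.group(2)} {m.group(3)}",
            "%Y %b %d %H:%M:%S",
        )
        if timedelta(0) <= reboot_time - stamp <= timedelta(hours=window_hours):
            hits.append(line.rstrip("\n"))
    return hits
```

To use it, read /var/adm/messages into a list of lines and pass the boot time 
reported by last as a datetime, e.g. datetime(2023, 6, 8, 0, 22) for the first 
reboot in the original message.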

Dan


------------------------------------------
illumos: illumos-discuss
Permalink: 
https://illumos.topicbox.com/groups/discuss/T9b0e3a6300508f9b-Ma0b2dcca9edc2c5bca66517f