I have CoreOS running in a VM on Hyper-V. It's using the Stable channel. 
Sometimes when it reboots automatically to apply an update, it never comes 
back up and when I check it later the VM is switched off. When the VM is 
then started manually, it comes up on the old version. This is happening with 
the latest update, from 2023.5.0 to 2079.3.0. It had similar problems 
updating to 1967.4.0 and 1911.3.0, but handled all the others OK.

Unfortunately, I don't have access to the host that's running Hyper-V. The 
host is owned by my client, so I have to go through its IT staff, and they 
can't seem to find out what's wrong or give me much information. I've asked 
for a screen capture of the reboot process, but so far they haven't been 
able to give me that.

My suspicion is that the new CoreOS version fails to boot and the VM 
reboots into the old version almost immediately, and that Hyper-V sees this 
as an instability and shuts the machine off. However, it doesn't always 
happen exactly like this. Sometimes I see multiple reboots during the 
locksmith reboot window, all of them into the old version, which then goes 
through the update and reboot process again within a few minutes. Perhaps it 
depends on how long it takes for the new version to crash, and if it's 
longer than some threshold, Hyper-V doesn't shut off the machine. I usually 
don't see any logs, even partial, for the new version, so I assume it's 
dying very early in the startup process, probably before the root is 
switched from the initrd.

I realize I really need to get the client's staff to give me better 
information, but I don't have much control over that. Their view is that 
it's something to do with their Windows Server 2012 being set up for 
Cluster Aware Updating. They think that when the VM reboots, the host is 
migrating it to a different host in the cluster, and that if we could just 
migrate the VM once and for all to a non-clustered infrastructure, the 
problem would go away. Migrating the VM is going to be a bit of a hassle 
for me, because the networking will change and last time it took a while 
for the client staff to get their NAT working properly. So I'd much prefer 
it if there was a way to fix the problem without doing that.
