On Thu, Nov 10, 2022 at 11:29:21AM -0500, Michael S. Tsirkin wrote: > On Tue, Nov 01, 2022 at 01:03:17PM +0000, Daniel P. Berrangé wrote: > > On Tue, Nov 01, 2022 at 01:57:24PM +0100, Igor Mammedov wrote: > > > On Mon, 31 Oct 2022 11:48:58 -0400 > > > "Michael S. Tsirkin" <m...@redhat.com> wrote: > > > > > > > On Mon, Oct 31, 2022 at 01:50:24PM +0000, Daniel P. Berrangé wrote: > > > > > On Mon, Oct 31, 2022 at 01:19:30PM +0000, Daniel P. Berrangé wrote: > > > > > > The TCO watchdog is unconditionally integrated into the Q35 machine > > > > > > type by default, but at the same time is unconditionally disabled > > > > > > from firing by a host config option that overrides guest OS attempts > > > > > > to enable it. People have to know to set a magic -global to make > > > > > > it non-broken > > > > > > > > > > Incidentally I found that originally the TCO watchdog was not > > > > > unconditionally enabled. Its exposure to the guest could be > > > > > turned on/off using > > > > > > > > > > -global ICH9-LPC.enable_tco=bool > > > > > > > > > > This was implemented for machine type compat, but it also gave > > > > > apps a way to disable the watchdog functionality. Unfortunately > > > > > that ability was discarded in this series: > > > > > > > > > > > > > > > https://lore.kernel.org/all/1453564933-29638-1-git-send-email-ehabk...@redhat.com/ > > > > > > > > > > but the 'enable_tco' property still exists in QOM, but silently > > > > > ignored. > > > > > > > > > > Seems we should either fix the impl of 'enable_tco', or remove the > > > > > QOM property entirely, so we don't pretend it can be toggled anymore. > > > > > > > > > > With regards, > > > > > Daniel > > > > > > > > i am inclined to say you are right and the fix is to fix the impl. > > > > > > Is there need for users to disable whatchdog at all? > > > It was always present since then and no one complained, > > > so perhaps we should ditch property instead fixing it > > > to keep it simple. > > > > Thinking about it more, I think we should NOT fix the 'enable_tco' property, > > because there will be no way for a mgmt appp to tell if they're using a > > fixed or broken QEMU. > > This is always the case for any bug. We don't as a rule add properties > for this, it is distro's responsibility to pick bugfixes for > features users care about. > > What makes this bug different?
It is impossible to safely use the enable_tco property because it will break ABI across live migrations between old and new QEMU. If it was a recent bug I'd be ok with just fixing it. Since this property has been broken for 6 years though, IMHO it just needs to be deleted entirely. > > So if they use 'enable_tco' on a broken QEMU and then > > live migrate, they'll get an guest ABI change. If we did want to support > > disabling it, then we should have a brand new property that apps can probe > > for. > > > > In the absence of a request to disable watchdog, I'd say we just delete > > 'enable_tco' right now. If someone wants it in future, we can add it with > > a new name. > > I am really stressed about watchdog firing and resetting VMs that previously > were working fine. Adding this without a safety mechanism to quickly > disable in case of problems in the field is not wise imho. We would only toggle the ICH9-LPC.noreboot flag for new machine types, so wouldn't affect existing VMs, and if anyone has a problem they can explicitly set ICH9-LPC.noreboot back to true, or pick the older machine type again. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|