On Thu, Jul 26, 2018 at 07:30:34AM -0500, Michael Roth wrote: > Quoting David Gibson (2018-07-26 00:07:46) > > On Thu, May 03, 2018 at 11:20:44PM -0500, Michael Roth wrote: > > > In some cases (e.g. spapr) we record guest timebase after qmp_stop() > > > via a runstate hook so we can restore it on qmp_cont(). If a migration > > > occurs in between those events we end up saving it again, this time > > > based on the current timebase the guest would be seeing had it been > > > running. This has the effect of advancing the guest timebase while > > > it is stopped, which is not what the code intends. > > > > > > Other than simple jumps in time, this has been seen to trigger what > > > appear to be RCU-related crashes in recent kernels when the advance > > > exceeds rcu_cpu_stall_timeout, and it can be triggered by fairly > > > common operations such as `virsh migrate ... --timeout 60`. > > > > > > Cc: Alexey Kardashevskiy <a...@ozlabs.ru> > > > Cc: David Gibson <da...@gibson.dropbear.id.au> > > > Cc: Laurent Vivier <lviv...@redhat.com> > > > Cc: qemu-...@nongnu.org > > > Cc: qemu-sta...@nongnu.org > > > Signed-off-by: Michael Roth <mdr...@linux.vnet.ibm.com> > > > > Sorry, this fell off my radar for ages, but I've finally had a chance > > to look at it properly. > > No problem, I had assumed it just wasn't deemed needed based on the > discussion. > > > > > I'm not totally convinced this handle all the possible edge cases > > correctly, but I am convinced it gives behaviour that's more correct > > than we have now. It doesn't introduce changes to the interface or > > migration stream that would break things in future, so I've applied it > > to ppc-for-3.1. > > There's a flaw with the patch that Greg pointed out to me previously > in that it doesn't just avoid the presave hook when vm_stop is issued > manually via monitor, but also when vm_stop is issued by the migration > code itself. The latter wasn't intentional. I'm not sure if we currently > have a good way to distinguish between the 2 cases to rectify that. > > As the patch currently stands it is effectively a revert of the > Laurent's original patch. > > FWIW, the RCU-related crashes mentioned in the original commit turned > out to be a separate issue that was exacerbated by RCU stall warnings > messages (which *would* be less likely with this patch). So this would > be more about user experience and spurious log messages than an actual > known bug fix. I am still of the opinion that we shouldn't resave the > clock if the vm_stop was manually-issued; it's just that the current > patch oversteps that goal.
Right, I agree. Migrating shouldn't advance the time if we've already explicitly stopped. But it's not really clear how to accomplish that :/. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
signature.asc
Description: PGP signature