Peter Xu <pet...@redhat.com> writes:

> On Mon, Jan 29, 2024 at 10:44:46AM -0300, Fabiano Rosas wrote:
>> > Since we're at it, I would also like to know how you think about whether we
>> > should still suggest people using VMSD versioning, as we know that it won't
>> > work for backward migrations.
>> >
>> > My current thoughts is it is still fine, as it's easier to use, and it
>> > should still be applicable to the cases where a strict migration semantics
>> > are not required. However it's hard to justify which device needs that
>> > strictness.
>>
>> I'd prefer if we kept things strict. However I don't think we can do
>> that without having enough testing and specially, clear recipes on how
>> to add compatibility back once it gets lost. Think of that recent thread
>
> If it was broken, IMHO we should just fix it and backport to stable.
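Going back to the VMSD versioning point for a second before I go off on a
tangent below, maybe it helps to make the trade-off concrete. A rough
sketch (the "foo" device and its fields are made up, this is not code
from the tree):

  #include "qemu/osdep.h"
  #include "migration/vmstate.h"

  typedef struct FooState {
      uint32_t level;
      uint32_t frobs;          /* field added in a later QEMU release */
      bool frobs_enabled;      /* set from a machine compat property */
  } FooState;

  /*
   * Option 1: bump version_id. Easy, but the destination rejects any
   * stream with a version newer than it knows about, so new -> old
   * (backward) migration is gone.
   */
  static const VMStateDescription vmstate_foo_versioned = {
      .name = "foo",
      .version_id = 2,
      .minimum_version_id = 1,
      .fields = (const VMStateField[]) {
          VMSTATE_UINT32(level, FooState),
          VMSTATE_UINT32_V(frobs, FooState, 2),  /* only sent from v2 on */
          VMSTATE_END_OF_LIST()
      }
  };

  /*
   * Option 2: keep version_id at 1 and put the new field in a subsection
   * whose .needed hook is tied to a property that compat properties turn
   * off for older machine types. More boilerplate, but backward migration
   * keeps working as long as the old machine type is used.
   */
  static bool foo_frobs_needed(void *opaque)
  {
      FooState *s = opaque;
      return s->frobs_enabled;
  }

  static const VMStateDescription vmstate_foo_frobs = {
      .name = "foo/frobs",
      .version_id = 1,
      .minimum_version_id = 1,
      .needed = foo_frobs_needed,
      .fields = (const VMStateField[]) {
          VMSTATE_UINT32(frobs, FooState),
          VMSTATE_END_OF_LIST()
      }
  };

  static const VMStateDescription vmstate_foo = {
      .name = "foo",
      .version_id = 1,
      .minimum_version_id = 1,
      .fields = (const VMStateField[]) {
          VMSTATE_UINT32(level, FooState),
          VMSTATE_END_OF_LIST()
      },
      .subsections = (const VMStateDescription * const []) {
          &vmstate_foo_frobs,
          NULL
      }
  };

So "easier to use" really does buy a lot less typing, which is probably
why people keep reaching for the version bump even when a subsection
would keep both directions working.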
(tangent) Sure, but I'm talking about how we instruct device developers
on fixing migration bugs. We cannot simply yell "regression!" and expect
people to care. Once something breaks there's no easy way to determine
what the right fix is. It will always involve copying the migration
maintainers and some back and forth with the device people before we
reach an agreement on what's even broken.

When I say "clear recipes" what I mean is we'd have a "catalogue" of the
types of failures that can happen. Those would be both documented in
plain English and backed by some instrumentation in the code to produce
a clear error message. E.g.: "Device 'foo' failed to migrate because of
error type X: the src machine provided more state than the dst was
expecting around the value Y". And that "error type X" would come with
some docs listing examples of other similar errors and what strategies
we suggest to deal with them.

Currently most migration failures are just a completely helpless:
"blergh, error -5". And the only thing we can say about it upfront is
"well, something must have changed in the stream".

Real migration failures I have seen recently (all fixed already):

1- Some feature bit was mistakenly removed from an arm cpu. Migration
   complains about a 'length' field being different.

2- A group of devices was moved from the machine init to the cpu init
   on pseries. Migration spews some nonsense about an "index".

3- Recent (invalid) bug on -cpu max on arm, a couple of bits were set
   in a register. Migration barfs incomprehensibly with: "error while
   loading state for instance 0x0 of device 'cpu', Operation not
   permitted".

So I bet we could improve these error cases to be a bit more predictable
and that would help device developers maintain migration compatibility
without making it seem like an arbitrary, hard-to-achieve requirement.

(/tangent)
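To make that "error type X" idea a bit more concrete, here's the kind of
thing I'm imagining. Entirely made up, none of this exists in the tree
and the doc paths are hypothetical:

  #include "qemu/osdep.h"
  #include "qapi/error.h"

  /* The "catalogue": each class maps to a docs section explaining the
   * failure mode and the usual strategies to fix it. */
  typedef enum MigErrClass {
      MIG_ERR_EXTRA_SRC_STATE,    /* src sent state the dst doesn't know about */
      MIG_ERR_MISSING_SRC_STATE,  /* dst expected state the src never sent */
      MIG_ERR_SIZE_MISMATCH,      /* a 'length'/array size changed */
  } MigErrClass;

  static const char *mig_err_doc[] = {
      [MIG_ERR_EXTRA_SRC_STATE]   = "docs/devel/migration/errors.rst#extra-state",
      [MIG_ERR_MISSING_SRC_STATE] = "docs/devel/migration/errors.rst#missing-state",
      [MIG_ERR_SIZE_MISMATCH]     = "docs/devel/migration/errors.rst#size-mismatch",
  };

  /* Something like this would be called from the vmstate loading code
   * instead of returning a bare -EINVAL/-EPERM up the stack. */
  static void mig_err_report(Error **errp, MigErrClass cls,
                             const char *device, const char *field,
                             const char *detail)
  {
      error_setg(errp,
                 "Device '%s' failed to migrate (error type %d) at field "
                 "'%s': %s. See %s for how to restore compatibility.",
                 device, cls, field, detail, mig_err_doc[cls]);
  }

With that, the three cases above would at least point at a class of
problem and a documented recipe, instead of a mystery about a 'length'
or an "index".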
>
> I think Juan used to worry on what happens if someone already used an old
> version of old release, e.g., someone using 8.2.0 may not be able to
> migrate to 8.2.1 if we fix that breakage in 9.0 and backport that to 8.2.1.
> My take is that maybe that's overcomplicated, and maybe we should simply
> only maintain the latest stable version, rather than all. In this case,
> IMHO it will be less burden if we only guarantee 8.2.1 will be working,
> e.g., when migrating from 8.1.z -> 8.2.1. Then we should just state a
> known issue in 8.2.0 that it is broken, and both:
>
> (1) 8.1.z -> 8.2.0, and

Fair enough.

> (2) 8.2.0 -> 8.2.1

Do you think we may not be able to always ensure that the user can get
out of the broken version? Or do you simply think that's too much work?

I think I agree with you. It's better to have a clear statement of what
we support and make sure that works, rather than having _some_ scenarios
where the user _may_ need to shut down the VM and _some_ where they
_may_ be able to migrate out of the situation. It creates a confusing
message that I imagine would just cause people to avoid using migration
altogether.

> will expect to fail.
>
>> were we discussed an old powerpc issue. How come we can see the fix
>> today in the code but cannot tell which problem it was trying to solve?
>> That's bonkers. Ideally every type of breakage would have a mapping into
>> why it breaks and how to fix it.
>>
>> So with testing to catch the issue early and a clear step-by-step on how
>> to identify and fix compatibility, then we could require strict
>> compatibility for every device.
>
> I don't think we can guarantee no bug there, but indeed we can do better on
> providing some test framework for device VMSDs.
>
>>
>> >
>> > For example, any device to be used in migration-test must be forward +
>> > backward migration compatible at least, because you just added the n-1
>> > regression tests to cover both directions. Said that, only a few devices
>> > are involved because currently our migration-test qemu cmdline is pretty
>> > simple.
>>
>> We might want to make a distinction between migration core vs. device
>> state testing. I see n-1 testing more like migration core testing. It's
>> bad to break migration, but it's really bad to break migration for
>> everyone because we refactored something deep within migration/.
>>
>> I also wouldn't mind if we had some simple way for device developers to
>> add migration tests that cover their code. Currently it's infeasible to
>> edit migration-test with new command lines for every device of
>> interest. Maybe we could have a little framework that takes a command
>> line and spits a migration stream? Something really self-contained,
>> behind the device's CONFIG in meson.
>
> I added one more todo:
>
> https://wiki.qemu.org/ToDo/LiveMigration#Device_migration_stream_test_framework
>
> How's that look? Feel free to modify on your will.

Looks good. The point about the guest behavior influence is something
that Juan has mentioned as a blocker for testing with static data. I
don't think it would be impossible to have some unit testing at the
vmstate level with some artificial values, but it might be too much work
to be worth it.
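Just to sketch what I mean by "takes a command line and spits a
migration stream" (rough idea only, the helper below is made up and
skips the wait-for-completion logic that migration-test.c already has):

  #include "qemu/osdep.h"
  #include "libqtest.h"

  /* Start QEMU with the device under test and dump its migration
   * stream into a file using the exec: transport. */
  static void save_device_stream(const char *device_cmdline,
                                 const char *stream_path)
  {
      g_autofree char *args = g_strdup_printf("-nodefaults %s", device_cmdline);
      g_autofree char *uri = g_strdup_printf("exec:cat > %s", stream_path);
      QTestState *src = qtest_init(args);

      qtest_qmp_assert_success(src, "{ 'execute': 'migrate',"
                                    " 'arguments': { 'uri': %s } }", uri);
      /* A real version would poll query-migrate until 'completed'
       * before quitting, like tests/qtest/migration-test.c does. */
      qtest_quit(src);
  }

The device's own test would then just call something like that with its
command line (and optionally load the stream back on a second instance
started with -incoming "exec:cat < ..."), so it could live behind the
device's CONFIG option in meson instead of everything piling up in
migration-test.c.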