Peter Xu <pet...@redhat.com> writes:

> On Mon, Jan 29, 2024 at 10:44:46AM -0300, Fabiano Rosas wrote:
>> > Since we're at it, I would also like to know how you think about whether we
>> > should still suggest people using VMSD versioning, as we know that it won't
>> > work for backward migrations.
>> >
>> > My current thoughts is it is still fine, as it's easier to use, and it
>> > should still be applicable to the cases where a strict migration semantics
>> > are not required. However it's hard to justify which device needs that
>> > strictness.
>>
>> I'd prefer if we kept things strict. However I don't think we can do
>> that without having enough testing and specially, clear recipes on how
>> to add compatibility back once it gets lost. Think of that recent thread
>
> If it was broken, IMHO we should just fix it and backport to stable.
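Going back to the VMSD versioning point for a second before I go off on a
tangent below, maybe it helps to make the trade-off concrete. A rough
sketch (the "foo" device and its fields are made up, this is not code
from the tree):

  #include "qemu/osdep.h"
  #include "migration/vmstate.h"

  typedef struct FooState {
      uint32_t level;
      uint32_t frobs;          /* field added in a later QEMU release */
      bool frobs_enabled;      /* set from a machine compat property */
  } FooState;

  /*
   * Option 1: bump version_id. Easy, but the destination rejects any
   * stream with a version newer than it knows about, so new -> old
   * (backward) migration is gone.
   */
  static const VMStateDescription vmstate_foo_versioned = {
      .name = "foo",
      .version_id = 2,
      .minimum_version_id = 1,
      .fields = (const VMStateField[]) {
          VMSTATE_UINT32(level, FooState),
          VMSTATE_UINT32_V(frobs, FooState, 2),  /* only sent from v2 on */
          VMSTATE_END_OF_LIST()
      }
  };

  /*
   * Option 2: keep version_id at 1 and put the new field in a subsection
   * whose .needed hook is tied to a property that compat properties turn
   * off for older machine types. More boilerplate, but backward migration
   * keeps working as long as the old machine type is used.
   */
  static bool foo_frobs_needed(void *opaque)
  {
      FooState *s = opaque;
      return s->frobs_enabled;
  }

  static const VMStateDescription vmstate_foo_frobs = {
      .name = "foo/frobs",
      .version_id = 1,
      .minimum_version_id = 1,
      .needed = foo_frobs_needed,
      .fields = (const VMStateField[]) {
          VMSTATE_UINT32(frobs, FooState),
          VMSTATE_END_OF_LIST()
      }
  };

  static const VMStateDescription vmstate_foo = {
      .name = "foo",
      .version_id = 1,
      .minimum_version_id = 1,
      .fields = (const VMStateField[]) {
          VMSTATE_UINT32(level, FooState),
          VMSTATE_END_OF_LIST()
      },
      .subsections = (const VMStateDescription * const []) {
          &vmstate_foo_frobs,
          NULL
      }
  };

So "easier to use" really does buy a lot less typing, which is probably
why people keep reaching for the version bump even when a subsection
would keep both directions working.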
(tangent) Sure, but I'm talking about how we instruct device developers
on fixing migration bugs. We cannot simply yell "regression!" and expect
people to care. Once something breaks there's no easy way to determine
what the right fix is. It will always involve copying the migration
maintainers and some back and forth with the device people before we
reach an agreement on what's even broken.

When I say "clear recipes" what I mean is we'd have a "catalogue" of the
types of failures that can happen. Those would be both documented in
plain English and backed by some instrumentation in the code to produce
a clear error message. E.g.: "Device 'foo' failed to migrate because of
error type X: the src machine provided more state than the dst was
expecting around the value Y". And that "error type X" would come with
some docs listing examples of other similar errors and what strategies
we suggest to deal with them.

Currently most migration failures are just a completely helpless:
"blergh, error -5". And the only thing we can say about it upfront is
"well, something must have changed in the stream".

Real migration failures I have seen recently (all fixed already):

1- Some feature bit was mistakenly removed from an arm cpu. Migration
   complains about a 'length' field being different.

2- A group of devices was moved from the machine init to the cpu init
   on pseries. Migration spews some nonsense about an "index".

3- Recent (invalid) bug on -cpu max on arm, a couple of bits were set
   in a register. Migration barfs incomprehensibly with: "error while
   loading state for instance 0x0 of device 'cpu', Operation not
   permitted".

So I bet we could improve these error cases to be a bit more predictable
and that would help device developers maintain migration compatibility
without making it seem like an arbitrary, hard-to-achieve requirement.

(/tangent)
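To make that "error type X" idea a bit more concrete, here's the kind of
thing I'm imagining. Entirely made up, none of this exists in the tree
and the doc paths are hypothetical:

  #include "qemu/osdep.h"
  #include "qapi/error.h"

  /* The "catalogue": each class maps to a docs section explaining the
   * failure mode and the usual strategies to fix it. */
  typedef enum MigErrClass {
      MIG_ERR_EXTRA_SRC_STATE,    /* src sent state the dst doesn't know about */
      MIG_ERR_MISSING_SRC_STATE,  /* dst expected state the src never sent */
      MIG_ERR_SIZE_MISMATCH,      /* a 'length'/array size changed */
  } MigErrClass;

  static const char *mig_err_doc[] = {
      [MIG_ERR_EXTRA_SRC_STATE]   = "docs/devel/migration/errors.rst#extra-state",
      [MIG_ERR_MISSING_SRC_STATE] = "docs/devel/migration/errors.rst#missing-state",
      [MIG_ERR_SIZE_MISMATCH]     = "docs/devel/migration/errors.rst#size-mismatch",
  };

  /* Something like this would be called from the vmstate loading code
   * instead of returning a bare -EINVAL/-EPERM up the stack. */
  static void mig_err_report(Error **errp, MigErrClass cls,
                             const char *device, const char *field,
                             const char *detail)
  {
      error_setg(errp,
                 "Device '%s' failed to migrate (error type %d) at field "
                 "'%s': %s. See %s for how to restore compatibility.",
                 device, cls, field, detail, mig_err_doc[cls]);
  }

With that, the three cases above would at least point at a class of
problem and a documented recipe, instead of a mystery about a 'length'
or an "index".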
>
> I think Juan used to worry on what happens if someone already used an old
> version of old release, e.g., someone using 8.2.0 may not be able to
> migrate to 8.2.1 if we fix that breakage in 9.0 and backport that to 8.2.1.
> My take is that maybe that's overcomplicated, and maybe we should simply
> only maintain the latest stable version, rather than all. In this case,
> IMHO it will be less burden if we only guarantee 8.2.1 will be working,
> e.g., when migrating from 8.1.z -> 8.2.1. Then we should just state a
> known issue in 8.2.0 that it is broken, and both:
>
> (1) 8.1.z -> 8.2.0, and

Fair enough.

> (2) 8.2.0 -> 8.2.1

Do you think we may not be able to always ensure that the user can get
out of the broken version? Or do you simply think that's too much work?

I think I agree with you. It's better to have a clear statement of what
we support and make sure that works, rather than having _some_ scenarios
where the user _may_ need to shut down the VM and _some_ where they
_may_ be able to migrate out of the situation. It creates a confusing
message that I imagine would just cause people to avoid using migration
altogether.

> will expect to fail.
>
>> were we discussed an old powerpc issue. How come we can see the fix
>> today in the code but cannot tell which problem it was trying to solve?
>> That's bonkers. Ideally every type of breakage would have a mapping into
>> why it breaks and how to fix it.
>>
>> So with testing to catch the issue early and a clear step-by-step on how
>> to identify and fix compatibility, then we could require strict
>> compatibility for every device.
>
> I don't think we can guarantee no bug there, but indeed we can do better on
> providing some test framework for device VMSDs.
>
>>
>> >
>> > For example, any device to be used in migration-test must be forward +
>> > backward migration compatible at least, because you just added the n-1
>> > regression tests to cover both directions. Said that, only a few devices
>> > are involved because currently our migration-test qemu cmdline is pretty
>> > simple.
>>
>> We might want to make a distinction between migration core vs. device
>> state testing. I see n-1 testing more like migration core testing. It's
>> bad to break migration, but it's really bad to break migration for
>> everyone because we refactored something deep within migration/.
>>
>> I also wouldn't mind if we had some simple way for device developers to
>> add migration tests that cover their code. Currently it's infeasible to
>> edit migration-test with new command lines for every device of
>> interest. Maybe we could have a little framework that takes a command
>> line and spits a migration stream? Something really self-contained,
>> behind the device's CONFIG in meson.
>
> I added one more todo:
>
> https://wiki.qemu.org/ToDo/LiveMigration#Device_migration_stream_test_framework
>
> How's that look? Feel free to modify on your will.

Looks good. The point about the guest behavior influence is something
that Juan has mentioned as a blocker for testing with static data. I
don't think it would be impossible to have some unit testing at the
vmstate level with some artificial values, but it might be too much work
to be worth it.
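Just to sketch what I mean by "takes a command line and spits a
migration stream" (rough idea only, the helper below is made up and
skips the wait-for-completion logic that migration-test.c already has):

  #include "qemu/osdep.h"
  #include "libqtest.h"

  /* Start QEMU with the device under test and dump its migration
   * stream into a file using the exec: transport. */
  static void save_device_stream(const char *device_cmdline,
                                 const char *stream_path)
  {
      g_autofree char *args = g_strdup_printf("-nodefaults %s", device_cmdline);
      g_autofree char *uri = g_strdup_printf("exec:cat > %s", stream_path);
      QTestState *src = qtest_init(args);

      qtest_qmp_assert_success(src, "{ 'execute': 'migrate',"
                                    " 'arguments': { 'uri': %s } }", uri);
      /* A real version would poll query-migrate until 'completed'
       * before quitting, like tests/qtest/migration-test.c does. */
      qtest_quit(src);
  }

The device's own test would then just call something like that with its
command line (and optionally load the stream back on a second instance
started with -incoming "exec:cat < ..."), so it could live behind the
device's CONFIG option in meson instead of everything piling up in
migration-test.c.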