On Wed, Feb 04, 2026 at 10:33:11AM -0800, Stanislav Kinsburskii wrote:
> On Wed, Feb 04, 2026 at 05:33:29AM +0000, Anirudh Rayabharam wrote:
> > On Tue, Feb 03, 2026 at 11:42:58AM -0800, Stanislav Kinsburskii wrote:
> > > On Tue, Feb 03, 2026 at 04:46:03PM +0000, Anirudh Rayabharam wrote:
> > > > On Tue, Feb 03, 2026 at 07:40:36AM -0800, Stanislav Kinsburskii wrote:
> > > > > On Tue, Feb 03, 2026 at 10:34:28AM +0530, Anirudh Rayabharam wrote:
> > > > > > On Mon, Feb 02, 2026 at 11:18:27AM -0800, Stanislav Kinsburskii 
> > > > > > wrote:
> > > > > > > On Mon, Feb 02, 2026 at 07:01:01PM +0000, Anirudh Rayabharam 
> > > > > > > wrote:
> > > > > > > > On Mon, Feb 02, 2026 at 09:10:00AM -0800, Stanislav Kinsburskii 
> > > > > > > > wrote:
> > > > > > > > > On Fri, Jan 30, 2026 at 08:32:45PM +0000, Anirudh Rayabharam 
> > > > > > > > > wrote:
> > > > > > > > > > On Fri, Jan 30, 2026 at 10:46:45AM -0800, Stanislav 
> > > > > > > > > > Kinsburskii wrote:
> > > > > > > > > > > On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh 
> > > > > > > > > > > Rayabharam wrote:
> > > > > > > > > > > > On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav 
> > > > > > > > > > > > Kinsburskii wrote:
> > > > > > > > > > > > > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh 
> > > > > > > > > > > > > Rayabharam wrote:
> > > > > > > > > > > > > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav 
> > > > > > > > > > > > > > Kinsburskii wrote:
> > > > > > > > > > > > > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh 
> > > > > > > > > > > > > > > Rayabharam wrote:
> > > > > > > > > > > > > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, 
> > > > > > > > > > > > > > > > Stanislav Kinsburskii wrote:
> > > > > > > > > > > > > > > > > The MSHV driver deposits kernel-allocated 
> > > > > > > > > > > > > > > > > pages to the hypervisor during
> > > > > > > > > > > > > > > > > runtime and never withdraws them. This 
> > > > > > > > > > > > > > > > > creates a fundamental incompatibility
> > > > > > > > > > > > > > > > > with KEXEC, as these deposited pages remain 
> > > > > > > > > > > > > > > > > unavailable to the new kernel
> > > > > > > > > > > > > > > > > loaded via KEXEC, leading to potential system 
> > > > > > > > > > > > > > > > > crashes when the new kernel accesses
> > > > > > > > > > > > > > > > > hypervisor-deposited pages.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Make MSHV mutually exclusive with KEXEC until 
> > > > > > > > > > > > > > > > > proper page lifecycle
> > > > > > > > > > > > > > > > > management is implemented.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Someone might want to stop all guest VMs and do 
> > > > > > > > > > > > > > > > a kexec, which is valid
> > > > > > > > > > > > > > > > and would work without any issue for L1VH.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > No, it won't work, and hypervisor-deposited pages
> > > > > > > > > > > > > > > won't be withdrawn.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > All pages that were deposited in the context of a 
> > > > > > > > > > > > > > guest partition (i.e.
> > > > > > > > > > > > > > with the guest partition ID), would be withdrawn 
> > > > > > > > > > > > > > when you kill the VMs,
> > > > > > > > > > > > > > right? What other deposited pages would be left?
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The driver deposits two types of pages: one for the 
> > > > > > > > > > > > > guests (withdrawn
> > > > > > > > > > > > > upon guest shutdown) and the other - for the host
> > > > > > > > > > > > > itself (never
> > > > > > > > > > > > > withdrawn).
> > > > > > > > > > > > > See hv_call_create_partition, for example: it 
> > > > > > > > > > > > > deposits pages for the
> > > > > > > > > > > > > host partition.
> > > > > > > > > > > > 
> > > > > > > > > > > > Hmm.. I see. Is it not possible to reclaim this memory 
> > > > > > > > > > > > in module_exit?
> > > > > > > > > > > > Also, can't we forcefully kill all running partitions 
> > > > > > > > > > > > in module_exit and
> > > > > > > > > > > > then reclaim memory? Would this help with kernel 
> > > > > > > > > > > > consistency
> > > > > > > > > > > > irrespective of userspace behavior?
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > It would, but this is sloppy and cannot be a long-term 
> > > > > > > > > > > solution.
> > > > > > > > > > > 
> > > > > > > > > > > It is also not reliable. We have no hook to prevent 
> > > > > > > > > > > kexec. So if we fail
> > > > > > > > > > > to kill the guest or reclaim the memory for any reason, 
> > > > > > > > > > > the new kernel
> > > > > > > > > > > may still crash.
> > > > > > > > > > 
> > > > > > > > > > Actually guests won't be running by the time we reach our 
> > > > > > > > > > module_exit
> > > > > > > > > > function during a kexec. Userspace processes would've been 
> > > > > > > > > > killed by
> > > > > > > > > > then.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > No, they will not: "kexec -e" doesn't kill user processes.
> > > > > > > > > We must not rely on the OS to do a graceful shutdown before doing
> > > > > > > > > kexec.
> > > > > > > > 
> > > > > > > > I see kexec -e is too brutal. Something like systemctl kexec is
> > > > > > > > more graceful and is probably used more commonly. In this case 
> > > > > > > > at least
> > > > > > > > we could register a reboot notifier and attempt to clean things 
> > > > > > > > up.
> > > > > > > > 
> > > > > > > > I think it is better to support kexec to this extent rather than
> > > > > > > > disabling it entirely.
> > > > > > > > 
> > > > > > > 
> > > > > > > You do understand that once our kernel is released to third 
> > > > > > > parties, we
> > > > > > > can’t control how they will use kexec, right?
> > > > > > 
> > > > > > Yes, we can't. But that's okay. It is fine for us to say that only 
> > > > > > some
> > > > > > kexec scenarios are supported and some aren't (iff you're creating 
> > > > > > VMs
> > > > > > using MSHV; if you're not creating VMs all of kexec is supported).
> > > > > > 
> > > > > 
> > > > > Well, I disagree here. If we say the kernel supports MSHV, we must
> > > > > provide a robust solution. A partially working solution is not
> > > > > acceptable. It makes us look careless and can damage our reputation 
> > > > > as a
> > > > > team (and as a company).
> > > > 
> > > > It won't if we call out upfront what is supported and what is not.
> > > > 
> > > > > 
> > > > > > > 
> > > > > > > This is a valid and existing option. We have to account for it. 
> > > > > > > Yet
> > > > > > > again, L1VH will be used by arbitrary third parties out there, 
> > > > > > > not just
> > > > > > > by us.
> > > > > > > 
> > > > > > > We can’t say the kernel supports MSHV until we close these gaps. 
> > > > > > > We must
> > > > > > 
> > > > > > We can. It is okay to say some scenarios are supported and some aren't.
> > > > > > 
> > > > > > All kexecs are supported if they never create VMs using MSHV. If 
> > > > > > they do
> > > > > > create VMs using MSHV and we implement cleanup in a reboot notifier 
> > > > > > at
> > > > > > least systemctl kexec and crashdump kexec would work, which are probably
> > > > > > the
> > > > > > most common uses of kexec. It's okay to say that this is all we 
> > > > > > support
> > > > > > as of now.
> > > > > > 
> > > > > 
> > > > > I'm repeating myself, but I'll try to put it differently.
> > > > > There won't be any kernel core collected if a page was deposited. 
> > > > > You're
> > > > > arguing for a lost cause here. Once a page is allocated and deposited,
> > > > > the crash kernel will try to write it into the core.
> > > > 
> > > > That's why we have to implement something where we attempt to destroy
> > > > partitions and reclaim memory (and BUG() out if that fails; which
> > > > hopefully should happen very rarely if at all). This should be *the*
> > > > solution we work towards. We don't need a temporary disable kexec
> > > > solution.
> > > > 
> > > 
> > > No, the solution is to preserve the shared state and pass it over via KHO.
> > 
> > Okay, then work towards it without doing a temporary KEXEC disable. We can
> > call out that kexec is not supported until then. Disabling KEXEC is too
> > intrusive.
> > 
> 
> What do you mean by "too intrusive"? The change is local to the driver's
> Kconfig. There are no verbal "callouts" in upstream Linux - that's
> exactly what Kconfig is used for. Once the proper solution is
> implemented, we can remove the restriction.
> 
> > Is there any precedent for this? Do you know if any driver ever disabled
> > KEXEC this way?
> > 
> 
> No, but there is no other similar driver like this one.

Doesn't have to be like this one. There could be issues with device
states during kexec.

> Why does it matter though?

To learn from past precedents.

> 
> > > 
> > > > > 
> > > > > > Also, what makes you think customers would even be interested in 
> > > > > > enabling
> > > > > > our module in their kernel configs if it takes away kexec?
> > > > > > 
> > > > > 
> > > > > It's simple: L1VH isn't a host, so I can spin up new VMs instead of
> > > > > servicing the existing ones.
> > > > 
> > > > And what about the L2 VM state then? They might not be throwaway in all
> > > > cases.
> > > > 
> > > 
> > > L2 guests can (and likely will) be migrated from the old L1VH to the new
> > > one.
> > > And this is most likely the current scenario customers are using.
> > > 
> > > > > 
> > > > > Why do you think there won’t be customers interested in using MSHV in
> > > > > L1VH without kexec support?
> > > > 
> > > > Because they could already be using kexec for their servicing needs or
> > > > whatever. And no, we can't just say "don't service these VMs, just spin up
> > > > new ones".
> > > > 
> > > 
> > > Are you speculating or know for sure?
> > 
> > It's a reasonable assumption that people are using kexec for servicing.
> > 
> 
> Again, using kexec for servicing is not supported: why pretend it is?

What this patch effectively asserts is that kexec is unsupported whenever the
MSHV driver is enabled. But that is not accurate. Enabling MSHV does not
necessarily imply that it is being used. The correct statement is that kexec is
unsupported only when MSHV is *in use*, i.e. when one or more VMs are
running.

By disabling kexec unconditionally, the patch prevents a valid workflow in
situations where no VMs exist and kexec would work without issue. This imposes a
blanket restriction instead of enforcing the actual requirement.

And sure, I understand there is no way to enforce that actual
requirement. So this is what I propose:

The statement "kexec is not supported when the MSHV driver is used" can be
documented on docs.microsoft.com once direct virtualization becomes broadly
available. The documentation can also provide operational guidance, such as
shutting down all VMs before invoking kexec for servicing. This preserves a
practical path for users who rely on kexec. If kexec is disabled entirely, that
flexibility is lost.

The stricter approach ensures users cannot make this mistake by accident, which
has its merits. However, my approach gives more power and discretion to
the user. In parallel, we would of course continue working on making kexec
support robust.
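
To make the earlier reboot-notifier suggestion concrete, here is a minimal
sketch (not actual driver code) of what such best-effort cleanup could look
like. register_reboot_notifier() is the existing kernel API;
mshv_withdraw_root_pages() is a hypothetical stand-in for whatever would
reclaim the pages deposited on behalf of the root partition:

#include <linux/init.h>
#include <linux/notifier.h>
#include <linux/printk.h>
#include <linux/reboot.h>

/* Hypothetical helper: reclaim pages deposited for the root partition. */
static int mshv_withdraw_root_pages(void)
{
	return 0;	/* stub for the purpose of this sketch */
}

static int mshv_reboot_notify(struct notifier_block *nb,
			      unsigned long action, void *data)
{
	/*
	 * Best-effort cleanup on the reboot/kexec prepare path.  If the
	 * hypervisor still holds deposited pages at this point, the next
	 * kernel may treat them as free RAM.
	 */
	if (mshv_withdraw_root_pages())
		pr_warn("mshv: failed to withdraw deposited pages before reboot\n");

	return NOTIFY_DONE;
}

static struct notifier_block mshv_reboot_nb = {
	.notifier_call = mshv_reboot_notify,
};

static int __init mshv_reboot_init(void)
{
	return register_reboot_notifier(&mshv_reboot_nb);
}

A real version would also unregister the notifier on module exit and decide
how to react when the withdrawal fails, since, as you pointed out, there is
no hook to actually veto the kexec at that point.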

> 
> > > 
> > > > Also, keep in mind that once L1VH is available in Azure, the distros
> > > > that run on it would be the same distros that run on all other Azure
> > > > VMs. There won't be special distros with a kernel specifically built for
> > > > L1VH. And KEXEC is generally enabled in distros. Distro vendors won't be
> > > > happy that they would need to publish a separate version of their image 
> > > > with
> > > > MSHV_ROOT enabled and KEXEC disabled because they wouldn't want KEXEC to
> > > > be disabled for all Azure VMs. Also, the customers will be confused why
> > > > the same distro doesn't work on L1VH.
> > > > 
> > > 
> > > I don't think distro happiness is our concern. They already build custom
> > 
> > If distros are not happy they won't package this and consequently
> > nobody will use it.
> > 
> 
> Could you provide an example of such issues in the past?
> 
> > > versions for Azure. They can build another custom version for L1VH if
> > > needed.
> > 
> > We should at least check if they are ready to do this.
> > 
> 
> This is a labor-intensive and long-term check. Unless there is solid
> evidence that they won't do it, I don't see the point in doing this.

It is reasonable to assume that maintaining an additional flavor of a
distro is an overhead (maintaining new package(s), maintaining Azure
marketplace images, etc.). This should be enough reason to check. Not
everything needs solid evidence. Oftentimes a reasonable suspicion
will do.

Thanks,
Anirudh.

> 
> Thanks,
> Stanislav
> 
> > Thanks,
> > Anirudh.
> > 
> > > 
> > > Anyway, I don't see the point in continuing this discussion. All points
> > > have been made, and solutions have been proposed.
> > > 
> > > If you can come up with something better in the next few days, so we at
> > > least have a chance to get it merged in the next merge window, great. If
> > > not, we should explicitly forbid the unsupported feature and move on.
> > > 
> > > Thanks,
> > > Stanislav
> > > 
> > > > Thanks,
> > > > Anirudh.
