On Wed, Feb 04, 2026 at 05:33:29AM +0000, Anirudh Rayabharam wrote:
> On Tue, Feb 03, 2026 at 11:42:58AM -0800, Stanislav Kinsburskii wrote:
> > On Tue, Feb 03, 2026 at 04:46:03PM +0000, Anirudh Rayabharam wrote:
> > > On Tue, Feb 03, 2026 at 07:40:36AM -0800, Stanislav Kinsburskii wrote:
> > > > On Tue, Feb 03, 2026 at 10:34:28AM +0530, Anirudh Rayabharam wrote:
> > > > > On Mon, Feb 02, 2026 at 11:18:27AM -0800, Stanislav Kinsburskii wrote:
> > > > > > On Mon, Feb 02, 2026 at 07:01:01PM +0000, Anirudh Rayabharam wrote:
> > > > > > > On Mon, Feb 02, 2026 at 09:10:00AM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > On Fri, Jan 30, 2026 at 08:32:45PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > > On Fri, Jan 30, 2026 at 10:46:45AM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > > On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > > > > On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > > > > > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > > > > > > > > > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > > > > > The MSHV driver deposits kernel-allocated pages to the
> > > > > > > > > > > > > > > > hypervisor during runtime and never withdraws them. This
> > > > > > > > > > > > > > > > creates a fundamental incompatibility with KEXEC, as these
> > > > > > > > > > > > > > > > deposited pages remain unavailable to the new kernel loaded
> > > > > > > > > > > > > > > > via KEXEC, leading to potential system crashes when the new
> > > > > > > > > > > > > > > > kernel accesses hypervisor-deposited pages.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page
> > > > > > > > > > > > > > > > lifecycle management is implemented.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Someone might want to stop all guest VMs and do a kexec.
> > > > > > > > > > > > > > > Which is valid and would work without any issue for L1VH.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > No, it won't work, and hypervisor-deposited pages won't be
> > > > > > > > > > > > > > withdrawn.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > All pages that were deposited in the context of a guest
> > > > > > > > > > > > > partition (i.e. with the guest partition ID) would be withdrawn
> > > > > > > > > > > > > when you kill the VMs, right? What other deposited pages would
> > > > > > > > > > > > > be left?
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > The driver deposits two types of pages: one for the guests
> > > > > > > > > > > > (withdrawn upon guest shutdown) and the other for the host
> > > > > > > > > > > > itself (never withdrawn).
> > > > > > > > > > > > See hv_call_create_partition, for example: it deposits pages
> > > > > > > > > > > > for the host partition.
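> > > > > > > > > > > > 
> > > > > > > > > > > > The pattern looks roughly like this (illustration only, not the
> > > > > > > > > > > > actual code: do_create_partition_hypercall() is a stand-in, while
> > > > > > > > > > > > hv_call_deposit_pages() and hv_current_partition_id are real
> > > > > > > > > > > > kernel symbols):
> > > > > > > > > > > > 
> > > > > > > > > > > > static int create_partition_with_deposit(void)
> > > > > > > > > > > > {
> > > > > > > > > > > >         u64 status;
> > > > > > > > > > > >         int ret = 0;
> > > > > > > > > > > > 
> > > > > > > > > > > >         do {
> > > > > > > > > > > >                 status = do_create_partition_hypercall();  /* stand-in */
> > > > > > > > > > > >                 if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY)
> > > > > > > > > > > >                         break;
> > > > > > > > > > > >                 /* The hypervisor needs memory to track the new
> > > > > > > > > > > >                  * partition: deposit a kernel page into the *root*
> > > > > > > > > > > >                  * partition's pool and retry.  Nothing ever
> > > > > > > > > > > >                  * withdraws these pages afterwards.
> > > > > > > > > > > >                  */
> > > > > > > > > > > >                 ret = hv_call_deposit_pages(NUMA_NO_NODE,
> > > > > > > > > > > >                                             hv_current_partition_id, 1);
> > > > > > > > > > > >         } while (!ret);
> > > > > > > > > > > > 
> > > > > > > > > > > >         if (ret)
> > > > > > > > > > > >                 return ret;
> > > > > > > > > > > >         return hv_result_success(status) ? 0 : -EIO;
> > > > > > > > > > > > }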
> > > > > > > > > > > 
> > > > > > > > > > > Hmm.. I see. Is it not possible to reclaim this memory in
> > > > > > > > > > > module_exit? Also, can't we forcefully kill all running
> > > > > > > > > > > partitions in module_exit and then reclaim memory? Would this
> > > > > > > > > > > help with kernel consistency irrespective of userspace behavior?
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > It would, but this is sloppy and cannot be a long-term solution.
> > > > > > > > > > 
> > > > > > > > > > It is also not reliable. We have no hook to prevent kexec. So if
> > > > > > > > > > we fail to kill the guest or reclaim the memory for any reason,
> > > > > > > > > > the new kernel may still crash.
> > > > > > > > > 
> > > > > > > > > Actually guests won't be running by the time we reach our
> > > > > > > > > module_exit function during a kexec. Userspace processes would've
> > > > > > > > > been killed by then.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > No, they will not: "kexec -e" doesn't kill user processes.
> > > > > > > > We must not rely on the OS to do a graceful shutdown before doing
> > > > > > > > kexec.
> > > > > > > 
> > > > > > > I see kexec -e is too brutal. Something like systemctl kexec is
> > > > > > > more graceful and is probably used more commonly. In this case at
> > > > > > > least we could register a reboot notifier and attempt to clean
> > > > > > > things up.
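> > > > > > > 
> > > > > > > Roughly along these lines (untested sketch; mshv_cleanup_deposits()
> > > > > > > is just a placeholder for whatever partition-teardown/withdraw
> > > > > > > helpers the driver would actually need, only the notifier plumbing
> > > > > > > is real kernel API):
> > > > > > > 
> > > > > > > #include <linux/notifier.h>
> > > > > > > #include <linux/reboot.h>
> > > > > > > 
> > > > > > > static int mshv_reboot_notify(struct notifier_block *nb,
> > > > > > >                               unsigned long action, void *data)
> > > > > > > {
> > > > > > >         /*
> > > > > > >          * Best-effort cleanup before reboot/kexec: tear down
> > > > > > >          * remaining partitions and withdraw deposited pages so
> > > > > > >          * the next kernel doesn't stumble over memory that the
> > > > > > >          * hypervisor still owns.
> > > > > > >          */
> > > > > > >         if (mshv_cleanup_deposits())        /* placeholder helper */
> > > > > > >                 pr_err("mshv: failed to withdraw deposited pages\n");
> > > > > > >         return NOTIFY_DONE;
> > > > > > > }
> > > > > > > 
> > > > > > > static struct notifier_block mshv_reboot_nb = {
> > > > > > >         .notifier_call = mshv_reboot_notify,
> > > > > > > };
> > > > > > > 
> > > > > > > /*
> > > > > > >  * module init: register_reboot_notifier(&mshv_reboot_nb);
> > > > > > >  * module exit: unregister_reboot_notifier(&mshv_reboot_nb);
> > > > > > >  */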
> > > > > > > 
> > > > > > > I think it is better to support kexec to this extent rather than
> > > > > > > disabling it entirely.
> > > > > > > 
> > > > > > 
> > > > > > You do understand that once our kernel is released to third parties,
> > > > > > we can’t control how they will use kexec, right?
> > > > > 
> > > > > Yes, we can't. But that's okay. It is fine for us to say that only
> > > > > some kexec scenarios are supported and some aren't (iff you're creating
> > > > > VMs using MSHV; if you're not creating VMs, all of kexec is supported).
> > > > > 
> > > > 
> > > > Well, I disagree here. If we say the kernel supports MSHV, we must
> > > > provide a robust solution. A partially working solution is not
> > > > acceptable. It makes us look careless and can damage our reputation as a
> > > > team (and as a company).
> > > 
> > > It won't if we call out upfront what is supported and what is not.
> > > 
> > > > 
> > > > > > 
> > > > > > This is a valid and existing option. We have to account for it. Yet
> > > > > > again, L1VH will be used by arbitrary third parties out there, not
> > > > > > just by us.
> > > > > > 
> > > > > > We can’t say the kernel supports MSHV until we close these gaps. We must
> > > > > 
> > > > > We can. It is okay to say some scenarios are supported and some aren't.
> > > > > 
> > > > > All kexecs are supported if they never create VMs using MSHV. If they
> > > > > do create VMs using MSHV and we implement cleanup in a reboot notifier,
> > > > > at least systemctl kexec and crashdump kexec would work, which are
> > > > > probably the most common uses of kexec. It's okay to say that this is
> > > > > all we support as of now.
> > > > > 
> > > > 
> > > > I'm repeating myself, but I'll try to put it differently.
> > > > There won't be any kernel core collected if a page was deposited. You're
> > > > arguing for a lost cause here. Once a page is allocated and deposited,
> > > > the crash kernel will try to write it into the core.
> > > 
> > > That's why we have to implement something where we attempt to destroy
> > > partitions and reclaim memory (and BUG() out if that fails; which
> > > hopefully should happen very rarely if at all). This should be *the*
> > > solution we work towards. We don't need a temporary disable kexec
> > > solution.
> > > 
> > 
> > No, the solution is to preserve the shared state and pass it over via KHO.
> 
> Okay, then work towards it without doing a temporary KEXEC disable. We
> can call out that kexec is not supported until then. Disabling KEXEC is
> too intrusive.
> 

What do you mean by "too intrusive"? The change is local to the driver's
Kconfig. There are no verbal "callouts" in upstream Linux - that's
exactly what Kconfig is used for. Once the proper solution is
implemented, we can remove the restriction.
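
For illustration, the restriction amounts to a single dependency line in
drivers/hv/Kconfig, roughly like this (a sketch: the real MSHV_ROOT entry
carries more dependencies than shown, and whether to key it off KEXEC or
KEXEC_CORE is a detail for the actual patch):

config MSHV_ROOT
        tristate "Microsoft Hyper-V root partition support"
        # Deposited pages are never withdrawn, so a kexec'd kernel can end
        # up reusing memory the hypervisor still owns.
        depends on !KEXEC_CORE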

> Is there any precedent for this? Do you know if any driver ever disabled
> KEXEC this way?
> 

No, but there is no other driver quite like this one.
Why does it matter, though?

> > 
> > > > 
> > > > > Also, what makes you think customers would even be interested in
> > > > > enabling our module in their kernel configs if it takes away kexec?
> > > > > 
> > > > 
> > > > It's simple: L1VH isn't a host, so I can spin up new VMs instead of
> > > > servicing the existing ones.
> > > 
> > > And what about the L2 VM state then? They might not be throwaway in all
> > > cases.
> > > 
> > 
> > L2 guests can (and likely will) be migrated from the old L1VH to the
> > new one. And this is most likely the scenario customers are currently
> > using.
> > 
> > > > 
> > > > Why do you think there won’t be customers interested in using MSHV in
> > > > L1VH without kexec support?
> > > 
> > > Because they could already be using kexec for their servicing needs or
> > > whatever. And no, we can't just say "don't service these VMs, just spin
> > > up new ones".
> > > 
> > 
> > Are you speculating or do you know for sure?
> 
> It's a reasonable assumption that people are using kexec for servicing.
> 

Again, using kexec for servicing is not supported: why pretend that it is?

> > 
> > > Also, keep in mind that once L1VH is available in Azure, the distros
> > > that run on it would be the same distros that run on all other Azure
> > > VMs. There won't be special distros with a kernel specifically built for
> > > L1VH. And KEXEC is generally enabled in distros. Distro vendors won't be
> > > happy that they would need to publish a separate version of their image
> > > with MSHV_ROOT enabled and KEXEC disabled because they wouldn't want KEXEC to
> > > be disabled for all Azure VMs. Also, the customers will be confused why
> > > the same distro doesn't work on L1VH.
> > > 
> > 
> > I don't think distro happiness is our concern. They already build custom
> 
> If distros are not happy they won't package this and consequently
> nobody will use it.
> 

Could you provide an example of such issues in the past?

> > versions for Azure. They can build another custom version for L1VH if
> > needed.
> 
> We should at least check if they are ready to do this.
> 

This is a labor-intensive and long-term check. Unless there is solid
evidence that they won't do it, I don't see the point in doing this.

Thanks,
Stanislav

> Thanks,
> Anirudh.
> 
> > 
> > Anyway, I don't see the point in continuing this discussion. All points
> > have been made, and solutions have been proposed.
> > 
> > If you can come up with something better in the next few days, so that
> > we at least have a chance to get it merged in the next merge window,
> > great. If not, we should explicitly forbid the unsupported feature and
> > move on.
> > 
> > Thanks,
> > Stanislav
> > 
> > > Thanks,
> > > Anirudh.
