On Tue, Feb 03, 2026 at 04:46:03PM +0000, Anirudh Rayabharam wrote:
> On Tue, Feb 03, 2026 at 07:40:36AM -0800, Stanislav Kinsburskii wrote:
> > On Tue, Feb 03, 2026 at 10:34:28AM +0530, Anirudh Rayabharam wrote:
> > > On Mon, Feb 02, 2026 at 11:18:27AM -0800, Stanislav Kinsburskii wrote:
> > > > On Mon, Feb 02, 2026 at 07:01:01PM +0000, Anirudh Rayabharam wrote:
> > > > > On Mon, Feb 02, 2026 at 09:10:00AM -0800, Stanislav Kinsburskii wrote:
> > > > > > On Fri, Jan 30, 2026 at 08:32:45PM +0000, Anirudh Rayabharam wrote:
> > > > > > > On Fri, Jan 30, 2026 at 10:46:45AM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > > On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > > > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > > > > > > > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > > > The MSHV driver deposits kernel-allocated pages to the
> > > > > > > > > > > > > > hypervisor during runtime and never withdraws them. This
> > > > > > > > > > > > > > creates a fundamental incompatibility with KEXEC, as these
> > > > > > > > > > > > > > deposited pages remain unavailable to the new kernel
> > > > > > > > > > > > > > loaded via KEXEC, leading to potential system crashes when
> > > > > > > > > > > > > > that kernel accesses hypervisor-deposited pages.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page
> > > > > > > > > > > > > > lifecycle management is implemented.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Someone might want to stop all guest VMs and do a kexec,
> > > > > > > > > > > > > which is valid and would work without any issue for L1VH.
> > > > > > > > > > > >
> > > > > > > > > > > > No, it won't work, and hypervisor-deposited pages won't be
> > > > > > > > > > > > withdrawn.
> > > > > > > > > > >
> > > > > > > > > > > All pages that were deposited in the context of a guest
> > > > > > > > > > > partition (i.e. with the guest partition ID) would be
> > > > > > > > > > > withdrawn when you kill the VMs, right? What other deposited
> > > > > > > > > > > pages would be left?
> > > > > > > > > >
> > > > > > > > > > The driver deposits two types of pages: one for the guests
> > > > > > > > > > (withdrawn upon guest shutdown) and the other for the host
> > > > > > > > > > itself (never withdrawn). See hv_call_create_partition, for
> > > > > > > > > > example: it deposits pages for the host partition.
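(For illustration only: the asymmetry described above might look roughly
like the sketch below. hv_call_deposit_pages() and hv_current_partition_id
are existing Hyper-V symbols in the kernel; the wrapper function and the
page counts are hypothetical and not the driver's actual code.)

#include <linux/numa.h>
#include <asm/mshyperv.h>

/* Hypothetical wrapper, for illustration only. */
static int mshv_example_create_partition(u64 guest_pt_id)
{
	int ret;

	/*
	 * Pages deposited against the *host* (root) partition ID back the
	 * hypervisor-side bookkeeping for the new guest. Nothing withdraws
	 * these at runtime, which is the KEXEC problem discussed above.
	 */
	ret = hv_call_deposit_pages(NUMA_NO_NODE, hv_current_partition_id, 1);
	if (ret)
		return ret;

	/*
	 * Pages deposited against the guest partition ID, by contrast, are
	 * withdrawn when the guest is torn down, so they are not the issue.
	 */
	return hv_call_deposit_pages(NUMA_NO_NODE, guest_pt_id, 1);
}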
> > > > > > > > > Hmm.. I see. Is it not possible to reclaim this memory in
> > > > > > > > > module_exit? Also, can't we forcefully kill all running
> > > > > > > > > partitions in module_exit and then reclaim memory? Would this
> > > > > > > > > help with kernel consistency irrespective of userspace
> > > > > > > > > behavior?
> > > > > > > >
> > > > > > > > It would, but this is sloppy and cannot be a long-term solution.
> > > > > > > >
> > > > > > > > It is also not reliable. We have no hook to prevent kexec. So if
> > > > > > > > we fail to kill the guest or reclaim the memory for any reason,
> > > > > > > > the new kernel may still crash.
> > > > > > >
> > > > > > > Actually guests won't be running by the time we reach our
> > > > > > > module_exit function during a kexec. Userspace processes would've
> > > > > > > been killed by then.
> > > > > >
> > > > > > No, they will not: "kexec -e" doesn't kill user processes.
> > > > > > We must not rely on the OS to do a graceful shutdown before doing
> > > > > > kexec.
> > > > >
> > > > > I see, kexec -e is too brutal. Something like systemctl kexec is more
> > > > > graceful and is probably used more commonly. In this case at least we
> > > > > could register a reboot notifier and attempt to clean things up.
> > > > >
> > > > > I think it is better to support kexec to this extent rather than
> > > > > disabling it entirely.
> > > >
> > > > You do understand that once our kernel is released to third parties,
> > > > we can't control how they will use kexec, right?
> > >
> > > Yes, we can't. But that's okay. It is fine for us to say that only some
> > > kexec scenarios are supported and some aren't (if you're creating VMs
> > > using MSHV; if you're not creating VMs, all of kexec is supported).
> >
> > Well, I disagree here. If we say the kernel supports MSHV, we must
> > provide a robust solution. A partially working solution is not
> > acceptable. It makes us look careless and can damage our reputation as
> > a team (and as a company).
>
> It won't if we call out upfront what is supported and what is not.
>
> > > > This is a valid and existing option. We have to account for it. Yet
> > > > again, L1VH will be used by arbitrary third parties out there, not
> > > > just by us.
> > > >
> > > > We can't say the kernel supports MSHV until we close these gaps. We must
> > >
> > > We can. It is okay to say some scenarios are supported and some aren't.
> > >
> > > All kexecs are supported if they never create VMs using MSHV. If they
> > > do create VMs using MSHV and we implement cleanup in a reboot notifier,
> > > at least systemctl kexec and crashdump kexec would work, which are
> > > probably the most common uses of kexec. It's okay to say that this is
> > > all we support as of now.
> >
> > I'm repeating myself, but I'll try to put it differently.
> > There won't be any kernel core collected if a page was deposited. You're
> > arguing for a lost cause here. Once a page is allocated and deposited,
> > the crash kernel will try to write it into the core.
>
> That's why we have to implement something where we attempt to destroy
> partitions and reclaim memory (and BUG() out if that fails; which
> hopefully should happen very rarely if at all). This should be *the*
> solution we work towards. We don't need a temporary "disable kexec"
> solution.
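(For concreteness, a minimal sketch of the reboot-notifier approach
proposed above is shown below, assuming the driver grows teardown helpers.
register_reboot_notifier() is the existing kernel API;
mshv_destroy_all_partitions() and mshv_withdraw_root_pages() are
hypothetical names, not existing functions.)

#include <linux/init.h>
#include <linux/notifier.h>
#include <linux/reboot.h>
#include <linux/bug.h>

/* Hypothetical stand-ins for the teardown paths discussed above. */
static int mshv_destroy_all_partitions(void) { return 0; }
static int mshv_withdraw_root_pages(void) { return 0; }

static int mshv_reboot_notify(struct notifier_block *nb,
			      unsigned long action, void *data)
{
	int ret;

	/* Best-effort cleanup before a graceful reboot or kexec. */
	ret = mshv_destroy_all_partitions();
	if (!ret)
		ret = mshv_withdraw_root_pages();

	/*
	 * If deposited pages remain, the next kernel may crash when it
	 * touches them, so failing loudly is arguably the lesser evil.
	 */
	BUG_ON(ret);

	return NOTIFY_DONE;
}

static struct notifier_block mshv_reboot_nb = {
	.notifier_call = mshv_reboot_notify,
};

/* Wired up from the driver's module_init() path. */
static int __init mshv_reboot_hook_init(void)
{
	return register_reboot_notifier(&mshv_reboot_nb);
}

(Note that reboot notifiers are invoked on the graceful reboot and kexec
paths, but not on a panic, which is part of the crash-dump objection
below.)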
No, the solution is to preserve the shared state and pass it over via KHO.

> > > Also, what makes you think customers would even be interested in
> > > enabling our module in their kernel configs if it takes away kexec?
> >
> > It's simple: L1VH isn't a host, so I can spin up new VMs instead of
> > servicing the existing ones.
>
> And what about the L2 VM state then? They might not be throwaway in all
> cases.

L2 guests can (and likely will) be migrated from the old L1VH to the new
one. And this is most likely the current scenario customers are using.

> > Why do you think there won't be customers interested in using MSHV in
> > L1VH without kexec support?
>
> Because they could already be using kexec for their servicing needs or
> whatever. And no, we can't just say "don't service these VMs, just spin
> up new ones".

Are you speculating, or do you know for sure?

> Also, keep in mind that once L1VH is available in Azure, the distros
> that run on it would be the same distros that run on all other Azure
> VMs. There won't be special distros with a kernel specifically built for
> L1VH. And KEXEC is generally enabled in distros. Distro vendors won't be
> happy about having to publish a separate version of their image with
> MSHV_ROOT enabled and KEXEC disabled, because they wouldn't want KEXEC
> disabled for all Azure VMs. Also, the customers will be confused why the
> same distro doesn't work on L1VH.

I don't think distro happiness is our concern. They already build custom
versions for Azure. They can build another custom version for L1VH if
needed.

Anyway, I don't see the point in continuing this discussion. All points
have been made, and solutions have been proposed. If you can come up with
something better in the next few days, so that we at least have a chance
to get it merged in the next merge window, great. If not, we should
explicitly forbid the unsupported feature and move on.

Thanks,
Stanislav

> Thanks,
> Anirudh.
