On Mon, Feb 02, 2026 at 09:10:00AM -0800, Stanislav Kinsburskii wrote:
> On Fri, Jan 30, 2026 at 08:32:45PM +0000, Anirudh Rayabharam wrote:
> > On Fri, Jan 30, 2026 at 10:46:45AM -0800, Stanislav Kinsburskii wrote:
> > > On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> > > > On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > > > > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > > > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii 
> > > > > > wrote:
> > > > > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam 
> > > > > > > wrote:
> > > > > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii 
> > > > > > > > wrote:
> > > > > > > > > The MSHV driver deposits kernel-allocated pages to the 
> > > > > > > > > hypervisor during
> > > > > > > > > runtime and never withdraws them. This creates a fundamental 
> > > > > > > > > incompatibility
> > > > > > > > > with KEXEC, as these deposited pages remain unavailable to 
> > > > > > > > > the new kernel
> > > > > > > > > loaded via KEXEC, leading to potential system crashes when
> > > > > > > > > the new kernel accesses hypervisor-deposited pages.
> > > > > > > > > 
> > > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page 
> > > > > > > > > lifecycle
> > > > > > > > > management is implemented.
> > > > > > > > 
> > > > > > > > Someone might want to stop all guest VMs and do a kexec, which
> > > > > > > > is valid and would work without any issue for L1VH.
> > > > > > > > 
> > > > > > > 
> > > > > > > No, it won't work, and hypervisor-deposited pages won't be
> > > > > > > withdrawn.
> > > > > > 
> > > > > > All pages that were deposited in the context of a guest partition 
> > > > > > (i.e.
> > > > > > with the guest partition ID), would be withdrawn when you kill the 
> > > > > > VMs,
> > > > > > right? What other deposited pages would be left?
> > > > > > 
> > > > > 
> > > > > The driver deposits two types of pages: one for the guests (withdrawn
> > > > > upon guest shutdown) and the other for the host itself (never
> > > > > withdrawn).
> > > > > See hv_call_create_partition, for example: it deposits pages for the
> > > > > host partition.
> > > > 
> > > > Hmm.. I see. Is it not possible to reclaim this memory in module_exit?
> > > > Also, can't we forcefully kill all running partitions in module_exit and
> > > > then reclaim memory? Would this help with kernel consistency
> > > > irrespective of userspace behavior?
> > > > 
> > > 
> > > It would, but this is sloppy and cannot be a long-term solution.
> > > 
> > > It is also not reliable. We have no hook to prevent kexec. So if we fail
> > > to kill the guest or reclaim the memory for any reason, the new kernel
> > > may still crash.
> > 
> > Actually guests won't be running by the time we reach our module_exit
> > function during a kexec. Userspace processes would've been killed by
> > then.
> > 
> 
> No, they will not: "kexec -e" doesn't kill user processes.
> We must not rely on the OS to do a graceful shutdown before doing
> kexec.

I see, kexec -e is too brutal. Something like systemctl kexec is
more graceful and is probably used more commonly. In that case we could
at least register a reboot notifier and attempt to clean things up.

I think it is better to support kexec to this extent rather than
disabling it entirely.
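
Roughly, I'm thinking of something along these lines. This is only a
sketch: mshv_root_reclaim_all() is a hypothetical helper that would kill
any remaining partitions and withdraw the host-deposited pages, not
something the driver has today.

#include <linux/init.h>
#include <linux/notifier.h>
#include <linux/printk.h>
#include <linux/reboot.h>

/* Hypothetical helper: kill remaining partitions and withdraw the pages
 * deposited on behalf of the host partition. */
int mshv_root_reclaim_all(void);

static int mshv_reboot_notify(struct notifier_block *nb,
                              unsigned long action, void *data)
{
        /* Best-effort cleanup before the new kernel takes over. */
        if (mshv_root_reclaim_all())
                pr_err("mshv: failed to withdraw deposited pages\n");

        return NOTIFY_DONE;
}

static struct notifier_block mshv_reboot_nb = {
        .notifier_call = mshv_reboot_notify,
};

static int __init mshv_reboot_init(void)
{
        return register_reboot_notifier(&mshv_reboot_nb);
}

The notifier would of course also need to be unregistered with
unregister_reboot_notifier() if the module is ever unloaded.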

> 
> > Also, why is this sloppy? Isn't this what module_exit should be
> > doing anyway? If someone unloads our module we should be trying to
> > clean everything up (including killing guests) and reclaim memory.
> > 
> 
> Kexec does not unload modules, but it wouldn't really matter even if it
> did.
> There are other means to plug into the reboot flow, but none of them
> is robust or reliable.
> 
> > In any case, we can BUG() out if we fail to reclaim the memory. That would
> > stop the kexec.
> > 
> 
> By killing the whole system? This is not a good user experience and I
> don't see how this can be justified.

It is justified because, as you said, once we reach that failure we can
no longer guarantee integrity, so BUG() makes sense. The BUG() would
cause the system to go through a full reboot and restore integrity.
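
Concretely, something like this in the cleanup path (again just a
sketch, reusing the hypothetical mshv_root_reclaim_all() helper from
the reboot-notifier idea above; mshv_prepare_for_kexec() is made up as
well):

#include <linux/bug.h>

/* Hypothetical helper, same as in the reboot-notifier sketch. */
int mshv_root_reclaim_all(void);

static void mshv_prepare_for_kexec(void)
{
        /* If the deposited pages cannot be withdrawn, don't hand the
         * new kernel a memory map the hypervisor still owns parts of.
         * BUG() stops the task driving the kexec, so the kexec does
         * not proceed and, as argued above, the system goes through a
         * full reboot instead. */
        if (mshv_root_reclaim_all())
                BUG();
}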

> 
> > This is a better solution than disabling KEXEC outright: our driver
> > would have made the best possible effort to make kexec work.
> > 
> 
> How is an unreliable feature leading to potential system crashes better
> than disabling kexec outright?

Because there are ways of using the feature reliably. What if someone
has MSHV_ROOT enabled but never starts a VM? (Just because someone has
our driver enabled in the kernel doesn't mean they're using it.) What
about crash dump?

It is far better to support some of these scenarios and be unreliable in
some corner cases than to disable the feature completely.

Also, I'm curious whether any other driver in the kernel has ever done
this (force-disabled KEXEC).

> 
> It's the complete opposite for me: the latter provides limited but
> robust functionality, while the former provides unreliable and
> unpredictable behavior.
> 
> > > 
> > > There are two long-term solutions:
> > >  1. Add a way to prevent kexec when there is shared state between the 
> > > hypervisor and the kernel.
> > 
> > I honestly think we should focus efforts on making kexec work rather
> > than finding ways to prevent it.
> > 
> 
> There is no argument about it. But until we have it fixed properly, we
> have two options: either disable kexec or stop claiming we have our
> driver up and ready for external customers. Given the importance of
> this driver for current projects, I believe the better way would be to
> explicitly limit the functionality instead of postponing the
> productization of the driver.

It is okay to claim our driver is ready even if it doesn't support all
kexec cases. If we can support the common cases, such as crash dump and
maybe kexec-based servicing (pretty sure people do systemctl kexec with
proper teardown, and not kexec -e, for this), we can claim that our
driver is ready for general use.

Thanks,
Anirudh.

