Hi everybody,

Here are the notes from the last Hypervisor Live Update call, held on Monday, July 14. Thanks to everybody who was involved!
These notes are intended to bring up to speed those who could not attend the call, as well as to keep the conversation going between meetings.

----->o-----

Pasha Tatashin discussed LUO v2 and which parts could be staged in the MM tree. Andrew Morton (akpm) suggested waiting until the 6.17 merge window is complete, then rebasing and sending v3, which should be around the next time we meet. Feedback will continue to be addressed upstream in the interim. Pasha noted that there were still filesystem discussions happening, so that may change in future iterations. Jason asked whether there would be objections to the series upstream; Mike Rapoport did not believe there would be blockers that would prevent a merge. Mike noted that the systemd folks were not overly happy with the new UABI but that they can live with it. Their feedback was not necessarily about the ioctls, but rather about the systemd preference for everything to be hierarchical. Mike noted that upstream feedback preferred a mountable VFS interface because of a preference for user namespaces, but there is no use case beyond the init namespace for now. The feedback from the group was that we should define the security model first, before requiring a filesystem. I asked whether this was a one-way door or whether we could eventually go back and support a filesystem in the future; Mike suggested that we would need to continue to support char devices in that case. Pasha agreed that we could add an extension in the future if needed.

Pasha asked whether we should enforce a single open of /dev/liveupdate. Jason thought this was appropriate because of the cleanup semantics. Pratyush noted that if we support multiple opens, the agent can hand out the opened fds to the requesters, and if a requester crashes its fd goes away as well. Jason said we should do this with an ioctl that gives you a new fd with a session label. Jason was concerned about the lack of a way to preserve access control for a filesystem across kexec. I confirmed with the group on the call that there were no objections to proceeding as discussed for LUO v3. Pasha noted the only planned changes would be to enforce a single open and to use the ioctl to get a session. Jason was going to provide minor feedback upstream.

----->o-----

I pivoted to discussing the agent, affectionately referred to as LiveupdateD. As the LUO design is not yet finalized, no decisions were formally made on the LiveupdateD design. Pasha suggested that LUO v3 and LiveupdateD development could proceed in parallel. We discussed where the LiveupdateD and libluo git repositories should live; there was a general feeling that they should be decoupled from the kernel tree. Pratyush was going to create repositories on both GitHub and kernel.org.

----->o-----

Frank van der Linden then discussed his concept for a physical memory pool allocator. Frank presented this as an allocator for large contiguous stretches of memory, such as HugeTLB and guest memory, in any size that we want, including for partial mappings like PFNMAP required for Confidential Computing use cases. If we have a 1GB physical range of memory associated with guest_memfd, for example, it is fine to map a couple of megabytes of it; if you have a 1GB folio, however, you either map the entire folio or nothing at all (or break up the folio). Jason said that if you are happy with PFNMAP, you can map part of the folio; you only need to break up the folio itself if you are going to put it into the VMA as an actual folio.
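(A minimal sketch of the PFNMAP approach Jason described, for readers who want to see it spelled out: a char-device mmap handler hands out a slice of a reserved, physically contiguous 1GB region as a raw PFN mapping. base_pfn, the device, and the bounds check are assumptions for illustration, not anything from Frank's series.)

    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/sizes.h>

    /* Assumed: set when the contiguous 1GB region is reserved. */
    static unsigned long base_pfn;

    static int extmem_mmap(struct file *file, struct vm_area_struct *vma)
    {
        unsigned long size = vma->vm_end - vma->vm_start;

        /* Reject mappings that fall outside the (assumed) 1GB region. */
        if (vma->vm_pgoff + (size >> PAGE_SHIFT) > (SZ_1G >> PAGE_SHIFT))
            return -EINVAL;

        /*
         * remap_pfn_range() marks the VMA VM_IO | VM_PFNMAP, so the
         * core MM sees raw PFNs with no struct page/folio refcounting;
         * mapping a couple of megabytes of the region needs no folio,
         * and hence no folio split.
         */
        return remap_pfn_range(vma, vma->vm_start,
                               base_pfn + vma->vm_pgoff,
                               size, vma->vm_page_prot);
    }

Because GUP refuses VM_PFNMAP VMAs, such mappings cannot be used with O_DIRECT or RDMA, which is the trade-off that comes up below.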
Vishal Annapurve said that the way we are handling 1GB support today with guest_memfd hands the physical memory backing shared ranges out to the core MM so that things like gup can use it normally; this requires breaking up the folio. Jason noted, as he had mentioned previously, that you can keep the 1GB folio, use pfnmaps everywhere else, and ignore the folio. Frank noted that our interim solution of using many devdax devices for preserving memory across kexec is more convenient with an allocator underneath it, avoiding the need to specify tons of memmap= options on the kernel command line. Similar to hugetlb_cma=, this would specify external memory on the command line as node:size pairs, such as extmem=0:1G,2:4G. Only the bitmap allocator is used for creating raw CMA areas that are physical memory only (this memory is removed from memblock). Jason suggested that there may be pushback upstream against additional allocators; Frank suggested that if this forms the basis for other allocators, it may be a different story.

Vishal asked whether we should mandate PFNMAP-only mappings for guest_memfd. Jason said that if our only need for guest_memfd is to create a VMA to feed it into KVM, then using PFNMAP is reasonable, though it disables O_DIRECT, RDMA, etc. That may be appropriate for Confidential Computing, but qemu certainly relies on those today. So while it may not be possible to get rid of the folios in a broad sense, it could be possible to propose for this case. Frank noted there is a conversion layer that allocates the PFN range, does minimal hotplug as necessary, converts to HugeTLB pages, and adds them to the pool. Additionally, there is a device interface that can be used for testing. The pool and device interfaces work, as does KHO on top of this. I asked about the minimal chunk sizes and alignment; Frank said this has the same alignment requirements as CMA today. I asked about early RFC timelines and use of the biweekly Linux MM Alignment Session. Frank said an RFC about a month from now would likely be feasible, and that we should continue to discuss Jason's point about the concurrent use of folios. Jason suggested that we need an answer to what problem this is solving; if it is only to use PFNMAP memory for guest_memfd, this may be a lot of complexity when there are cheaper alternatives.

----->o-----

We transitioned to discussing PCI device preservation. Chris Li had sent out a patch series the night before that can produce the list of live update devices and their respective dependencies. For example, if you need to preserve a VF then you preserve its PF as well. Additionally, if you preserve a PCI device then the parent bridge also needs to be preserved; this dependency goes all the way up to the root bridge. Jason questioned how it will keep track of which devices are bound, since that makes the driver name part of the ABI, which isn't great. He imagined that if the device is marked as preserved for KHO, it would sit in limbo until somebody claims it. Chris said that the kernel cannot detect when we are finished probing the device. Jason suggested using an ioctl on the LUO file descriptor so this becomes the responsibility of userspace.
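(To make that last point concrete, here is a hypothetical userspace sketch of claiming a preserved PCI device through the LUO fd. The struct, ioctl number, and names below are made up for illustration; the actual UAPI is still being designed upstream.)

    /* Hypothetical sketch only: LUO has no PCI-claim UAPI yet; the
     * struct and ioctl number below are illustrative placeholders. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    struct luo_pci_claim {                /* hypothetical */
        char bdf[16];                     /* PCI address, e.g. "0000:3a:00.1" */
    };
    #define LUO_CLAIM_PCI _IOW('L', 0x40, struct luo_pci_claim)  /* hypothetical */

    int main(void)
    {
        struct luo_pci_claim claim = { .bdf = "0000:3a:00.1" };
        int fd = open("/dev/liveupdate", O_RDWR);

        if (fd < 0) {
            perror("open /dev/liveupdate");
            return 1;
        }
        /* Userspace, not the kernel, decides when driver probing is
         * finished and claims the preserved device, as suggested on
         * the call. */
        if (ioctl(fd, LUO_CLAIM_PCI, &claim) < 0)
            perror("LUO_CLAIM_PCI");
        close(fd);
        return 0;
    }

This keeps driver names out of the kernel ABI: the device just sits in limbo after kexec until the agent claims it.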
----->o-----

Next meeting will be on Monday, August 11 at 8am PDT (UTC-7); everybody is welcome: https://meet.google.com/rjn-dmzu-hgq

Topics for the next meeting:
- follow-up on sticky preservations with KHO, and any additional insight provided for MSHV use cases
- update on LUO v3 and akpm's request to rebase on top of 6.17-rc1 and send v3
- discuss enforcing a single open in LUO and support for using the ioctl to get an fd with a session label
- design discussion for luod, the agent previously referred to as LiveupdateD
- update on the physical pool allocator wrt folio support and the ability to pfnmap from within the range of memory
- update on PCI preservation, registration, and initialization, and the RFC patch series posted previously
- later: testing methodology to allow downstream consumers to qualify that live update works from one version to another
- later: reducing the blackout window during live update

Please let me know if you'd like to propose additional topics for discussion, thank you!