Hi everybody,

Here are the notes from the last Hypervisor Live Update call, held on Monday, July 14. Thanks to everybody who was involved!
These notes are intended to bring up to speed those who could not attend the call, as well as to keep the conversation going between meetings.

----->o-----

Pasha Tatashin discussed LUO v2 and which parts could be staged in the MM tree. Andrew Morton (akpm) suggested waiting until the 6.17 merge window is complete, then rebasing and sending v3, which should be around the next time we meet. Feedback will continue to be addressed upstream in the interim. Pasha noted that there were still filesystem discussions happening, so that may change in future iterations. Jason asked whether there would be objections to the series upstream; Mike Rapoport did not believe there would be blockers that would prevent a merge. Mike noted that the systemd folks were not overly happy with the new UABI but that they can live with it. Their feedback was not necessarily about the ioctls, but rather about the systemd preference for everything to be hierarchical. Mike noted that upstream feedback preferred a mountable VFS interface because of a preference for user namespaces, but there is no use case beyond the init namespace for now. The feedback from the group was that we should define the security model first, before requiring a filesystem. I asked whether this was a one-way door or whether we could eventually go back and support a filesystem in the future; Mike suggested that we would need to continue to support char devices in that case. Pasha agreed that we could add an extension in the future if needed.

Pasha asked whether we should enforce a single open of /dev/liveupdate. Jason thought this was appropriate because of the cleanup semantics. Pratyush noted that if we support multiple opens, the agent can hand out the opened fds to the requesters, and if a requester crashes its fd goes away as well. Jason said we should do this with an ioctl that gives you a new fd with a session label. Jason was concerned about the lack of a way to preserve access control for a filesystem across kexec. I confirmed with the group on the call that there were no objections to proceeding as discussed for LUO v3. Pasha noted the only planned changes would be to enforce a single open and to use the ioctl to get a session. Jason was going to provide minor feedback upstream.

----->o-----

I pivoted to discussing the agent, affectionately referred to as LiveupdateD. As the LUO design is not yet finalized, no decisions were formally made on the LiveupdateD design. Pasha suggested that LUO v3 and LiveupdateD development could proceed in parallel. We discussed where the LiveupdateD and libluo git repositories should live; there was a general feeling that they should be decoupled from the kernel tree. Pratyush was going to create repositories on both GitHub and kernel.org.

----->o-----

Frank van der Linden then discussed his concept for a physical memory pool allocator. Frank presented this as an allocator for large contiguous stretches of memory, such as HugeTLB and guest memory, in any size that we want, including for partial mappings like PFNMAP required for Confidential Computing use cases. If we have a 1GB physical range of memory associated with guest_memfd, for example, it is fine to map a couple of megabytes of it; if you have a 1GB folio, however, you either map the entire folio or nothing at all (or break up the folio). Jason said that if you are happy with PFNMAP, you can map part of the folio; you only need to break up the folio itself if you are going to put it into the VMA as an actual folio.
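(A minimal sketch of the PFNMAP approach Jason described, for readers who want to see it spelled out: a char-device mmap handler hands out a slice of a reserved, physically contiguous 1GB region as a raw PFN mapping. base_pfn, the device, and the bounds check are assumptions for illustration, not anything from Frank's series.)

    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/sizes.h>

    /* Assumed: set when the contiguous 1GB region is reserved. */
    static unsigned long base_pfn;

    static int extmem_mmap(struct file *file, struct vm_area_struct *vma)
    {
        unsigned long size = vma->vm_end - vma->vm_start;

        /* Reject mappings that fall outside the (assumed) 1GB region. */
        if (vma->vm_pgoff + (size >> PAGE_SHIFT) > (SZ_1G >> PAGE_SHIFT))
            return -EINVAL;

        /*
         * remap_pfn_range() marks the VMA VM_IO | VM_PFNMAP, so the
         * core MM sees raw PFNs with no struct page/folio refcounting;
         * mapping a couple of megabytes of the region needs no folio,
         * and hence no folio split.
         */
        return remap_pfn_range(vma, vma->vm_start,
                               base_pfn + vma->vm_pgoff,
                               size, vma->vm_page_prot);
    }

Because GUP refuses VM_PFNMAP VMAs, such mappings cannot be used with O_DIRECT or RDMA, which is the trade-off that comes up below.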
Vishal Annapurve said that the way we are handling 1GB support today with guest_memfd hands the physical memory backing shared ranges out to the core MM so that things like gup can use it normally; this requires breaking up the folio. Jason noted, as he had mentioned previously, that you can keep the 1GB folio, use pfnmaps everywhere else, and ignore the folio. Frank noted that our interim solution of using many devdax devices for preserving memory across kexec is more convenient with an allocator underneath it, avoiding the need to specify tons of memmap= options on the kernel command line. Similar to hugetlb_cma=, this would specify external memory on the command line as node:size pairs, such as extmem=0:1G,2:4G. Only the bitmap allocator is used for creating raw CMA areas that are physical memory only (this memory is removed from memblock). Jason suggested that there may be pushback upstream against additional allocators; Frank suggested that if this forms the basis for other allocators, it may be a different story.

Vishal asked whether we should mandate PFNMAP-only mappings for guest_memfd. Jason said that if our only need for guest_memfd is to create a VMA to feed it into KVM, then using PFNMAP is reasonable, though it disables O_DIRECT, RDMA, etc. That may be appropriate for Confidential Computing, but qemu certainly relies on those today. So while it may not be possible to get rid of the folios in a broad sense, it could be possible to propose for this case. Frank noted there is a conversion layer that allocates the PFN range, does minimal hotplug as necessary, converts to HugeTLB pages, and adds them to the pool. Additionally, there is a device interface that can be used for testing. The pool and device interfaces work, as does KHO on top of this. I asked about the minimal chunk sizes and alignment; Frank said this has the same alignment requirements as CMA today. I asked about early RFC timelines and use of the biweekly Linux MM Alignment Session. Frank said an RFC about a month from now would likely be feasible, and that we should continue to discuss Jason's point about the concurrent use of folios. Jason suggested that we need an answer to what problem this is solving; if it is only to use PFNMAP memory for guest_memfd, this may be a lot of complexity when there are cheaper alternatives.

----->o-----

We transitioned to discussing PCI device preservation. Chris Li had sent out a patch series the night before that can produce the list of live update devices and their respective dependencies. For example, if you need to preserve a VF then you preserve its PF as well. Additionally, if you preserve a PCI device then the parent bridge also needs to be preserved; this dependency goes all the way up to the root bridge. Jason questioned how it will keep track of which devices are bound, since that makes the driver name part of the ABI, which isn't great. He imagined that if the device is marked as preserved for KHO, it would sit in limbo until somebody claims it. Chris said that the kernel cannot detect when we are finished probing the device. Jason suggested using an ioctl on the LUO file descriptor so this becomes the responsibility of userspace.
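(To make that last point concrete, here is a hypothetical userspace sketch of claiming a preserved PCI device through the LUO fd. The struct, ioctl number, and names below are made up for illustration; the actual UAPI is still being designed upstream.)

    /* Hypothetical sketch only: LUO has no PCI-claim UAPI yet; the
     * struct and ioctl number below are illustrative placeholders. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    struct luo_pci_claim {                /* hypothetical */
        char bdf[16];                     /* PCI address, e.g. "0000:3a:00.1" */
    };
    #define LUO_CLAIM_PCI _IOW('L', 0x40, struct luo_pci_claim)  /* hypothetical */

    int main(void)
    {
        struct luo_pci_claim claim = { .bdf = "0000:3a:00.1" };
        int fd = open("/dev/liveupdate", O_RDWR);

        if (fd < 0) {
            perror("open /dev/liveupdate");
            return 1;
        }
        /* Userspace, not the kernel, decides when driver probing is
         * finished and claims the preserved device, as suggested on
         * the call. */
        if (ioctl(fd, LUO_CLAIM_PCI, &claim) < 0)
            perror("LUO_CLAIM_PCI");
        close(fd);
        return 0;
    }

This keeps driver names out of the kernel ABI: the device just sits in limbo after kexec until the agent claims it.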
----->o-----

Next meeting will be on Monday, August 11 at 8am PDT (UTC-7); everybody is welcome: https://meet.google.com/rjn-dmzu-hgq

Topics for the next meeting:
- follow-up on sticky preservations with KHO, and any additional insight provided for MSHV use cases
- update on LUO v3 and akpm's request to rebase on top of 6.17-rc1 and send v3
- discuss enforcing a single open in LUO and support for using the ioctl to get an fd with a session label
- design discussion for luod, the agent previously referred to as LiveupdateD
- update on the physical pool allocator wrt folio support and the ability to pfnmap from within the range of memory
- update on PCI preservation, registration, and initialization, and the RFC patch series posted previously
- later: testing methodology to allow downstream consumers to qualify that live update works from one version to another
- later: reducing the blackout window during live update

Please let me know if you'd like to propose additional topics for discussion, thank you!