Hi everybody,

Here are the notes from the last Hypervisor Live Update call that happened
on Monday, October 6.  Thanks to everybody who was involved!

These notes are intended to bring people up to speed who could not attend
the call as well as keep the conversation going in between meetings.

----->o-----
Pasha started off with LUO v4 discussion points.  As discovered during
iommu preservation series review, we needed dependency tracking for
situations where one fd depends on a resource.  There are two options for
this dependency tracking:

 - Option 1: when the fd is preserved, through a callback,
   can_preserve(), that determines whether it can be preserved (and
   whether all dependencies are already preserved)

 - Option 2: when we go into prepare(), we check the interdependencies,
   which also allows for cross dependencies (A depending on B, B
   depending on A)

In Option 1, userspace must declare which dependencies must be grabbed in
which order; this is not a requirement for Option 2.  Pratyush suggested
that we could start with Option 1, including for unpreserve.

Pratyush suggested there may be an Option 3 where LUO preserves a group of
fds, and they get checked as an entire unit (similar to fdbox).  Pasha said
that grouping restore is not possible, same as session grouping.

Praveen Kumar suggested that ordering is important; Pasha agreed, but said
that the question was when this happens: during preserve or during
prepare.  Jason suggested that preserve would need to know the
dependencies at that time (Option 1).  He said that we wouldn't be able to
allow the memfd to become mutable until all the fds are put back on it;
the sequencing would require the ordering to be established at the time of
preserve.

Pasha suggested that when we go to the prepare phase, it is likely fine to
have them in a different order as long as we are 100% sure that the memfd
is going to be serialized.  Jason said that when the memfd is frozen, we
can't have inconsistencies; the iommu can take the page pin but you can
still ftruncate() the memfd, and that causes the freeing of the memory to
be delayed by the iommu.

The conclusion then was that Option 2 was not possible; we need to know
the right sequence before prepare.  Prepare would likely need to do its
work in the same order that can_preserve() was called.

Pratyush said LUO v4 already had per-fd freezing so we could force
userspace to do it; for example, when you preserve iommufd you just force
userspace to first preserve and prepare the memfd that it depends on.  In
this case, the kernel is just checking and enforcing this.

The consensus was to check the dependencies when you do the ioctl and then
fail the ioctl if needed; userspace does all the ordering required.

----->o-----
Pasha asked if we still want userspace to provide tokens, which was
previously needed because we had global fd preservation; now, each session
has a token that starts at zero: whatever was preserved first has token 0,
what was preserved second has token 1, etc.

Jason suggested against the kernel issuing tokens again: the token allowed
predecessor and successor VMMs to have an ABI where they can say this
object is this thing with this token, and they can then pull it out with
that token.

Pratyush asked what we want to solve with kernel issued tokens.  Pasha
suggested we might be able to use tokens for ordering.  Pratyush said that
if userspace uses tokens then they can use the same scheme.  We decided
that tokens should be removed from ordering.

----->o-----
There was discussion on whether sessions should be removed entirely or
not.  Pasha noted that iommu required a subsystem because it requires
cross file descriptor data during boot.

Jason suggested not expressing it as a subsystem with callbacks; the goal
is that the first thread that gets to serialize the iommu synchronously
creates the serialization data under a LUO lock, and the next thread gets
the serial data under a LUO lock and may make a little change.  We likely
don't want to track this as part of a subsystem abstraction.

Pasha said that with no callback, for iommu, what we'd want is a call into
LUO during boot to ask for the data.  Jason suggested this may be correct
but was focused more on the suspend side.  He said during probe we'd have
to ask for the early boot data if it exists.

Pasha was going to propose an RFC discussion upstream on the APIs for
this.

----->o-----
Andrey discussed the current status of his KSTATE work.  He wanted to
describe what should be preserved without requiring major subsystem
changes.  He suggested a description, plus common code that parses the
description and handles save and restore with versioning.

For example, for struct a, the struct kstate_description would include the
min_version_id that we can restore from.  It includes a state list and a
list of fields to preserve.

The KSTATE data format includes a magic number, state_id, version_id,
instance_id, and then the size of the data for preservation.  The states
are then repeated.  This includes field versioning; when data is added,
the version is bumped for the field.  This allows for making compatible
changes for the new kernel.

Jason was against the idea of throwing away data; the idea is that if the
old kernel included data then it would be wrong for the new kernel to then
throw it away.  Andrey suggested bumping the min_version_id.  Ben Chaney
suggested it will be useful for adding a new field that the old kernel did
not support.

Jason said that if data was changed, there would likely need to be a
significant code flow change associated with it; a recent example is the
vmalloc patch series that ended up being a significant change.

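To make the record layout described above a bit more concrete, here is a
rough userspace sketch in C.  It is not Andrey's code and not a proposed
kernel API: every demo_* name, the 32-bit field widths, and the magic
value are assumptions made up for illustration; only the order of the
header fields and the min_version_id idea come from the discussion.  The
walker refuses a whole buffer rather than skipping a record it cannot
restore, which is one possible reading of the concern about throwing
data away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical magic value; the real one was not given in the notes. */
#define DEMO_KSTATE_MAGIC 0x4b535445u

/* Hypothetical record header: field order per the notes, widths assumed. */
struct demo_kstate_hdr {
	uint32_t magic;
	uint32_t state_id;
	uint32_t version_id;
	uint32_t instance_id;
	uint32_t size;		/* payload bytes that follow this header */
};

/* Hypothetical description of one state as known to the new kernel. */
struct demo_kstate_desc {
	uint32_t state_id;
	uint32_t version_id;	 /* newest version understood (assumption) */
	uint32_t min_version_id; /* oldest version that can be restored */
};

/* Append one [header][payload] record at *off; -1 if it does not fit. */
int demo_kstate_put(uint8_t *buf, size_t cap, size_t *off,
		    uint32_t state_id, uint32_t version_id,
		    uint32_t instance_id, const void *data, uint32_t size)
{
	struct demo_kstate_hdr hdr = {
		.magic = DEMO_KSTATE_MAGIC,
		.state_id = state_id,
		.version_id = version_id,
		.instance_id = instance_id,
		.size = size,
	};

	assert(buf && off && *off <= cap);
	if (cap - *off < sizeof(hdr) || cap - *off - sizeof(hdr) < size)
		return -1;
	memcpy(buf + *off, &hdr, sizeof(hdr));
	if (size)
		memcpy(buf + *off + sizeof(hdr), data, size);
	*off += sizeof(hdr) + size;
	return 0;
}

/*
 * Walk repeated records.  Returns the number of records accepted, or -1
 * if any record is truncated, has a bad magic, has no description, or
 * has a version outside [min_version_id, version_id].
 */
int demo_kstate_walk(const uint8_t *buf, size_t len,
		     const struct demo_kstate_desc *descs, size_t ndescs)
{
	size_t off = 0;
	int count = 0;

	while (off < len) {
		const struct demo_kstate_desc *d = NULL;
		struct demo_kstate_hdr hdr;

		if (len - off < sizeof(hdr))
			return -1;
		memcpy(&hdr, buf + off, sizeof(hdr));
		off += sizeof(hdr);
		if (hdr.magic != DEMO_KSTATE_MAGIC || hdr.size > len - off)
			return -1;

		for (size_t i = 0; i < ndescs; i++)
			if (descs[i].state_id == hdr.state_id)
				d = &descs[i];
		if (!d || hdr.version_id < d->min_version_id ||
		    hdr.version_id > d->version_id)
			return -1;

		off += hdr.size;	/* a real restore would parse fields here */
		count++;
	}
	return count;
}
```

In this sketch, bumping min_version_id in a description makes the walker
reject records written at older versions instead of restoring them
partially; whether rejecting or converting is the right policy is exactly
the open question from the call.
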
----->o-----
Next meeting will be on Monday, October 20 at 8am PDT (UTC-7), everybody
is welcome: https://meet.google.com/rjn-dmzu-hgq

Topics for the next meeting:

 - follow up on fd dependency checking and this happening at the time of
   preserve rather than prepare (Option 1)
 - follow up on not relying on subsystems in LUO and the APIs on both
   sides of the live update for getting data needed
 - update on latest status of LUO and next steps for merge into akpm's
   tree
 - update on the status of stateless KHO RFC patches that should simplify
   LUO support
 - update on memfd preservation, vmalloc support, and 1GB limitation
 - discuss guest_memfd preservation use cases for Confidential Computing
   and any current work happening on it, including overlap with memfd
   preservation being worked on by Pratyush
   + discuss any use cases for Confidential Computing where folios may
     need to be split after being marked as preserved during brown out
 - [15 min] summarize upstream iommu persistence discussion and surface
   any misalignment
 - later: testing methodology to allow downstream consumers to qualify
   that live update works from one version to another
 - later: reducing blackout window during live update

Please let me know if you'd like to propose additional topics for
discussion, thank you!
