Hi everybody,

Here are the notes from the last Hypervisor Live Update call, which happened on Monday, June 2. Thanks to everybody who was involved!
These notes are intended to bring people up to speed who could not attend the call, as well as to keep the conversation going in between meetings.

----->o-----

We chatted about LUO v2 and the feedback on the upstream mailing list; there were also some functionality changes proposed for KHO. After the comments for LUO v2 are addressed, Pasha noted he will send the series to Pratyush, who will add memfd preservation on top for LUO v3. This will then be sent as a complete package. I asked if Jason had had the opportunity to take a look at it; he had put it aside but not looked at it in detail yet. Pratyush planned on doing the review of LUO this week. He also noted that there have been some improvements to memfd preservation since the last biweekly and that it is in good shape; he felt it was ready for integration into the LUO series itself. A lot of testing was also added for libluo, which is suggested for inclusion in the kernel tree itself under tools/. David Matlack noted he was happy to see the test binaries moving into the kernel and into selftests; David's VFIO tests would also need some of this functionality.

----->o-----

I asked about the entire scope of libluo and forward-looking thoughts for it. Pratyush noted it is currently only a very thin wrapper on top of the ioctls. This could be expanded for more orchestration if necessary in the future, but there are no immediate plans; right now it is just a set of ioctls, tests, and a command-line tool. Pasha noted one extension could be asynchronous fd preservation, which might be done in the future with new ioctls and could also become a libluo extension.

----->o-----

Jason asked if the group was comfortable with the current LUO v2 design or if there were any major concerns: we have the ioctls and the file descriptors, it also eliminates the fdbox work, it reduces the need for guestmemfs, it changes KHO, etc. David Matlack asked about the requirement of CAP_SYS_ADMIN to preserve fd's; Jason suggested using fd permissions instead. Pratyush suggested there were some complexities with fd permissions in the next kernel, and Pasha echoed that it was likely safer to require CAP_SYS_ADMIN, at least for now. David expressed concern about a malicious process potentially being able to take over an fd unless root permissions were required. Jason said a somewhat different angle here is for the character device itself, where ioctl permissions are typically protected by the fd itself; we may not need a capability check on top of that. Jason suggested a broker agent that would make it possible to enforce policy in userspace: it would do the fd passing, issuing the ioctl to the kernel and then passing the fd to the less privileged entities. Filesystem permissions are much more flexible than capabilities. I asked about future use cases where we may want capability checks; Pasha noted we want to be able to change the global state for the prepare stage and asked if that should be capability protected. David suggested it could potentially be a separate character device. If this was needed later, it could be an incremental add-on.
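Since the broker was only described in prose on the call, here is a minimal sketch (not shown during the meeting) of the fd-passing half of such an agent, using the standard SCM_RIGHTS mechanism over a Unix domain socket. How the broker obtains the live update fd in the first place is deliberately left out, since the LUO uAPI itself is still under review.

/*
 * Sketch of a broker handing a privileged fd to a less privileged peer
 * over a Unix domain socket.  The broker applies whatever policy it
 * wants in userspace before calling this; the fd travels as ancillary
 * data (SCM_RIGHTS) alongside one dummy payload byte.
 */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_fd(int unix_sock, int fd_to_pass)
{
        char cbuf[CMSG_SPACE(sizeof(int))] = { 0 };
        char dummy = 'x';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        struct msghdr msg = { 0 };
        struct cmsghdr *cmsg;

        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cbuf;
        msg.msg_controllen = sizeof(cbuf);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

        /* One byte of payload carries the fd as ancillary data. */
        return sendmsg(unix_sock, &msg, 0) == 1 ? 0 : -1;
}

Whatever policy the broker enforces before calling send_fd() lives entirely in userspace, which is the flexibility being argued for over a blanket CAP_SYS_ADMIN check.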
----->o-----

Pasha brought up a topic about the lifecycle of a file descriptor if a process dies or quits. When the VMM is running, it can add an fd to LUO, but what happens if the VMM exits? Do we explicitly remove it from the preservation before going into the finish state? He suggested that it should be automatically removed: we should never preserve an fd for a process that has exited. Jason suggested the kernel should not be doing that, and that this is one reason we may want a security domain: the broker agent could cancel all state associated with that process. Pasha asked what would happen if the agent itself dies; Jason suggested the kernel should fully clean everything up, and Pasha acknowledged this was the current plan. Praveen Kumar asked how we would maintain state for a graceful shutdown when the application is in the preservation state. Pasha said that a graceful shutdown and a non-graceful shutdown are identical from a kernel perspective; the only difference is whether the shutdown happened before the prepared state. If before, the fd is not preserved; if after, serialization has already been done and it is preserved, so if the resource remains unclaimed it is cleaned up in the new kernel. Pasha noted that once we have passed prepare, we are in the critical path to a live update and are not going to continue running in this state. It should be valid for the agent to exit in this state because we cannot add new entries to the preservation list (the agent has nothing to do after this). Pratyush said that with LUO, when you preserve an fd you get a token and the token must be saved; the agent would grab this token so the handle is not lost. Jason said that once we terminate, we lose the ability to undo because the session is lost; Pratyush said that we can undo with the preservation tokens. Jason suggested that if you close the fd, the kernel should clean up everything associated with the live update. Pasha asked how we would make sure the agent is not killed before the reboot; systemd may make this more complicated. He suggested that if the agent is killed or exits during the prepared phase, we cannot undo and have to reboot.
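For those who have not read the series yet, the preserve/restore flow Pratyush described looks roughly like the sketch below. The device path is omitted, and the ioctl names, numbers, and structure here are placeholders for illustration only, not the real LUO uAPI (which is exactly what is being reviewed in v2).

/*
 * Illustration only: LUO_PRESERVE_FD, LUO_RESTORE_FD and struct
 * luo_preserve are made-up placeholders, not the actual uAPI.
 */
#include <stdint.h>
#include <sys/ioctl.h>

struct luo_preserve {
        int      fd;        /* fd to preserve, e.g. a memfd */
        uint64_t token;     /* handle that survives the kexec */
};

#define LUO_PRESERVE_FD _IOWR('L', 0x01, struct luo_preserve)  /* placeholder */
#define LUO_RESTORE_FD  _IOWR('L', 0x02, struct luo_preserve)  /* placeholder */

/* Before kexec: hand the fd to LUO and persist the returned token. */
static uint64_t preserve_fd(int luo_fd, int fd)
{
        struct luo_preserve p = { .fd = fd };

        if (ioctl(luo_fd, LUO_PRESERVE_FD, &p) < 0)
                return 0;
        return p.token;     /* the agent must not lose this */
}

/* After kexec, in the new kernel: trade the token back for an fd. */
static int restore_fd(int luo_fd, uint64_t token)
{
        struct luo_preserve p = { .token = token };

        if (ioctl(luo_fd, LUO_RESTORE_FD, &p) < 0)
                return -1;
        return p.fd;
}

The point being made is that the token, rather than the process that created it, is what keeps the preserved resource reachable: the agent saves the token, the resource can be claimed with it after the kexec, and anything left unclaimed is cleaned up in the new kernel.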
----->o-----

David Matlack expressed a concern that the flow as described would fall apart for KVM, since KVM fds cannot be transferred across processes as they have a lot of state associated with the mm struct of the owning process. We'd have to dig into why this isn't allowed as a potential extension. Jason suggested an alternative would be an additional fd that has a container property and can only do fd save and restore within its own container. KVM would have to do its serialization outside of its original process and then recreate its context after the kexec. Jason said the KVM fd would need to be preserved because it is threaded through all the VFIO and IOMMU subsystems. Pasha said we would only preserve the amount of information needed to recreate the VMs.

----->o-----

Next meeting will be on Monday, June 16 at 8am PDT (UTC-7), everybody is welcome: https://meet.google.com/rjn-dmzu-hgq

Topics for the next meeting:

 - discuss the current status of LUO with memfd preservation and any blockers for upstream merge
 - discuss the userspace broker agent that is responsible for the fd's, the ioctls, and the state machine that interacts with LUO, and any potential open sourcing opportunities
 - determine timelines for a selftest framework for live updates, which could be a significant amount of work
 - check on the status of VFIO selftests that will be useful for automated testing of device preservation
 - discuss forking off a discussion on iommu and live update that is separate from Hypervisor Live Update (due to scheduling constraints) but includes Jason and interested parties
 - June 30: update on the physical pool allocator that can be used to provide pages for hugetlb, guest_memfd, and memfds
 - later: testing methodology to allow downstream consumers to qualify that live update works from one version to another
 - later: reducing the blackout window during live update

Please let me know if you'd like to propose additional topics for discussion, thank you!