Re: [RFC 00/18] Pkernfs: Support persistence for live update

2024-02-07 Thread Jason Gunthorpe
On Wed, Feb 07, 2024 at 02:56:33PM +, Gowans, James wrote:
> 2. Tell VFIO to avoid mapping the memory in again after live update
> because it already exists.
> https://github.com/jgowans/qemu/commit/6e4f17f703eaf2a6f1e4cb2576d61683eaee02b0
> (the above flag should only be set *after* live update...).

Definitely no to that entire idea. It completely breaks how the memory
lifetime model works in iommufd.

iommufd has to re-establish its pins, and has to rebuild all its
mapping data structures. Otherwise it won't work correctly at all.

This is what I was saying in the other thread, you can't just ignore
fully restoring the iommu environment.

The end goal must be to have fully reconstituted iommufd with all its
maps, ioas's, and memory pins back to fully normal operation.

IMHO you need to focus on atomic replace where you go from the frozen
pkernfs environment to a live operating environment by hitlessly
replacing the IO page table in the HW. Ie going from an
IOMMU_DOMAIN_PKERNFS to an IOMMU_DOMAIN_PAGING owned by iommufd that
describes exactly the same translation.
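
As a rough userspace sketch of that atomic replace with the existing
iommufd/VFIO-cdev uAPI (the pkernfs-owned starting domain, and a
new_ioas_id that already holds identical mappings, are assumptions here,
not existing API):

/*
 * Sketch only: assumes the device currently sits on a pkernfs-owned
 * domain (the hypothetical IOMMU_DOMAIN_PKERNFS) and that new_ioas_id
 * already contains exactly the same IOVA->PFN mappings.
 */
#include <sys/ioctl.h>
#include <linux/iommufd.h>
#include <linux/vfio.h>

static int replace_with_paging_domain(int iommufd, int vfio_dev_fd,
				      __u32 dev_id, __u32 new_ioas_id)
{
	struct iommu_hwpt_alloc hwpt = {
		.size = sizeof(hwpt),
		.dev_id = dev_id,	/* from VFIO_DEVICE_BIND_IOMMUFD */
		.pt_id = new_ioas_id,	/* build a normal paging domain */
	};
	struct vfio_device_attach_iommufd_pt attach = {
		.argsz = sizeof(attach),
	};

	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &hwpt))
		return -1;

	/*
	 * Attaching to the new HWPT replaces the domain in HW. Because the
	 * translation is identical, ongoing DMA never sees a fault.
	 */
	attach.pt_id = hwpt.out_hwpt_id;
	return ioctl(vfio_dev_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach);
}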

"adopting" an entire io page table with unknown contents, and still
being able to correctly do map/unmap seems way too hard.

Jason



Re: [RFC 00/18] Pkernfs: Support persistence for live update

2024-02-07 Thread Jason Gunthorpe
On Wed, Feb 07, 2024 at 02:45:42PM +, Gowans, James wrote:
> Hi Jason,
> 
> Thanks for this great feedback on the approach - it's exactly the sort
> of thing we were looking for.
> 
> On Mon, 2024-02-05 at 13:42 -0400, Jason Gunthorpe wrote:
> > 
> > On Mon, Feb 05, 2024 at 12:01:45PM +, James Gowans wrote:
> > 
> > > The main aspect we’re looking for feedback/opinions on here is the
> > > concept of putting all persistent state in a single filesystem:
> > > combining guest RAM and IOMMU pgtables in one store. Also, the
> > > question of a hard separation between persistent memory and
> > > ephemeral memory, compared to allowing arbitrary pages to be
> > > persisted. Pkernfs does it via a hard separation defined at boot
> > > time, other approaches could make the carving out of persistent
> > > pages dynamic.
> > 
> > I think if you are going to attempt something like this then the end
> > result must bring things back to having the same data structures fully
> > restored.
> > 
> > It is fine that the pkernfs holds some persistent memory that
> > guarantees the IOMMU can remain programmed and the VM pages can become
> > fixed across the kexec
> > 
> > But once the VMM starts to restore itself we need to get back to the
> > original configuration:
> >  - A mmap that points to the VM's physical pages
> >  - An iommufd IOAS that points to the above mmap
> >  - An iommufd HWPT that represents that same mapping
> >  - An iommu_domain programmed into HW that the HWPT represents
> 
> (A quick note on iommufd vs VFIO: I'll still keep referring to VFIO for
> now because that's what I know, but will explore iommufd more and reply
> in more detail about iommufd in the other email thread.)
> 
> How much of this do you think should be done automatically, vs how much
> should userspace need to drive? With this RFC userspace basically re-
> drives everything, including re-injecting the file containing the
> persistent page tables into the IOMMU domain via VFIO.

My guess is that doing it fully automatically is hard/impossible as
there is lots and lots of related state that has to come back. Like
how do you get all the internal iommufd IOAS-related data structures
back automatically? Seems way too hard.

Feels simpler to have userspace redo whatever setup was needed to get
back to the right spot.

> Part of the reason is simplicity, to avoid having auto-deserialise code
> paths in the drivers and modules. Another part of the reason is so that
> userspace can get FD handles on the resources. Typically FDs are
> returned by doing actions like creating VFIO containers. If we make all
> that automatic then there needs to be some other mechanism for auto-
> restored resources to present themselves to userspace so that userspace
> can discover and pick them up again.

Right, there is lots of state all over the place that would be hard to
just re-materialize.

> Can you expand on what you mean by "A mmap that points to the VM's
> physical pages?" Are you suggesting that the QEMU process automatically
> gets something appearing in its address space? Part of the live update
> process involves potentially changing the userspace binaries: doing
> kexec and booting a new system is an opportunity to boot new versions of
> the userspace binary. So we shouldn't try to preserve too much of
> userspace state; it's better to let it re-create internal data
> structures and do fresh mmaps.

I expect the basic flow would be like:

 Starting kernel
   - run the VMM
   - Allocate VM memory in the pkernfs
   - mmap that VM memory
   - Attach the VM memory to KVM
   - Attach the VM memory mmap to IOMMUFD
   - Operate the VM

 Suspending the kernel
   - Stop touching iommufd
   - Freeze changes to the IOMMU, and move its working memory to pkernfs
   - Exit the kernel

 New kernel
   - Recover the frozen IOMMU back to partially running, like crash
 dump. Continue to use some of the working memory in the pkernfs
   - run the new VMM. Some IOMMU_DOMAIN_PKERNFS thing to represent
 this state
   - mmap the VM memory
   - Get KVM going again
   - Attach the new VMM's VM memory mmap to IOMMUFD
   - Replace the iommu partial configuration with a full configuration
   - Free the pkernfs iommu related memory
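
A minimal userspace sketch of the "Starting kernel" steps above, using
today's iommufd uAPI (the pkernfs mount point, file name, and size are
made-up placeholders; error handling is elided for brevity):

/*
 * Sketch only: map pkernfs-backed guest RAM into an iommufd IOAS.
 * The path and the 1 GiB size are hypothetical.
 */
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/iommufd.h>

#define GUEST_RAM_SIZE (1UL << 30)

int main(void)
{
	int ram_fd = open("/mnt/pkernfs/guest-ram", O_RDWR);
	int iommufd = open("/dev/iommu", O_RDWR);
	void *ram = mmap(NULL, GUEST_RAM_SIZE, PROT_READ | PROT_WRITE,
			 MAP_SHARED, ram_fd, 0);

	struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
	ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc);

	/* Pin the persistent pages and establish the IOVA mapping. */
	struct iommu_ioas_map map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_FIXED_IOVA |
			 IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
		.ioas_id = alloc.out_ioas_id,
		.user_va = (uintptr_t)ram,
		.length = GUEST_RAM_SIZE,
		.iova = 0,
	};
	ioctl(iommufd, IOMMU_IOAS_MAP, &map);

	/*
	 * From here: bind the VFIO cdev to iommufd, attach it to this IOAS,
	 * hand the same mmap to KVM as guest memory, and run the VM.
	 */
	return 0;
}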

> What I'm really asking is: do you have a specific suggestion about how
> these persistent resources should present themselves to userspace and
> how userspace can discover them and pick them up?

The only tricky bit in the above is having VFIO know it should leave
the iommu and PCI device state alone when the VFIO cdev is first
opened. Otherwise everything else is straightforward.

Presumably vfio would know it inherited a pkernfs blob and would do
the right stuff. Maybe some uAPI fussing there to handshake that
properly.

Once VFIO knows this it can operate iommufd to conserve the
IOMMU_DOMAIN_PKERNFS as well.
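
One possible shape for that handshake, as a sketch only: the
VFIO_DEVICE_FEATURE ioctl is real uAPI, but the LIVE_UPDATE_INHERIT
feature ID below is invented purely to illustrate an explicit opt-in.

/*
 * Hypothetical handshake sketch: VFIO_DEVICE_FEATURE exists, but
 * VFIO_DEVICE_FEATURE_LIVE_UPDATE_INHERIT does not.
 */
#include <sys/ioctl.h>
#include <linux/vfio.h>

#define VFIO_DEVICE_FEATURE_LIVE_UPDATE_INHERIT 42	/* not upstream */

static int inherit_persisted_state(int vfio_dev_fd)
{
	struct vfio_device_feature feat = {
		.argsz = sizeof(feat),
		.flags = VFIO_DEVICE_FEATURE_SET |
			 VFIO_DEVICE_FEATURE_LIVE_UPDATE_INHERIT,
	};

	/*
	 * Ask VFIO not to reset the device or rebuild the IOMMU attachment
	 * on open; it should adopt the pkernfs-inherited state instead.
	 */
	return ioctl(vfio_dev_fd, VFIO_DEVICE_FEATURE, &feat);
}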

> > Ie you can't just reboot and leave the IOMMU hanging out in some
> > undefined land - especially in latest kernels!
>
> Not too 

Re: [RFC 00/18] Pkernfs: Support persistence for live update

2024-02-07 Thread Gowans, James
On Mon, 2024-02-05 at 10:10 -0700, Alex Williamson wrote:
> > * Needing to drive and re-hydrate the IOMMU page tables by defining
> > an IOMMU file. Really we should move the abstraction one level up
> > and make the whole VFIO container persistent via a pkernfs file.
> > That way you’d "just" re-open the VFIO container file and all of the
> > DMA mappings inside VFIO would already be set up.
> 
> Note that the vfio container is on a path towards deprecation, this
> should be refocused on vfio relative to iommufd.  There would need to
> be a strong argument for a container/type1 extension to support this,
> iommufd would need to be the first class implementation.  Thanks,

Ack! When I first started putting pkernfs together, iommufd wasn't
integrated into QEMU yet, hence I stuck with VFIO for this PoC.
I'm thrilled to see that iommufd now seems to be integrated in QEMU!
Good opportunity to get to grips with it.

The VFIO-specific part of this patch is essentially ioctls on the
*container* to be able to:

1. Define persistent page tables (PPTs) on the container so that those
PPTs are used by the IOMMU domain and hence by all devices added to that
container.
https://github.com/jgowans/qemu/commit/e84cfb8186d71f797ef1f72d57d873222a9b479e

2. Tell VFIO to avoid mapping the memory in again after live update
because it already exists.
https://github.com/jgowans/qemu/commit/6e4f17f703eaf2a6f1e4cb2576d61683eaee02b0
(the above flag should only be set *after* live update...).
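
Roughly, point 2 amounts to the following on the type1 path (the
VFIO_DMA_MAP_FLAG_LIVE_UPDATE name and value here are placeholders for
illustration; the linked QEMU commit is the actual implementation):

/*
 * Hypothetical sketch of point 2: re-drive VFIO_IOMMU_MAP_DMA after live
 * update so VFIO rebuilds its tracking structures, while a flag tells it
 * the IOMMU page table entries already persist in pkernfs and must not
 * be written again. VFIO_DMA_MAP_FLAG_LIVE_UPDATE is an invented flag.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

#define VFIO_DMA_MAP_FLAG_LIVE_UPDATE (1 << 3)	/* not upstream */

static int remap_after_live_update(int container_fd, void *vaddr,
				   uint64_t iova, uint64_t size)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
			 VFIO_DMA_MAP_FLAG_LIVE_UPDATE,
		.vaddr = (uintptr_t)vaddr,
		.iova = iova,
		.size = size,
	};

	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}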

Do you have a rough suggestion about how something similar could be done with
iommufd?

JG



Re: [RFC 00/18] Pkernfs: Support persistence for live update

2024-02-07 Thread Gowans, James
Hi Jason,

Thanks for this great feedback on the approach - it's exactly the sort
of thing we were looking for.

On Mon, 2024-02-05 at 13:42 -0400, Jason Gunthorpe wrote:
> 
> On Mon, Feb 05, 2024 at 12:01:45PM +, James Gowans wrote:
> 
> > The main aspect we’re looking for feedback/opinions on here is the
> > concept of putting all persistent state in a single filesystem:
> > combining guest RAM and IOMMU pgtables in one store. Also, the
> > question of a hard separation between persistent memory and ephemeral
> > memory, compared to allowing arbitrary pages to be persisted. Pkernfs
> > does it via a hard separation defined at boot time, other approaches
> > could make the carving out of persistent pages dynamic.
> 
> I think if you are going to attempt something like this then the end
> result must bring things back to having the same data structures fully
> restored.
> 
> It is fine that the pkernfs holds some persistent memory that
> guarantees the IOMMU can remain programmed and the VM pages can become
> fixed across the kexec
> 
> But once the VMM starts to restore itself we need to get back to the
> original configuration:
>  - A mmap that points to the VM's physical pages
>  - An iommufd IOAS that points to the above mmap
>  - An iommufd HWPT that represents that same mapping
>  - An iommu_domain programmed into HW that the HWPT represents

(A quick note on iommufd vs VFIO: I'll still keep referring to VFIO for
now because that's what I know, but will explore iommufd more and reply
in more detail about iommufd in the other email thread.)

How much of this do you think should be done automatically, vs how much
should userspace need to drive? With this RFC userspace basically re-
drives everything, including re-injecting the file containing the
persistent page tables into the IOMMU domain via VFIO.

Part of the reason is simplicity, to avoid having auto-deserialise code
paths in the drivers and modules. Another part of the reason is so that
userspace can get FD handles on the resources. Typically FDs are
returned by doing actions like creating VFIO containers. If we make all
that automatic then there needs to be some other mechanism for auto-
restored resources to present themselves to userspace so that userspace
can discover and pick them up again.

One possible way to do this would be to populate a bunch of files in
procfs for each persisted IOMMU domain that allows userspace to discover
and pick it up.
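
For example (an entirely hypothetical layout, just to make the discovery
idea concrete), userspace could scan such a directory and re-open each
persisted domain by name:

/*
 * Hypothetical discovery sketch: the /proc/persisted-iommu-domains
 * directory does not exist anywhere; it only illustrates how userspace
 * could enumerate persisted domains and get an FD handle on each one.
 */
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>

static void discover_persisted_domains(void)
{
	DIR *dir = opendir("/proc/persisted-iommu-domains");
	struct dirent *ent;

	if (!dir)
		return;

	while ((ent = readdir(dir))) {
		if (ent->d_name[0] == '.')
			continue;
		/*
		 * Each entry names one persisted iommu_domain; opening it
		 * yields an FD that could be handed back to VFIO/iommufd.
		 */
		int fd = openat(dirfd(dir), ent->d_name, O_RDWR);
		printf("found persisted domain %s (fd %d)\n", ent->d_name, fd);
	}
	closedir(dir);
}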

Can you expand on what you mean by "A mmap that points to the VM's
physical pages?" Are you suggesting that the QEMU process automatically
gets something appearing in its address space? Part of the live update
process involves potentially changing the userspace binaries: doing
kexec and booting a new system is an opportunity to boot new versions of
the userspace binary. So we shouldn't try to preserve too much of
userspace state; it's better to let it re-create internal data
structures and do fresh mmaps.

What I'm really asking is: do you have a specific suggestion about how
these persistent resources should present themselves to userspace and
how userspace can discover them and pick them up?

> 
> Ie you can't just reboot and leave the IOMMU hanging out in some
> undefined land - especially in latest kernels!

Not too sure what you mean by "undefined land" - the idea is that the
IOMMU keeps doing what it was doing until userspace comes along and re-
creates the handles to the IOMMU, at which point it can do modifications
like changing mappings or tearing the domain down. This is what deferred
attach gives us, I believe, and why I had to change it to be enabled.
Just leave the IOMMU domain alone until userspace re-creates it with the
original tables.
Perhaps I'm missing your point. :-)

> 
> For vt-d you need to retain the entire root table and all the required
> context entries too. The restarting iommu needs to understand that it
> has to "restore" a temporary iommu_domain from the pkernfs.
> You can later reconstitute a proper iommu_domain from the VMM and
> atomic switch.

Why does it need to go via a temporary domain? The current approach is
just to leave the IOMMU domain running as-is via deferred attach, and
later when userspace starts up it will create the iommu_domain backed by
the same persistent page tables.
> 
> So, I'm surprised to see this approach where things just live forever
> in the kernfs, I don't see how "restore" is going to work very well
> like this.

Can you expand on why the suggested restore path will be problematic? In
summary the idea is to re-create all of the "ephemeral" data structures
by re-doing ioctls like MAP_DMA, but keeping the persistent IOMMU
root/context tables pointed at the original persistent page tables. The
ephemeral data structures are re-created in userspace but the persistent
page tables left alone. This is of course dependent on userspace re-
creating things *correctly* - it can easily do the wrong thing. Perhaps
this is the issue? Or is there a problem 

Re: [PATCH kexec-tools] workflow: update to use checkout@v4

2024-02-07 Thread Simon Horman
On Fri, Feb 02, 2024 at 09:55:41AM +0100, Simon Horman wrote:
> Update to use checkout@v4.
> 
> This addresses the following warning that appears in GitHub runs:
> 
> "Node.js 16 actions are deprecated. Please update the following actions
>  to use Node.js 20: actions/checkout@v3. For more information
>  see: 
> https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
> 
> Link: https://github.com/horms/kexec-tools/actions/runs/7753454923
> Signed-off-by: Simon Horman 

Applied.

- workflow: update to use checkout@v4
  
https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/commit/?id=6ee2ac1bf739

