On 10/17/22 14:05, Daniel P. Berrangé wrote:
> On Mon, Oct 17, 2022 at 01:54:03PM +0300, Andrey Ryabinin wrote:
>> These patches add possibility to pass VFIO device to QEMU using file
>> descriptors of VFIO container/group, instead of creating those by QEMU.
>> This allows to take away permissions to open /dev/vfio/* from QEMU and
>> delegate that to managment layer like libvirt.
>>
>> The VFIO API doen't allow to pass just fd of device, since we also need to 
>> have
>> VFIO container and group. So these patches allow to pass created VFIO 
>> container/group
>> to QEMU via command line/QMP, e.g. like this:
>>             -object vfio-container,id=ct,fd=5 \
>>             -object vfio-group,id=grp,fd=6,container=ct \
>>             -device vfio-pci,host=05:00.0,group=grp
>>
>> A bit more detailed example can be found in the test:
>> tests/avocado/vfio.py
>>
>>  *Possible future steps*
>>
>> Also these patches could be a step for making local migration (within one 
>> host)
>> of the QEMU with VFIO devices.
>> I've built some prototype on top of these patches to try such idea.
>> In short the scheme of such migration is following:
>>  - migrate source VM to file.
>>  - retrieve fd numbers of VFIO container/group/device via new property and 
>> qom-get command
>>  - get the actual file descriptor via SCM_RIGHTS using new qmp command 
>> 'returnfd' which
>>    sends fd from QEMU by the number: { 'command': 'returnfd', 'data': {'fd': 
>> 'int'}}
>>  - shutdown source VM
>>  - launch destination VM, plug VFIO devices using obtained file descriptors.
>>  - PCI device reset duriing plugging the device avoided with the help of new 
>> parameter
>>     on vfio-pci device.
> 
> Is there a restriction by VFIO on how many processes can have the FD
> open concurrently ? I guess it must be, as with SCM_RIGHTS, both src
> QEMU and libvirt will have the FD open concurrently for at least a
> short period, as you can't atomically close the FD at the exact same
> time as SCM_RIGHTS sends it.
> 

There is no such restriction. Several opened descriptors is what allows us to 
survive
PCI device reset. The kernel reset device on the first open.
Obviously we shouldn't use these descriptors concurrently and can't in many 
cases
(ioctl()s will fail), but there is no problem with just saving/passing FD 
between processes.


> With migration it is *highly* desirable to never stop the source VM's
> QEMU until the new QEMU has completed migration and got its vCPUs
> running, in order to have best chance of successful rollback upon
> failure
> 
> So assuming both QEMU's can have the FD open, provided they don't
> both concurrently operate on it, could src QEMU just pass the FDs
> to the target QEMU as part of the migration stream. eg use a UNIX
> socket between the 2 QEMUs, and SCM_RIGHTS to pass the FDs across,
> avoiding libvirt needing to be in the middle of the FD passing
> dance. Since target QEMU gets the FDs as part of the migration
> stream, it would inherantly know that it shold skip device reset
> in that flow, without requiring any new param.
> 

Yeah, I had similar idea, but this would require a lot of rework of VFIO 
initialization
phase in QEMU. The main problem here is all initialization happens on device 
addition, which will fail
if device already used by another QEMU. I guess we would  need to move lot's of 
initialization to the
->post_load() hook.

Reply via email to