Re: [Qemu-devel] vfio-pci issues with multiple devices on the same root port

Peter Lieven Sat, 13 Dec 2014 12:45:27 -0800

On 12.12.2014 at 23:21, Alex Williamson wrote:
> On Fri, 2014-12-12 at 22:38 +0100, Peter Lieven wrote:
>> Hi,
>>
>> we have a Cisco UCS infrastructure where we have fnic Fibre Channel adapters
>> that we expose to guests. The UCS infrastructure allows creating virtual HBAs
>> that can be exposed to a host, so it's possible to have quite a lot of them.
>>
>> We ran into a strange issue when we started having more than one vServer
>> with a Fibre Channel adapter passed through with vfio-pci.
>>
>> When a hypervisor shuts it down, the kernel sees the following error:
>>
>>  pcieport 0000:00:07.0: AER: Uncorrected (Non-Fatal) error received: id=0038
>>  pcieport 0000:00:07.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0038(Receiver ID)
>>  pcieport 0000:00:07.0:   device [8086:340e] error status/mask=00200000/00100000
>>  pcieport 0000:00:07.0:    [21] Unknown Error Bit (First)
>>  pcieport 0000:00:07.0: broadcast error_detected message
>>  pcieport 0000:00:07.0: AER: Device recovery failed
>>
>> Bit 21 seems to be ACS Violation, and 0000:00:07.0 is the PCIe root port on
>> that system.
>>
>> This wouldn't be a big problem, although I would like to find out what
>> causes the ACS Violation.
>>
>> The real problem is that all other vfio-pci cards on that root port get 
>> notified of this error and the connected vServers are suspended
>> with RUN_STATE_INTERNAL_ERROR.
>>
>> Any ideas to work around this other than hacking qemu to not register an 
>> error handler or modifying vfio_err_notifier_handler
>> to not suspend the vServer?
> You could set bit 21 in the AER uncorrected error mask register to avoid
> the root port signaling the error.  Is bit 21 already clear in the
> severity register to make this non-fatal?
>
>> Is it correct that all children of a root port are notified? Should qemu 
>> distinguish between fatal and non-fatal errors when
>> suspending a vServer?
> Yes, each child is notified.  QEMU only gets an eventfd signal, which is
> supposed to occur only for fatal errors.  I don't quite understand why
> this apparently non-fatal error is getting through.  The kernel-side
> VFIO code is where filtering of fatal vs non-fatal should occur.

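As a side note on the status/mask values in the log quoted above: they can be checked against the bit layout of the AER uncorrectable error registers. The following is only a minimal sketch, assuming the bit positions as I know them from the PCIe spec (mirrored by the PCI_ERR_UNC_* definitions in Linux). The table and both helper names are hypothetical and made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit positions in the AER Uncorrectable Error Status, Mask and Severity
 * registers (the three registers share one layout). Assumed from the PCIe
 * spec; this table is illustrative, not copied from any driver. */
struct aer_unc_bit {
        unsigned int bit;
        const char *name;
};

static const struct aer_unc_bit aer_unc_bits[] = {
        {  4, "Data Link Protocol Error" },
        {  5, "Surprise Down" },
        { 12, "Poisoned TLP" },
        { 13, "Flow Control Protocol Error" },
        { 14, "Completion Timeout" },
        { 15, "Completer Abort" },
        { 16, "Unexpected Completion" },
        { 17, "Receiver Overflow" },
        { 18, "Malformed TLP" },
        { 19, "ECRC Error" },
        { 20, "Unsupported Request" },
        { 21, "ACS Violation" },
};

/* Hypothetical helper: name of an uncorrectable error bit, or NULL if the
 * bit is not in the table above. */
static const char *aer_unc_bit_name(unsigned int bit)
{
        size_t i;

        for (i = 0; i < sizeof(aer_unc_bits) / sizeof(aer_unc_bits[0]); i++)
                if (aer_unc_bits[i].bit == bit)
                        return aer_unc_bits[i].name;
        return NULL;
}

/* Hypothetical helper: nonzero if 'bit' is set in 'status' and not
 * suppressed by 'mask'. */
static int aer_unc_is_reported(uint32_t status, uint32_t mask,
                               unsigned int bit)
{
        uint32_t b = UINT32_C(1) << bit;

        return (status & b) != 0 && (mask & b) == 0;
}
```

With status/mask=00200000/00100000 from the log, bit 21 is set in the status and only bit 20 is masked, so the error is reported. With bit 21 also set in the mask, as suggested above, aer_unc_is_reported() returns 0 for the same status value.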

I had a look at vfio-pci.c from master. I can't see any filtering of fatal
vs. non-fatal errors there:

static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
                                                  pci_channel_state_t state)
{
        struct vfio_pci_device *vdev;
        struct vfio_device *device;

        device = vfio_device_get_from_dev(&pdev->dev);
        if (device == NULL)
                return PCI_ERS_RESULT_DISCONNECT;

        vdev = vfio_device_data(device);
        if (vdev == NULL) {
                vfio_device_put(device);
                return PCI_ERS_RESULT_DISCONNECT;
        }

        mutex_lock(&vdev->igate);

        if (vdev->err_trigger)
                eventfd_signal(vdev->err_trigger, 1);

        mutex_unlock(&vdev->igate);

        vfio_device_put(device);

        return PCI_ERS_RESULT_CAN_RECOVER;
}

static struct pci_error_handlers vfio_err_handlers = {
        .error_detected = vfio_pci_aer_err_detected,
};

static struct pci_driver vfio_pci_driver = {
        .name           = "vfio-pci",
        .id_table       = NULL, /* only dynamic ids */
        .probe          = vfio_pci_probe,
        .remove         = vfio_pci_remove,
        .err_handler    = &vfio_err_handlers,
};
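
To illustrate what such a filter could look like, here is a user-space model of the handler above, not a patch. It assumes that the AER core calls error_detected with pci_channel_io_normal for non-fatal errors and with pci_channel_io_frozen for fatal ones, which is my understanding of the kernel's AER documentation. The enums are stand-ins for the kernel definitions, and struct model_vdev and model_err_detected are hypothetical names. Whether silently ignoring non-fatal errors is the right policy is a separate question.

```c
#include <stddef.h>

/* Stand-ins for the kernel's channel states and result codes, so that the
 * sketch builds outside the kernel tree. */
typedef enum {
        pci_channel_io_normal = 1,       /* I/O still works: non-fatal */
        pci_channel_io_frozen = 2,       /* I/O blocked: fatal */
        pci_channel_io_perm_failure = 3  /* device is gone for good */
} pci_channel_state_t;

typedef enum {
        PCI_ERS_RESULT_CAN_RECOVER = 2,
        PCI_ERS_RESULT_DISCONNECT = 4
} pci_ers_result_t;

/* Hypothetical model of the per-device state used by the handler. */
struct model_vdev {
        int err_trigger_armed;  /* user space registered an error eventfd */
        unsigned int signals;   /* how often the eventfd would have fired */
};

/* Same shape as vfio_pci_aer_err_detected(), but user space is only
 * notified when the channel state is not "normal". */
static pci_ers_result_t model_err_detected(struct model_vdev *vdev,
                                           pci_channel_state_t state)
{
        if (vdev == NULL)
                return PCI_ERS_RESULT_DISCONNECT;

        /* Assumed filter: skip the notification for non-fatal errors. */
        if (state != pci_channel_io_normal && vdev->err_trigger_armed)
                vdev->signals++;

        return PCI_ERS_RESULT_CAN_RECOVER;
}
```

In this model a call with pci_channel_io_normal leaves signals unchanged, while a call with pci_channel_io_frozen increments it, so the other guests on the root port would not be stopped by the non-fatal ACS Violation.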

Peter
