Re: [PATCH 4/5] pci: reset: follow more closely PCIe's spec for function-level reset

Jan Kiszka Sun, 14 Jan 2018 09:00:57 -0800

On 2018-01-12 22:06, Gustavo Lima Chaves wrote:
> * Jan Kiszka <[email protected]> [2018-01-12 13:13:25 +0100]:
> 
>> On 2018-01-12 01:22, Gustavo Lima Chaves wrote:
>>> Try our best to follow more closely what the specification mandates for
>>> function-level reset (section 6.6.2 of PCI Express Base Specification,
>>> Revision 3.0). The previous approach was writing blindly on the command
>>> register (setting int. disable bit), but that would lead to an
>>> unsupported request on functions that do not expect that right away.
>>>
>>> We now do checks for FLR support, proceed to the previous command
>>> register write, wait for any pending transactions for some cycles (used
>>> as a placeholder for the suggested 100ms timeout), raise the 'initiate
>>> FLR' bit on the right register, wait on the same kind of loop again and,
>>> finally, clear the transactions pending bit forcefully, as mandated.
>>>
>>> This should leave us with better support for PCI devices out there.
>>>
>>> Signed-off-by: Gustavo Lima Chaves <[email protected]>
>>> ---
>>>  hypervisor/include/jailhouse/pci.h |  5 +++
>>>  hypervisor/pci.c                   | 82 
>>> +++++++++++++++++++++++++++++++++++---
>>>  2 files changed, 82 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/hypervisor/include/jailhouse/pci.h 
>>> b/hypervisor/include/jailhouse/pci.h
>>> index 6d256613..00d52099 100644
>>> --- a/hypervisor/include/jailhouse/pci.h
>>> +++ b/hypervisor/include/jailhouse/pci.h
>>> @@ -40,7 +40,12 @@
>>>  #define PCI_CAP_EXPRESS    0x10
>>>
>>>  #define PCI_CAP_PCIE               (0x10 | JAILHOUSE_PCI_EXT_CAP)
>>> +#define PCIE_DEV_CAPS_REG  0x04
>>> +# define PCI_DEV_CAPS_FLR  (1 << 28)
>>>  #define PCIE_CONTROL_REG   0x08
>>> +# define PCI_CONTROL_INIT_FLR      (1 << 15)
>>> +#define PCIE_STATUS_REG            0x0a
>>> +# define PCI_STATUS_PENDING        (1 << 5)
>>>
>>>  #define PCIE_AER_FLAGS     (PCIE_DEVCTL_CERE | PCIE_DEVCTL_NFERE | \
>>>                      PCIE_DEVCTL_FERE | PCIE_DEVCTL_URRE)
>>> diff --git a/hypervisor/pci.c b/hypervisor/pci.c
>>> index 30b733c9..f74f6399 100644
>>> --- a/hypervisor/pci.c
>>> +++ b/hypervisor/pci.c
>>> @@ -21,6 +21,8 @@
>>>
>>>  #define MSIX_VECTOR_CTRL_DWORD             3
>>>
>>> +#define PCI_RESET_TIMEOUT 250
>>
>> Where was this value derived from? What is its unit (it's apparently
>> "loops" or "cycles")?
> 
> The spec suggests:
> 
>   "After an FLR has been initiated by writing a 1b to the Initiate
>   Function Level Reset bit, the Function must complete the FLR within
>   100 ms. If software initiates an FLR when the Transactions Pending
>   bit is 1b, then software must not initialize the Function until
>   allowing adequate time for any associated Completions to arrive, or
>   to achieve reasonable certainty that any remaining Completions will
>   never arrive. For this purpose, it is recommended that software
>   allow as much time as provided by the pre-FLR value for Completion
>   Timeout on the device. If Completion Timeouts were disabled on the
>   Function when FLR was issued, then the delay is system dependent but
>   must be no less than 100 ms."
> 
> Also:
> 
>   "1. Software that's performing the FLR synchronizes with other
>   software that might potentially access the Function directly, and
>   ensures such accesses do not occur during this algorithm.
> 
>   2. Software clears the entire Command register, disabling the
>   Function from issuing any new Requests.
> 
>   3. Software polls the Transactions Pending bit in the Device Status
>   register either until it is clear or until it has been long enough
>   that software is reasonably certain that Completions associated with
>   any remaining outstanding Transactions will never arrive. On many
>   platforms, the Transactions Pending bit will usually clear within a
>   few milliseconds, so software might choose to poll during this
>   initial period using a tight software loop. On rare cases when the
>   Transactions Pending bit does not clear by this time, software will
>   need to poll for a much longer platform-specific period (potentially
>   seconds), so software might choose to conduct this polling using a
>   timer-based interrupt polling mechanism.
> 
>   4. Software initiates the FLR.
> 
>   5. Software waits 100 ms.
> 
>   6. Software reconfigures the Function and enables it for normal
>   operation."
> 
> Since we don't have timers at hypervisor level, I though it would be
> reasonable to start with a cycles count on some arbitrary but fair
> amount. Suggestions?


Busy-waiting for hundreds of millisecond (however we measure them) is
not a good thing. If that is really required, we should at least issue
all commands to all affected functions in parallel. What is the normal
overall time needed to complete a function reset?

Measuring these periods without a time base, like we have right now, is
challenging. We avoided introducing a time base for the hypevisor-guest
communication protocol by letting the user specify a loop-based timeout
per machine, to compensate the different machine speeds. Hardcoding
loop-based timeouts is not the way to go unless you have a more or less
known delay in the loop body (like a PCI access).

> 
>>
>> Style: indention.
>>
>>> +
>>>  #define for_each_configured_pci_device(dev, cell)                  \
>>>     for ((dev) = (cell)->pci_devices;                               \
>>>          (dev) - (cell)->pci_devices < (cell)->config->num_pci_devices; \
>>> @@ -618,10 +620,13 @@ void pci_prepare_handover(void)
>>>     }
>>>  }
>>>
>>> -void pci_reset_device(struct pci_device *device)
>>> +static void pci_reset_device_internal(struct pci_device *device)
>>>  {
>>>     const struct jailhouse_pci_capability *cap;
>>> +   unsigned timeout = PCI_RESET_TIMEOUT;
>>> +   bool found = false;
>>>     unsigned int n;
>>> +   u32 val;
>>>
>>>     memset(&device->msi_registers, 0, sizeof(device->msi_registers));
>>>     for (n = 0; n < device->info->num_msix_vectors; n++) {
>>> @@ -635,13 +640,81 @@ void pci_reset_device(struct pci_device *device)
>>>             return;
>>>     }
>>>
>>> +   for_each_pci_cap(cap, device, n) {
>>> +         if (cap->id == PCI_CAP_PCIE) {
>>> +               found = true;
>>> +               break;
>>> +         }
>>> +   }
>>> +
>>> +   if (!found)
>>> +           return;
>>
>> Another use case for pci_get_cap().
> 
> Done.
> 
>>
>>> +
>>> +   val = pci_read_config(device->info->bdf,
>>> +                         cap->start + PCIE_DEV_CAPS_REG, 4);
>>> +
>>> +   if ((val & PCI_DEV_CAPS_FLR) != PCI_DEV_CAPS_FLR) {
>>> +           printk("Skipping function level reset for "
>>> +                  "device %02x:%02x.%x: no support\n",
>>> +                  PCI_BDF_PARAMS(device->info->bdf));
>>> +           return;
>>> +   }
>>> +
>>>     /*
>>> -    * Silence INTx of the physical device by setting the mask bit.
>>> -    * This is a deviation from the specified reset state.
>>> +    * "When a 0 is written to this register, the device is
>>> +    * logically disconnected from the PCI bus for all accesses
>>> +    * except configuration accesses."
>>> +    *
>>> +    * The bit that we set on that mask will silence INTx of the
>>> +    * physical device.
>>>      */
>>>     pci_write_config(device->info->bdf, PCI_CFG_COMMAND,
>>>                      PCI_CMD_INTX_OFF, 2);
>>>
>>> +   /*
>>> +    * Wait for any pending transaction to end. Ideally we should
>>> +    * count for 100ms, but we count arbitrary loops for now
>>> +    * instead, by lack of timer notion in the hypervisor.
>>> +    */
>>> +   do {
>>> +           val = pci_read_config(device->info->bdf,
>>> +                                 cap->start + PCIE_STATUS_REG, 2);
>>> +           cpu_relax();
>>> +           timeout--;
>>> +   } while ((val & PCI_STATUS_PENDING) == PCI_STATUS_PENDING
>>> +            || timeout);
>>
>> while (val & PCI_STATUS_PENDING || timeout > 0)
>>
>> But what if timeout reaches 0? Just ignore STATUS_PENDING??
> 
> Just like the spec suggests, yes, we may abort if the devices takes
> too long (that should not happen in first place).

Abort what? Waiting for the device or even talking to it at all (i.e.
failing it)?

> 
>>
>>> +
>>> +   /*
>>> +    * Set "initiate function level reset" bit and wait a little
>>> +    * bit more.
>>> +    */
>>> +   val = pci_read_config(device->info->bdf,
>>> +                         cap->start + PCIE_CONTROL_REG, 2);
>>
>> Redundant, we have the current value already.
> 
> From control reg? Where?

Sorry, confused status and control reg.

> 
>>
>>> +   pci_write_config(device->info->bdf, cap->start + PCIE_CONTROL_REG,
>>> +                    val & PCI_CONTROL_INIT_FLR, 2);
>>> +   timeout = PCI_RESET_TIMEOUT;
>>> +   while (timeout) {
>>> +           cpu_relax();
>>> +           timeout--;
>>> +   }
>>
>> These two timeout loops have totally different time spans because one is
>> throttled by PCI accesses while the other is just a trivial counter. I
>> suspect we can either drop that second loop completely or have to
>> properly throttle it as well.
>>
>> What does the spec say here?
> 
> You're right about the timing differences. What do you suggest to make
> them more equal? Non used cap reads inside the second one? Maybe I'll
> read the status reg inside it already.

If we can derive some reliable time base from PCI accesses, that could
do the trick.

> 
>>
>>> +
>>> +   /*
>>> +    * "The Transactions Pending bit must be cleared upon completion
>>> +    * of the FLR."
>>> +    */
>>> +   val = pci_read_config(device->info->bdf,
>>> +                         cap->start + PCIE_STATUS_REG, 2);
>>
>> Again: don't we have the current value already?
> 
> The status could have changed. I'll do as stated in the last comment,
> for now.
> 
>>
>>> +   pci_write_config(device->info->bdf, cap->start + PCIE_STATUS_REG,
>>> +                    val & ~PCI_STATUS_PENDING, 2);
>>> +}
>>> +
>>> +void pci_reset_device(struct pci_device *device)
>>> +{
>>> +   const struct jailhouse_pci_capability *cap;
>>> +   unsigned int n;
>>> +
>>> +   pci_reset_device_internal(device);
>>> +
>>>     for_each_pci_cap(cap, device, n) {
>>>             if (cap->id == PCI_CAP_MSI || cap->id == PCI_CAP_MSIX)
>>>                     /* Disable MSI/MSI-X by clearing the control word. */
>>> @@ -937,7 +1010,6 @@ void pci_shutdown(void)
>>>                             pci_restore_msix(device, cap);
>>>
>>>             if (device->cell != &root_cell)
>>> -                   pci_write_config(device->info->bdf, PCI_CFG_COMMAND,
>>> -                                    PCI_CMD_INTX_OFF, 2);
>>> +                   pci_reset_device_internal(device);
>>>     }
>>>  }
>>>
>>
>> Jan
>>
>> --
>> Siemens AG, Corporate Technology, CT RDA IOT SES-DE
>> Corporate Competence Center Embedded Linux
>>
>> --
>> You received this message because you are subscribed to the Google Groups 
>> "Jailhouse" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> For more options, visit https://groups.google.com/d/optout.
> 
> --
> Gustavo Lima Chaves
> Intel - Open Source Technology Center
> 

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux

-- 
You received this message because you are subscribed to the Google Groups 
"Jailhouse" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Re: [PATCH 4/5] pci: reset: follow more closely PCIe's spec for function-level reset

Reply via email to