On 2018-01-12 22:06, Gustavo Lima Chaves wrote:
> * Jan Kiszka <[email protected]> [2018-01-12 13:13:25 +0100]:
>
>> On 2018-01-12 01:22, Gustavo Lima Chaves wrote:
>>> Try our best to follow more closely what the specification mandates for
>>> function-level reset (section 6.6.2 of PCI Express Base Specification,
>>> Revision 3.0). The previous approach was writing blindly on the command
>>> register (setting int. disable bit), but that would lead to an
>>> unsupported request on functions that do not expect that right away.
>>>
>>> We now do checks for FLR support, proceed to the previous command
>>> register write, wait for any pending transactions for some cycles (used
>>> as a placeholder for the suggested 100ms timeout), raise the 'initiate
>>> FLR' bit on the right register, wait on the same kind of loop again and,
>>> finally, clear the transactions pending bit forcefully, as mandated.
>>>
>>> This should leave us with better support for PCI devices out there.
>>>
>>> Signed-off-by: Gustavo Lima Chaves <[email protected]>
>>> ---
>>> hypervisor/include/jailhouse/pci.h | 5 +++
>>> hypervisor/pci.c | 82
>>> +++++++++++++++++++++++++++++++++++---
>>> 2 files changed, 82 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/hypervisor/include/jailhouse/pci.h
>>> b/hypervisor/include/jailhouse/pci.h
>>> index 6d256613..00d52099 100644
>>> --- a/hypervisor/include/jailhouse/pci.h
>>> +++ b/hypervisor/include/jailhouse/pci.h
>>> @@ -40,7 +40,12 @@
>>> #define PCI_CAP_EXPRESS 0x10
>>>
>>> #define PCI_CAP_PCIE (0x10 | JAILHOUSE_PCI_EXT_CAP)
>>> +#define PCIE_DEV_CAPS_REG 0x04
>>> +# define PCI_DEV_CAPS_FLR (1 << 28)
>>> #define PCIE_CONTROL_REG 0x08
>>> +# define PCI_CONTROL_INIT_FLR (1 << 15)
>>> +#define PCIE_STATUS_REG 0x0a
>>> +# define PCI_STATUS_PENDING (1 << 5)
>>>
>>> #define PCIE_AER_FLAGS (PCIE_DEVCTL_CERE | PCIE_DEVCTL_NFERE | \
>>> PCIE_DEVCTL_FERE | PCIE_DEVCTL_URRE)
>>> diff --git a/hypervisor/pci.c b/hypervisor/pci.c
>>> index 30b733c9..f74f6399 100644
>>> --- a/hypervisor/pci.c
>>> +++ b/hypervisor/pci.c
>>> @@ -21,6 +21,8 @@
>>>
>>> #define MSIX_VECTOR_CTRL_DWORD 3
>>>
>>> +#define PCI_RESET_TIMEOUT 250
>>
>> Where was this value derived from? What is its unit (it's apparently
>> "loops" or "cycles")?
>
> The spec suggests:
>
> "After an FLR has been initiated by writing a 1b to the Initiate
> Function Level Reset bit, the Function must complete the FLR within
> 100 ms. If software initiates an FLR when the Transactions Pending
> bit is 1b, then software must not initialize the Function until
> allowing adequate time for any associated Completions to arrive, or
> to achieve reasonable certainty that any remaining Completions will
> never arrive. For this purpose, it is recommended that software
> allow as much time as provided by the pre-FLR value for Completion
> Timeout on the device. If Completion Timeouts were disabled on the
> Function when FLR was issued, then the delay is system dependent but
> must be no less than 100 ms."
>
> Also:
>
> "1. Software that's performing the FLR synchronizes with other
> software that might potentially access the Function directly, and
> ensures such accesses do not occur during this algorithm.
>
> 2. Software clears the entire Command register, disabling the
> Function from issuing any new Requests.
>
> 3. Software polls the Transactions Pending bit in the Device Status
> register either until it is clear or until it has been long enough
> that software is reasonably certain that Completions associated with
> any remaining outstanding Transactions will never arrive. On many
> platforms, the Transactions Pending bit will usually clear within a
> few milliseconds, so software might choose to poll during this
> initial period using a tight software loop. On rare cases when the
> Transactions Pending bit does not clear by this time, software will
> need to poll for a much longer platform-specific period (potentially
> seconds), so software might choose to conduct this polling using a
> timer-based interrupt polling mechanism.
>
> 4. Software initiates the FLR.
>
> 5. Software waits 100 ms.
>
> 6. Software reconfigures the Function and enables it for normal
> operation."
>
> Since we don't have timers at hypervisor level, I though it would be
> reasonable to start with a cycles count on some arbitrary but fair
> amount. Suggestions?
Busy-waiting for hundreds of millisecond (however we measure them) is
not a good thing. If that is really required, we should at least issue
all commands to all affected functions in parallel. What is the normal
overall time needed to complete a function reset?
Measuring these periods without a time base, like we have right now, is
challenging. We avoided introducing a time base for the hypevisor-guest
communication protocol by letting the user specify a loop-based timeout
per machine, to compensate the different machine speeds. Hardcoding
loop-based timeouts is not the way to go unless you have a more or less
known delay in the loop body (like a PCI access).
>
>>
>> Style: indention.
>>
>>> +
>>> #define for_each_configured_pci_device(dev, cell) \
>>> for ((dev) = (cell)->pci_devices; \
>>> (dev) - (cell)->pci_devices < (cell)->config->num_pci_devices; \
>>> @@ -618,10 +620,13 @@ void pci_prepare_handover(void)
>>> }
>>> }
>>>
>>> -void pci_reset_device(struct pci_device *device)
>>> +static void pci_reset_device_internal(struct pci_device *device)
>>> {
>>> const struct jailhouse_pci_capability *cap;
>>> + unsigned timeout = PCI_RESET_TIMEOUT;
>>> + bool found = false;
>>> unsigned int n;
>>> + u32 val;
>>>
>>> memset(&device->msi_registers, 0, sizeof(device->msi_registers));
>>> for (n = 0; n < device->info->num_msix_vectors; n++) {
>>> @@ -635,13 +640,81 @@ void pci_reset_device(struct pci_device *device)
>>> return;
>>> }
>>>
>>> + for_each_pci_cap(cap, device, n) {
>>> + if (cap->id == PCI_CAP_PCIE) {
>>> + found = true;
>>> + break;
>>> + }
>>> + }
>>> +
>>> + if (!found)
>>> + return;
>>
>> Another use case for pci_get_cap().
>
> Done.
>
>>
>>> +
>>> + val = pci_read_config(device->info->bdf,
>>> + cap->start + PCIE_DEV_CAPS_REG, 4);
>>> +
>>> + if ((val & PCI_DEV_CAPS_FLR) != PCI_DEV_CAPS_FLR) {
>>> + printk("Skipping function level reset for "
>>> + "device %02x:%02x.%x: no support\n",
>>> + PCI_BDF_PARAMS(device->info->bdf));
>>> + return;
>>> + }
>>> +
>>> /*
>>> - * Silence INTx of the physical device by setting the mask bit.
>>> - * This is a deviation from the specified reset state.
>>> + * "When a 0 is written to this register, the device is
>>> + * logically disconnected from the PCI bus for all accesses
>>> + * except configuration accesses."
>>> + *
>>> + * The bit that we set on that mask will silence INTx of the
>>> + * physical device.
>>> */
>>> pci_write_config(device->info->bdf, PCI_CFG_COMMAND,
>>> PCI_CMD_INTX_OFF, 2);
>>>
>>> + /*
>>> + * Wait for any pending transaction to end. Ideally we should
>>> + * count for 100ms, but we count arbitrary loops for now
>>> + * instead, by lack of timer notion in the hypervisor.
>>> + */
>>> + do {
>>> + val = pci_read_config(device->info->bdf,
>>> + cap->start + PCIE_STATUS_REG, 2);
>>> + cpu_relax();
>>> + timeout--;
>>> + } while ((val & PCI_STATUS_PENDING) == PCI_STATUS_PENDING
>>> + || timeout);
>>
>> while (val & PCI_STATUS_PENDING || timeout > 0)
>>
>> But what if timeout reaches 0? Just ignore STATUS_PENDING??
>
> Just like the spec suggests, yes, we may abort if the devices takes
> too long (that should not happen in first place).
Abort what? Waiting for the device or even talking to it at all (i.e.
failing it)?
>
>>
>>> +
>>> + /*
>>> + * Set "initiate function level reset" bit and wait a little
>>> + * bit more.
>>> + */
>>> + val = pci_read_config(device->info->bdf,
>>> + cap->start + PCIE_CONTROL_REG, 2);
>>
>> Redundant, we have the current value already.
>
> From control reg? Where?
Sorry, confused status and control reg.
>
>>
>>> + pci_write_config(device->info->bdf, cap->start + PCIE_CONTROL_REG,
>>> + val & PCI_CONTROL_INIT_FLR, 2);
>>> + timeout = PCI_RESET_TIMEOUT;
>>> + while (timeout) {
>>> + cpu_relax();
>>> + timeout--;
>>> + }
>>
>> These two timeout loops have totally different time spans because one is
>> throttled by PCI accesses while the other is just a trivial counter. I
>> suspect we can either drop that second loop completely or have to
>> properly throttle it as well.
>>
>> What does the spec say here?
>
> You're right about the timing differences. What do you suggest to make
> them more equal? Non used cap reads inside the second one? Maybe I'll
> read the status reg inside it already.
If we can derive some reliable time base from PCI accesses, that could
do the trick.
>
>>
>>> +
>>> + /*
>>> + * "The Transactions Pending bit must be cleared upon completion
>>> + * of the FLR."
>>> + */
>>> + val = pci_read_config(device->info->bdf,
>>> + cap->start + PCIE_STATUS_REG, 2);
>>
>> Again: don't we have the current value already?
>
> The status could have changed. I'll do as stated in the last comment,
> for now.
>
>>
>>> + pci_write_config(device->info->bdf, cap->start + PCIE_STATUS_REG,
>>> + val & ~PCI_STATUS_PENDING, 2);
>>> +}
>>> +
>>> +void pci_reset_device(struct pci_device *device)
>>> +{
>>> + const struct jailhouse_pci_capability *cap;
>>> + unsigned int n;
>>> +
>>> + pci_reset_device_internal(device);
>>> +
>>> for_each_pci_cap(cap, device, n) {
>>> if (cap->id == PCI_CAP_MSI || cap->id == PCI_CAP_MSIX)
>>> /* Disable MSI/MSI-X by clearing the control word. */
>>> @@ -937,7 +1010,6 @@ void pci_shutdown(void)
>>> pci_restore_msix(device, cap);
>>>
>>> if (device->cell != &root_cell)
>>> - pci_write_config(device->info->bdf, PCI_CFG_COMMAND,
>>> - PCI_CMD_INTX_OFF, 2);
>>> + pci_reset_device_internal(device);
>>> }
>>> }
>>>
>>
>> Jan
>>
>> --
>> Siemens AG, Corporate Technology, CT RDA IOT SES-DE
>> Corporate Competence Center Embedded Linux
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Jailhouse" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> For more options, visit https://groups.google.com/d/optout.
>
> --
> Gustavo Lima Chaves
> Intel - Open Source Technology Center
>
Jan
--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux
--
You received this message because you are subscribed to the Google Groups
"Jailhouse" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/d/optout.