On 13/10/17 19:59, Will Deacon wrote:
> Hi Robin,
> 
> Some of my comments on patch 3 are addressed here, but I'm really struggling
> to convince myself that this algorithm is correct. My preference would
> be to leave the code as it is for SMMUs that don't implement MSIs, but
> comments below anyway because it's an interesting idea.

From scrounging up boot logs I can see that neither the Cavium nor
HiSilicon SMMUv3 implementations have MSI support (the one in D05
doesn't even have WFE), so there is a real motivation to improve
scalability on current systems - it's not *just* a cool feature!

> On Thu, Aug 31, 2017 at 02:44:28PM +0100, Robin Murphy wrote:
>> Even without the MSI trick, we can still do a lot better than hogging
>> the entire queue while it drains. All we actually need to do for the
>> necessary guarantee of completion is wait for our particular command to
>> have been consumed, and as long as we keep track of where it is there is
>> no need to block other CPUs from adding further commands in the
>> meantime. There is one theoretical (but incredibly unlikely) edge case
>> to avoid, where cons has wrapped twice to still appear 'behind' the sync
>> position - this is easily disambiguated by adding a generation count to
>> the queue to indicate when prod wraps, since cons cannot wrap twice
>> without prod having wrapped at least once.
>>
>> Signed-off-by: Robin Murphy <[email protected]>
>> ---
>>
>> v2: New
>>
>>  drivers/iommu/arm-smmu-v3.c | 81 ++++++++++++++++++++++++++++++++-------------
>>  1 file changed, 58 insertions(+), 23 deletions(-)
>>
>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>> index 311f482b93d5..f5c5da553803 100644
>> --- a/drivers/iommu/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm-smmu-v3.c
>> @@ -417,7 +417,6 @@
>>  
>>  /* High-level queue structures */
>>  #define ARM_SMMU_POLL_TIMEOUT_US    100
>> -#define ARM_SMMU_CMDQ_DRAIN_TIMEOUT_US      1000000 /* 1s! */
>>  #define ARM_SMMU_SYNC_TIMEOUT_US    1000000 /* 1s! */
>>  
>>  #define MSI_IOVA_BASE                       0x8000000
>> @@ -540,6 +539,7 @@ struct arm_smmu_queue {
>>  struct arm_smmu_cmdq {
>>      struct arm_smmu_queue           q;
>>      spinlock_t                      lock;
>> +    int                             generation;
>>  };
>>  
>>  struct arm_smmu_evtq {
>> @@ -770,21 +770,12 @@ static void queue_inc_prod(struct arm_smmu_queue *q)
>>      writel(q->prod, q->prod_reg);
>>  }
>>  
>> -/*
>> - * Wait for the SMMU to consume items. If drain is true, wait until the queue
>> - * is empty. Otherwise, wait until there is at least one free slot.
>> - */
>> -static int queue_poll_cons(struct arm_smmu_queue *q, bool drain, bool wfe)
>> +/* Wait for the SMMU to consume items, until there is at least one free slot */
>> +static int queue_poll_cons(struct arm_smmu_queue *q, bool wfe)
>>  {
>> -    ktime_t timeout;
>> -    unsigned int delay = 1;
>> +    ktime_t timeout = ktime_add_us(ktime_get(), ARM_SMMU_POLL_TIMEOUT_US);
>>  
>> -    /* Wait longer if it's queue drain */
>> -    timeout = ktime_add_us(ktime_get(), drain ?
>> -                                        ARM_SMMU_CMDQ_DRAIN_TIMEOUT_US :
>> -                                        ARM_SMMU_POLL_TIMEOUT_US);
>> -
>> -    while (queue_sync_cons(q), (drain ? !queue_empty(q) : queue_full(q))) {
>> +    while (queue_sync_cons(q), queue_full(q)) {
>>              if (ktime_compare(ktime_get(), timeout) > 0)
>>                      return -ETIMEDOUT;
>>  
>> @@ -792,8 +783,7 @@ static int queue_poll_cons(struct arm_smmu_queue *q, bool drain, bool wfe)
>>                      wfe();
>>              } else {
>>                      cpu_relax();
>> -                    udelay(delay);
>> -                    delay *= 2;
>> +                    udelay(1);
>>              }
>>      }
>>  
>> @@ -959,15 +949,20 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
>>      queue_write(Q_ENT(q, cons), cmd, q->ent_dwords);
>>  }
>>  
>> -static void arm_smmu_cmdq_insert_cmd(struct arm_smmu_device *smmu, u64 *cmd)
>> +static u32 arm_smmu_cmdq_insert_cmd(struct arm_smmu_device *smmu, u64 *cmd)
>>  {
>>      struct arm_smmu_queue *q = &smmu->cmdq.q;
>>      bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
>>  
>>      while (queue_insert_raw(q, cmd) == -ENOSPC) {
>> -            if (queue_poll_cons(q, false, wfe))
>> +            if (queue_poll_cons(q, wfe))
>>                      dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
>>      }
>> +
>> +    if (Q_IDX(q, q->prod) == 0)
>> +            smmu->cmdq.generation++;
> 
> The readers of generation are using READ_ONCE, so you're missing something
> here.

Yeah, I was a bit back-and-forth on this. The readers want a READ_ONCE
if only to prevent it being hoisted out of the polling loop, but as long
as the update of generation is single-copy-atomic, the exact point at
which it occurs shouldn't matter so much, since it's only written under
the cmdq lock. I guess it depends how much we trust GCC's claim of the
atomicity of int - I have no great objection to

        WRITE_ONCE(smmu->cmdq.generation, smmu->cmdq.generation + 1);

other than it being really long.

>> +
>> +    return q->prod;
>>  }
>>  
>>  static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
>> @@ -997,15 +992,54 @@ static int arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx)
>>      return (int)(val - sync_idx) < 0 ? -ETIMEDOUT : 0;
>>  }
>>  
>> +static int arm_smmu_sync_poll_cons(struct arm_smmu_device *smmu, u32 sync_idx,
>> +                               int sync_gen)
>> +{
>> +    ktime_t timeout = ktime_add_us(ktime_get(), ARM_SMMU_SYNC_TIMEOUT_US);
>> +    struct arm_smmu_queue *q = &smmu->cmdq.q;
>> +    bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
>> +    unsigned int delay = 1;
>> +
>> +    do {
>> +            queue_sync_cons(q);
>> +            /*
>> +             * If we see updates quickly enough, cons has passed sync_idx,
>> +             * but not yet wrapped. At worst, cons might have actually
>> +             * wrapped an even number of times, but that still guarantees
>> +             * the original sync must have been consumed.
>> +             */
>> +            if (Q_IDX(q, q->cons) >= Q_IDX(q, sync_idx) &&
>> +                Q_WRP(q, sync_idx) == Q_WRP(q, q->cons))
> 
> Can you move this into a separate function please, like we have for
> queue_full and queue_empty?

OK, but given that it's only half of the "has cons moved past this
index" operation, I'm not really sure what it could be called -
queue_ahead_not_wrapped() comes to mind, but still seems a bit cryptic.

>> +                    return 0;
>> +            /*
>> +             * Otherwise, cons may have passed sync_idx and wrapped one or
>> +             * more times to appear behind it again, but in that case prod
>> +             * must also be one or more generations ahead.
>> +             */
>> +            if (Q_IDX(q, q->cons) < Q_IDX(q, sync_idx) &&
>> +                READ_ONCE(smmu->cmdq.generation) != sync_gen)
> 
> There's another daft overflow case here which deserves a comment (and even
> if it *did* happen, we'll recover gracefully).

You mean exactly 2^32 queue generations passing between polls? Yeah, not
happening :P

>> +                    return 0;
>> +
>> +            if (wfe) {
>> +                    wfe();
> 
> This is a bit scary... if we miss a generation update, just due to store
> propagation latency (maybe it's buffered by the updater), I think we can
> end up going into wfe when there's not an event pending. Using xchg
> everywhere might work, but there's still a race on having somebody update
> generation. The ordering here just looks generally suspicious to me because
> you have the generation writer in arm_smmu_cmdq_insert_cmd effectively
> doing:
> 
>   Write prod
>   Write generation
> 
> and the reader in arm_smmu_sync_poll_cons doing:
> 
>   Read cons
>   Read generation
> 
> so I can't see anything that gives you order to guarantee that the
> generation is seen to be up-to-date.

On reflection I'm pretty sure the below should suffice, to piggy-back
off the DSB implicit in queue_insert_raw(). The readers only care about
the generation *after* cons has been observed to go backwards, so the
exact order of the generation and prod updates makes no practical
difference other than closing that race window.

Robin

----->8-----
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 12cdc5e50675..78ba8269c44c 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -959,14 +959,14 @@ static u32 arm_smmu_cmdq_insert_cmd(struct arm_smmu_device *smmu, u64 *cmd)
        struct arm_smmu_queue *q = &smmu->cmdq.q;
        bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);

+       if (Q_IDX(q, q->prod + 1) == 0)
+               smmu->cmdq.generation++;
+
        while (queue_insert_raw(q, cmd) == -ENOSPC) {
                if (queue_poll_cons(q, wfe))
                        dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
        }

-       if (Q_IDX(q, q->prod) == 0)
-               smmu->cmdq.generation++;
-
        return q->prod;
 }

_______________________________________________
iommu mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/iommu