On 13/10/17 19:59, Will Deacon wrote:
> Hi Robin,
>
> Some of my comments on patch 3 are addressed here, but I'm really struggling
> to convince myself that this algorithm is correct. My preference would
> be to leave the code as it is for SMMUs that don't implement MSIs, but
> comments below anyway because it's an interesting idea.
From scrounging up boot logs I can see that neither the Cavium nor
HiSilicon SMMUv3 implementations have MSI support (the one in D05 doesn't
even have WFE), so there is a real motivation to improve scalability on
current systems - it's not *just* a cool feature!

> On Thu, Aug 31, 2017 at 02:44:28PM +0100, Robin Murphy wrote:
>> Even without the MSI trick, we can still do a lot better than hogging
>> the entire queue while it drains. All we actually need to do for the
>> necessary guarantee of completion is wait for our particular command to
>> have been consumed, and as long as we keep track of where it is there is
>> no need to block other CPUs from adding further commands in the
>> meantime. There is one theoretical (but incredibly unlikely) edge case
>> to avoid, where cons has wrapped twice to still appear 'behind' the sync
>> position - this is easily disambiguated by adding a generation count to
>> the queue to indicate when prod wraps, since cons cannot wrap twice
>> without prod having wrapped at least once.
>>
>> Signed-off-by: Robin Murphy <[email protected]>
>> ---
>>
>> v2: New
>>
>>  drivers/iommu/arm-smmu-v3.c | 81 ++++++++++++++++++++++++++++++++-------------
>>  1 file changed, 58 insertions(+), 23 deletions(-)
>>
>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>> index 311f482b93d5..f5c5da553803 100644
>> --- a/drivers/iommu/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm-smmu-v3.c
>> @@ -417,7 +417,6 @@
>>  
>>  /* High-level queue structures */
>>  #define ARM_SMMU_POLL_TIMEOUT_US	100
>> -#define ARM_SMMU_CMDQ_DRAIN_TIMEOUT_US	1000000 /* 1s! */
>>  #define ARM_SMMU_SYNC_TIMEOUT_US	1000000 /* 1s! */
>>  
>>  #define MSI_IOVA_BASE			0x8000000
>> @@ -540,6 +539,7 @@ struct arm_smmu_queue {
>>  struct arm_smmu_cmdq {
>>  	struct arm_smmu_queue		q;
>>  	spinlock_t			lock;
>> +	int				generation;
>>  };
>>  
>>  struct arm_smmu_evtq {
>> @@ -770,21 +770,12 @@ static void queue_inc_prod(struct arm_smmu_queue *q)
>>  	writel(q->prod, q->prod_reg);
>>  }
>>  
>> -/*
>> - * Wait for the SMMU to consume items. If drain is true, wait until the queue
>> - * is empty. Otherwise, wait until there is at least one free slot.
>> - */
>> -static int queue_poll_cons(struct arm_smmu_queue *q, bool drain, bool wfe)
>> +/* Wait for the SMMU to consume items, until there is at least one free slot */
>> +static int queue_poll_cons(struct arm_smmu_queue *q, bool wfe)
>>  {
>> -	ktime_t timeout;
>> -	unsigned int delay = 1;
>> +	ktime_t timeout = ktime_add_us(ktime_get(), ARM_SMMU_POLL_TIMEOUT_US);
>>  
>> -	/* Wait longer if it's queue drain */
>> -	timeout = ktime_add_us(ktime_get(), drain ?
>> -					    ARM_SMMU_CMDQ_DRAIN_TIMEOUT_US :
>> -					    ARM_SMMU_POLL_TIMEOUT_US);
>> -
>> -	while (queue_sync_cons(q), (drain ? !queue_empty(q) : queue_full(q))) {
>> +	while (queue_sync_cons(q), queue_full(q)) {
>>  		if (ktime_compare(ktime_get(), timeout) > 0)
>>  			return -ETIMEDOUT;
>>  
>> @@ -792,8 +783,7 @@ static int queue_poll_cons(struct arm_smmu_queue *q, bool drain, bool wfe)
>>  			wfe();
>>  		} else {
>>  			cpu_relax();
>> -			udelay(delay);
>> -			delay *= 2;
>> +			udelay(1);
>>  		}
>>  	}
>>  
>> @@ -959,15 +949,20 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
>>  	queue_write(Q_ENT(q, cons), cmd, q->ent_dwords);
>>  }
>>  
>> -static void arm_smmu_cmdq_insert_cmd(struct arm_smmu_device *smmu, u64 *cmd)
>> +static u32 arm_smmu_cmdq_insert_cmd(struct arm_smmu_device *smmu, u64 *cmd)
>>  {
>>  	struct arm_smmu_queue *q = &smmu->cmdq.q;
>>  	bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
>>  
>>  	while (queue_insert_raw(q, cmd) == -ENOSPC) {
>> -		if (queue_poll_cons(q, false, wfe))
>> +		if (queue_poll_cons(q, wfe))
>>  			dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
>>  	}
>> +
>> +	if (Q_IDX(q, q->prod) == 0)
>> +		smmu->cmdq.generation++;
>
> The readers of generation are using READ_ONCE, so you're missing something
> here.

Yeah, I was a bit back-and-forth on this. The readers want a READ_ONCE if
only to prevent it being hoisted out of the polling loop, but as long as
the update of generation is single-copy-atomic, the exact point at which
it occurs shouldn't matter so much, since it's only written under the
cmdq lock. I guess it depends how much we trust GCC's claim of the
atomicity of int - I have no great objection to

	WRITE_ONCE(smmu->cmdq.generation, smmu->cmdq.generation + 1);

other than it being really long.
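Spelled out as a pair of purely illustrative helpers (names are mine, not
part of the patch), the intended discipline is: the writer is always
serialised by the cmdq lock and only needs its store to be
single-copy-atomic, while the lockless reader needs READ_ONCE to force a
fresh load on each poll:

	/* Sketch only - helper names illustrative, not from the patch */
	static void arm_smmu_cmdq_bump_generation(struct arm_smmu_device *smmu)
	{
		lockdep_assert_held(&smmu->cmdq.lock);
		/* Serialised by cmdq.lock; WRITE_ONCE just makes the
		 * single-copy-atomicity explicit instead of trusting GCC */
		WRITE_ONCE(smmu->cmdq.generation, smmu->cmdq.generation + 1);
	}

	static bool arm_smmu_cmdq_generation_moved(struct arm_smmu_device *smmu,
						   int sync_gen)
	{
		/* Lockless poller: READ_ONCE stops the compiler hoisting
		 * the load out of the surrounding polling loop */
		return READ_ONCE(smmu->cmdq.generation) != sync_gen;
	}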
>> +
>> +	return q->prod;
>>  }
>>  
>>  static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
>> @@ -997,15 +992,54 @@ static int arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx)
>>  	return (int)(val - sync_idx) < 0 ? -ETIMEDOUT : 0;
>>  }
>>  
>> +static int arm_smmu_sync_poll_cons(struct arm_smmu_device *smmu, u32 sync_idx,
>> +				   int sync_gen)
>> +{
>> +	ktime_t timeout = ktime_add_us(ktime_get(), ARM_SMMU_SYNC_TIMEOUT_US);
>> +	struct arm_smmu_queue *q = &smmu->cmdq.q;
>> +	bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
>> +	unsigned int delay = 1;
>> +
>> +	do {
>> +		queue_sync_cons(q);
>> +		/*
>> +		 * If we see updates quickly enough, cons has passed sync_idx,
>> +		 * but not yet wrapped. At worst, cons might have actually
>> +		 * wrapped an even number of times, but that still guarantees
>> +		 * the original sync must have been consumed.
>> +		 */
>> +		if (Q_IDX(q, q->cons) >= Q_IDX(q, sync_idx) &&
>> +		    Q_WRP(q, sync_idx) == Q_WRP(q, q->cons))
>
> Can you move this into a separate function please, like we have for
> queue_full and queue_empty?

OK, but given that it's only half of the "has cons moved past this index"
operation, I'm not really sure what it could be called -
queue_ahead_not_wrapped() comes to mind, but still seems a bit cryptic.
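For the sake of argument, a sketch of what such a helper might look like
(name as above, still open to better suggestions), lifting the condition
verbatim from the hunk quoted above:

	/* Sketch: cons has reached idx and agrees with it on wrap phase,
	 * i.e. the "passed but not wrapped" half of the completion check */
	static bool queue_ahead_not_wrapped(struct arm_smmu_queue *q, u32 idx)
	{
		return Q_IDX(q, q->cons) >= Q_IDX(q, idx) &&
		       Q_WRP(q, q->cons) == Q_WRP(q, idx);
	}

The first check in arm_smmu_sync_poll_cons() would then collapse to
"if (queue_ahead_not_wrapped(q, sync_idx))".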
>> +			return 0;
>> +		/*
>> +		 * Otherwise, cons may have passed sync_idx and wrapped one or
>> +		 * more times to appear behind it again, but in that case prod
>> +		 * must also be one or more generations ahead.
>> +		 */
>> +		if (Q_IDX(q, q->cons) < Q_IDX(q, sync_idx) &&
>> +		    READ_ONCE(smmu->cmdq.generation) != sync_gen)
>
> There's another daft overflow case here which deserves a comment (and even
> if it *did* happen, we'll recover gracefully).

You mean exactly 2^32 queue generations passing between polls? Yeah, not
happening :P

>> +			return 0;
>> +
>> +		if (wfe) {
>> +			wfe();
>
> This is a bit scary... if we miss a generation update, just due to store
> propagation latency (maybe it's buffered by the updater), I think we can
> end up going into wfe when there's not an event pending. Using xchg
> everywhere might work, but there's still a race on having somebody update
> generation. The ordering here just looks generally suspicious to me because
> you have the generation writer in arm_smmu_cmdq_insert_cmd effectively
> doing:
>
>   Write prod
>   Write generation
>
> and the reader in arm_smmu_sync_poll_cons doing:
>
>   Read cons
>   Read generation
>
> so I can't see anything that gives you order to guarantee that the
> generation is seen to be up-to-date.

On reflection I'm pretty sure the below should suffice, to piggy-back off
the DSB implicit in queue_insert_raw(): bumping the generation *before*
publishing the wrapped prod means anyone who can observe the effects of
the wrap can also observe the new generation. The readers only care about
the generation *after* cons has been observed to go backwards, so the
exact order of the generation and prod updates makes no practical
difference other than closing that race window.

Robin

----->8-----
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 12cdc5e50675..78ba8269c44c 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -959,14 +959,14 @@ static u32 arm_smmu_cmdq_insert_cmd(struct arm_smmu_device *smmu, u64 *cmd)
 	struct arm_smmu_queue *q = &smmu->cmdq.q;
 	bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
 
+	if (Q_IDX(q, q->prod + 1) == 0)
+		smmu->cmdq.generation++;
+
 	while (queue_insert_raw(q, cmd) == -ENOSPC) {
 		if (queue_poll_cons(q, wfe))
 			dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
 	}
 
-	if (Q_IDX(q, q->prod) == 0)
-		smmu->cmdq.generation++;
-
 	return q->prod;
 }
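For completeness, since it isn't quoted anywhere above: the issuing side is
expected to snapshot the sync command's position and the generation
together, inside the lock, then poll locklessly. A sketch based on the
rest of the series (details assumed, not code from this mail):

	static int arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
	{
		struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC };
		u64 cmd[CMDQ_ENT_DWORDS];
		unsigned long flags;
		u32 sync_idx;
		int sync_gen;

		arm_smmu_cmdq_build_cmd(cmd, &ent);

		/* prod and generation must be sampled atomically w.r.t.
		 * other producers, hence inside the cmdq lock */
		spin_lock_irqsave(&smmu->cmdq.lock, flags);
		sync_idx = arm_smmu_cmdq_insert_cmd(smmu, cmd);
		sync_gen = READ_ONCE(smmu->cmdq.generation);
		spin_unlock_irqrestore(&smmu->cmdq.lock, flags);

		/* ...then wait for our particular sync without blocking
		 * other CPUs from inserting further commands */
		return arm_smmu_sync_poll_cons(smmu, sync_idx, sync_gen);
	}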
