On 2018/5/31 22:25, Robin Murphy wrote:
> On 31/05/18 14:49, Hanjun Guo wrote:
>> Hi Robin,
>>
>> On 2018/5/31 19:24, Robin Murphy wrote:
>>> On 31/05/18 08:42, Zhen Lei wrote:
>>>> In general, an IOMMU unmap operation follows the steps below:
>>>> 1. remove the mapping from the page table for the specified IOVA range
>>>> 2. execute a TLBI command to invalidate the mapping cached in the TLB
>>>> 3. wait for the above TLBI operation to finish
>>>> 4. free the IOVA resource
>>>> 5. free the physical memory resource
>>>>
>>>> This may be a problem when unmaps are very frequent: the combination of
>>>> TLBI and wait operations consumes a lot of time. A feasible method is to
>>>> defer the TLBI and IOVA-free operations; once a certain number accumulates
>>>> or a specified time elapses, execute a single tlbi_all command to clean up
>>>> the TLB, then free the batched IOVAs. Call this non-strict mode.
>>>>
>>>> But it must be noted that, although the mapping has already been removed
>>>> from the page table, it may still exist in the TLB, and the freed physical
>>>> memory may be reused for other purposes. So an attacker could keep
>>>> accessing memory through the just-freed IOVA to obtain sensitive data or
>>>> corrupt memory. Therefore VFIO should always choose strict mode.
>>>>
>>>> Some may consider deferring the physical memory free as well, which would
>>>> still follow strict mode. But in the map_sg cases, the memory allocation
>>>> is not controlled by the IOMMU APIs, so this is not enforceable.
>>>>
>>>> Fortunately, Intel and AMD have already applied non-strict mode and put
>>>> the queue_iova() operation into the common file dma-iommu.c, and my work
>>>> is based on it. The difference is that the arm-smmu-v3 driver calls the
>>>> common IOMMU APIs to unmap, whereas the Intel and AMD IOMMU drivers do not.
>>>>
>>>> Below is the performance data of strict vs non-strict mode for an NVMe device:
>>>> Random Read  IOPS: 146K (strict) vs 573K (non-strict)
>>>> Random Write IOPS: 143K (strict) vs 513K (non-strict)
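
For reference, the deferred flow described in the quoted commit message boils
down to something like the sketch below. This is only an illustration of the
idea, not the actual dma-iommu.c code; FQ_SIZE, fq_entry, tlbi_all() and
iova_free() are made-up stand-ins for the real queue_iova() machinery.

#include <stddef.h>
#include <stdint.h>

#define FQ_SIZE 256

struct fq_entry {
	uint64_t iova;
	size_t size;
};

static struct fq_entry fq[FQ_SIZE];
static unsigned int fq_count;

extern void tlbi_all(void);			/* one TLBI_NH_ALL + sync */
extern void iova_free(uint64_t iova, size_t size);

/* Non-strict mode: called at unmap time instead of tlbi + sync + free. */
static void queue_iova_sketch(uint64_t iova, size_t size)
{
	fq[fq_count].iova = iova;
	fq[fq_count].size = size;

	if (++fq_count == FQ_SIZE) {	/* threshold reached (or a timer fires) */
		unsigned int i;

		tlbi_all();	/* a single invalidate-all covers the whole batch */
		for (i = 0; i < fq_count; i++)
			iova_free(fq[i].iova, fq[i].size);
		fq_count = 0;
	}
}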
>>>
>>> What hardware is this on? If it's SMMUv3 without MSIs (e.g. D05), then 
>>> you'll still be using the rubbish globally-blocking sync implementation. If 
>>> that is the case, I'd be very interested to see how much there is to gain 
>>> from just improving that - I've had a patch kicking around for a while[1] 
>>> (also on a rebased branch at [2]), but don't have the means for serious 
>>> performance testing.
I will try your patch to see how much it can improve things. I think the best
way to resolve the globally-blocking sync is for the hardware to provide a
64-bit CONS register, so that it can never wrap and the spinlock can also
be removed.
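
With such a hypothetical 64-bit index, the sync poll could look like the
sketch below. The names (cmd_sync_done, cons64_reg, my_prod) are made up for
illustration; this is not the current arm-smmu-v3 code.

#include <stdbool.h>
#include <stdint.h>

static inline bool cmd_sync_done(const volatile uint64_t *cons64_reg,
				 uint64_t my_prod)
{
	/*
	 * With a never-wrapping 64-bit index, "the hardware has consumed
	 * past my CMD_SYNC entry" is a plain comparison: no wrap detection
	 * and no lock serialising concurrent pollers.
	 */
	return *cons64_reg > my_prod;
}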

>>
>> The hardware is the new D06, whose SMMU supports MSIs,
> 
> Cool! Now that profiling is fairly useful since we got rid of most of the 
> locks, are you able to get an idea of how the overhead in the normal case is 
> distributed between arm_smmu_cmdq_insert_cmd() and 
> __arm_smmu_sync_poll_msi()? We're always trying to improve our understanding 
> of where command-queue-related overheads turn out to be in practice, and 
> there's still potentially room to do nicer things than TLBI_NH_ALL ;)
Even if the software has no overhead, there may still be a problem, because
the SMMU needs to execute the commands in sequence, especially before the
globally-blocking sync has been removed. Based on the actual execution time
of a single TLBI and sync, we can derive the theoretical upper limit, as in
the sketch below.
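
For example (back-of-envelope only; the 2us drain time below is an assumption,
not a measurement):

#include <stdio.h>

int main(void)
{
	double tlbi_sync_ns = 2000.0;	/* assumed drain time of one TLBI+SYNC pair */
	double max_unmaps_per_sec = 1e9 / tlbi_sync_ns;

	/* With one TLBI+SYNC per unmap, strict mode cannot exceed this rate. */
	printf("theoretical cap: %.0fK unmaps/s\n", max_unmaps_per_sec / 1e3);
	return 0;
}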

BTW, I will reply to the rest of the mail next week. I'm busy with other things now.

> 
> Robin.
> 
>> it's not D05 :)
>>
>> Thanks
>> Hanjun
>>

-- 
Thanks!
Best regards
