On 2018/5/31 22:25, Robin Murphy wrote:
> On 31/05/18 14:49, Hanjun Guo wrote:
>> Hi Robin,
>>
>> On 2018/5/31 19:24, Robin Murphy wrote:
>>> On 31/05/18 08:42, Zhen Lei wrote:
>>>> In general, an IOMMU unmap operation follows the steps below:
>>>> 1. remove the mapping from the page table for the specified IOVA range
>>>> 2. execute a TLBI command to invalidate the mapping cached in the TLB
>>>> 3. wait for the above TLBI operation to finish
>>>> 4. free the IOVA resource
>>>> 5. free the physical memory resource
>>>>
>>>> This may become a problem when unmaps are very frequent, because the
>>>> combination of TLBI and wait operations consumes a lot of time. A feasible
>>>> method is to defer the TLBI and IOVA-free operations: once a certain number
>>>> of entries has accumulated, or a specified time has elapsed, execute a
>>>> single tlbi_all command to clean up the TLB, then free the backed-up IOVAs.
>>>> This is marked as non-strict mode.
>>>>
>>>> But it must be noted that, although the mapping has already been removed
>>>> from the page table, it may still exist in the TLB, and the freed physical
>>>> memory may be reused for other purposes. So an attacker could keep
>>>> accessing memory through the just-freed IOVA to obtain sensitive data or
>>>> corrupt memory. VFIO should therefore always choose strict mode.
>>>>
>>>> Some may consider deferring the physical memory free as well, which would
>>>> still follow strict mode. But for the map_sg cases, the memory allocation
>>>> is not controlled by the IOMMU APIs, so this is not enforceable.
>>>>
>>>> Fortunately, Intel and AMD have already applied non-strict mode and moved
>>>> the queue_iova() operation into the common file dma-iommu.c, and my work is
>>>> based on that. The difference is that the arm-smmu-v3 driver calls the
>>>> common IOMMU APIs to unmap, while the Intel and AMD IOMMU drivers do not.
>>>>
>>>> Below is the performance data of strict vs non-strict for an NVMe device:
>>>> Random read IOPS:  146K (strict) vs 573K (non-strict)
>>>> Random write IOPS: 143K (strict) vs 513K (non-strict)
>>>
>>> What hardware is this on? If it's SMMUv3 without MSIs (e.g. D05), then
>>> you'll still be using the rubbish globally-blocking sync implementation. If
>>> that is the case, I'd be very interested to see how much there is to gain
>>> from just improving that - I've had a patch kicking around for a while[1]
>>> (also on a rebased branch at [2]), but don't have the means for serious
>>> performance testing.
I will try your patch to see how much it can improve things. I think the best
way to resolve the globally-blocking sync is for the hardware to provide a
64-bit CONS register, so that it can never wrap and the spinlock can also be
removed.
>>
>> The hardware is the new D06, whose SMMU works with MSIs,
>
> Cool! Now that profiling is fairly useful since we got rid of most of the
> locks, are you able to get an idea of how the overhead in the normal case is
> distributed between arm_smmu_cmdq_insert_cmd() and
> __arm_smmu_sync_poll_msi()? We're always trying to improve our understanding
> of where command-queue-related overheads turn out to be in practice, and
> there's still potentially room to do nicer things than TLBI_NH_ALL ;)
Even if the software had no overhead, there could still be a problem, because
the SMMU has to execute the commands in sequence, especially before the
globally-blocking sync is removed. Based on the actual execution time of a
single TLBI and sync, we can derive the theoretical upper limit.
BTW, I will reply to the rest of the mail next week. I'm busy with other things now.
>
> Robin.
>
>> it's not D05 :)
>>
>> Thanks
>> Hanjun
>>
>
> .
>
--
Thanks!
Best Regards
_______________________________________________
iommu mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/iommu