[sorry, I had intended to reply sooner but clearly forgot]
On 16/02/18 00:13, Tomasz Figa wrote:
On Fri, Feb 16, 2018 at 2:14 AM, Robin Murphy <robin.mur...@arm.com> wrote:
On 15/02/18 04:17, Tomasz Figa wrote:
Could you elaborate on what kind of locking you are concerned about?
As I explained before, the normally happening fast path would lock
dev->power_lock only for the brief moment of incrementing the runtime
PM usage counter.
My bad, that's not even it.
The atomic usage counter is incremented beforehands, without any
locking  and the spinlock is acquired only for the sake of
validating that device's runtime PM state remained valid indeed ,
which would be the case in the fast path of the same driver doing two
mappings in parallel, with the master powered on (and so the SMMU,
through device links; if master was not powered on already, powering
on the SMMU is unavoidable anyway and it would add much more latency
than the spinlock itself).
We now have no locking at all in the map path, and only a per-domain lock
around TLB sync in unmap which is unfortunately necessary for correctness;
the latter isn't too terrible, since in "serious" hardware it should only be
serialising a few cpus serving the same device against each other (e.g. for
multiple queues on a single NIC).
Putting in a global lock which serialises *all* concurrent map and unmap
calls for *all* unrelated devices makes things worse. Period. Even if the
lock itself were held for the minimum possible time, i.e. trivially
"spin_lock(&lock); spin_unlock(&lock)", the cost of repeatedly bouncing that
one cache line around between 96 CPUs across two sockets is not negligible.
Fair enough. Note that we're in a quite interesting situation now:
a) We need to have runtime PM enabled on Qualcomm SoC to have power
b) We need to have lock-free map/unmap on such distributed systems,
c) If runtime PM is enabled, we need to call into runtime PM from any
code that does hardware accesses, otherwise the IOMMU API (and so DMA
API and then any V4L2 driver) becomes unusable.
I can see one more way that could potentially let us have all the
three. How about enabling runtime PM only on selected implementations
(e.g. qcom,smmu) and then having all the runtime PM calls surrounded
with if (pm_runtime_enabled()), which is lockless?
Yes, that's the kind of thing I was gravitating towards - my vague
thought was adding some flag to the smmu_domain, but
pm_runtime_enabled() does look conceptually a lot cleaner.
In any case, I can't imagine this working with V4L2 or anything else
relying on any memory management more generic than calling IOMMU API
directly from the driver, with the IOMMU device having runtime PM
enabled, but without managing the runtime PM from the IOMMU driver's
callbacks that need access to the hardware. As I mentioned before,
only the IOMMU driver knows when exactly the real hardware access
needs to be done (e.g. Rockchip/Exynos don't need to do that for
map/unmap if the power is down, but some implementations of SMMU with
TLB powered separately might need to do so).
It's worth noting that Exynos and Rockchip are relatively small
self-contained IP blocks integrated closely with the interfaces of their
relevant master devices; SMMU is an architecture, implementations of which
may be large, distributed, and have complex and wildly differing internal
topologies. As such, it's a lot harder to make hardware-specific assumptions
and/or be correct for all possible cases.
Don't get me wrong, I do ultimately agree that the IOMMU driver is the only
agent who ultimately knows what calls are going to be necessary for whatever
operation it's performing on its own hardware*; it's just that for SMMU it
needs to be implemented in a way that has zero impact on the cases where it
doesn't matter, because it's not viable to specialise that driver for any
particular IP implementation/use-case.
Still, exactly the same holds for the low power embedded use cases,
where we strive for the lowest possible power consumption, while
maintaining performance levels high as well. And so the SMMU code is
expected to also work with our use cases, such as V4L2 or DRM drivers.
Since these points don't hold for current SMMU code, I could say that
the it has been already specialized for large, distributed
Heh, really it's specialised for ease of maintenance in terms of doing
as little as we can get away with, but for what we have implemented,
fast code does save CPU cycles and power on embedded systems too ;)