On 04/14/2018 10:33 AM, Daniel Vetter wrote:
On Fri, Apr 13, 2018 at 10:23 PM, Thomas Hellstrom
On 04/13/2018 07:13 PM, Daniel Vetter wrote:
On Wed, Apr 11, 2018 at 10:27:06AM +0200, Thomas Hellstrom wrote:
2) Should we add a *real* wound-wait choice to our wound-wait mutexes.
Otherwise perhaps rename them or document that they're actually doing
I think a doc patch would be good at least. Including all the data you
Actually, a further investigation appears to indicate that manipulating the
lock state under a local spinlock is about as fast as using atomic operations
even for the completely uncontended cases.
This means that we could have a solution where you decide on a per-mutex or
per-reservation object basis whether you want to manipulate lock-state under
a "batch group" spinlock, meaning certain performance characteristics, or
traditional local locking, meaning other performance characteristics.
Like, vmwgfx could choose batching locks, radeon traditional locks, but the
same API would work for both and locks could be shared between drivers.
Don't we need to make this decision at least on a per-class level?
No, I was thinking more along the lines of the ww_mutex having a pointer to
the spinlock. It could either be the local mutex "wait_lock", or a
per-batch-group lock. The mutex code wouldn't care. We do need a special
API for batched locking, though, but not for ordinary locking.
Both APIs should be able to handle local or grouped spinlocks.
Note that this would of course require that there would be no
performance loss for users that don't use batch groups.
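
To make the "pointer to the spinlock" idea concrete, here is a minimal user-space sketch. It is not the kernel implementation: every sketch_* name is hypothetical, the spinlock is a simple C11 test-and-set stand-in for spinlock_t, and waiting, wounding and lockdep are left out entirely. It only shows that the lock/unlock code never needs to know whether the state lock is the mutex's own wait_lock or a lock shared by a batch group.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's spinlock_t (assumption: plain test-and-set). */
struct sketch_spinlock {
	atomic_bool locked;
};

static inline void sketch_spin_init(struct sketch_spinlock *l)
{
	atomic_init(&l->locked, false);
}

static inline void sketch_spin_lock(struct sketch_spinlock *l)
{
	while (atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
		; /* spin until the previous holder releases */
}

static inline void sketch_spin_unlock(struct sketch_spinlock *l)
{
	atomic_store_explicit(&l->locked, false, memory_order_release);
}

/* Hypothetical mutex whose lock state is plain data guarded by *state_lock. */
struct sketch_ww_mutex {
	struct sketch_spinlock wait_lock;   /* the mutex's own local lock */
	struct sketch_spinlock *state_lock; /* &wait_lock or a batch-group lock */
	const void *owner;                  /* NULL when unlocked */
};

/* Pass group_lock == NULL for traditional local locking. */
static inline void sketch_ww_mutex_init(struct sketch_ww_mutex *m,
					struct sketch_spinlock *group_lock)
{
	sketch_spin_init(&m->wait_lock);
	m->state_lock = group_lock ? group_lock : &m->wait_lock;
	m->owner = NULL;
}

/* Same code path for both flavours: only the pointer differs. */
static inline bool sketch_ww_mutex_trylock(struct sketch_ww_mutex *m,
					   const void *ctx)
{
	bool acquired = false;

	sketch_spin_lock(m->state_lock);
	if (!m->owner) {
		m->owner = ctx;
		acquired = true;
	}
	sketch_spin_unlock(m->state_lock);
	return acquired;
}

static inline void sketch_ww_mutex_unlock(struct sketch_ww_mutex *m)
{
	sketch_spin_lock(m->state_lock);
	m->owner = NULL;
	sketch_spin_unlock(m->state_lock);
}
```

A driver wanting traditional behaviour would initialize with a NULL group lock, while a batching driver would pass its group's spinlock; callers of trylock/unlock look identical in both cases.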
I guess the most efficient use for GPU command submission would be to
use per-process batch-groups. Then when the batch encounters a ww_mutex
with a different batch group (for example the display server shared
surface), it'll just switch batch lock, and this way the contention for
the batch spinlock will be mostly eliminated.
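
A rough user-space sketch of the batch-lock switching described above, under stated assumptions: the sketch_* names, including the batch start/end helpers, are hypothetical and not an existing kernel API; the spinlock is a C11 test-and-set stand-in; and contention handling (waiting, wounding, backoff) is reduced to a failed trylock. The point is only that the group spinlock stays held across consecutive mutexes of the same group and is swapped when a mutex from another group shows up.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's spinlock_t (assumption: plain test-and-set). */
struct sketch_spinlock {
	atomic_bool locked;
};

static inline void sketch_spin_init(struct sketch_spinlock *l)
{
	atomic_init(&l->locked, false);
}

static inline void sketch_spin_lock(struct sketch_spinlock *l)
{
	while (atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
		; /* spin */
}

static inline void sketch_spin_unlock(struct sketch_spinlock *l)
{
	atomic_store_explicit(&l->locked, false, memory_order_release);
}

/* Hypothetical mutex whose owner field is guarded by its group's lock. */
struct sketch_mutex {
	struct sketch_spinlock *group_lock;
	const void *owner; /* NULL when unlocked */
};

/* Hypothetical per-submission batch state. */
struct sketch_batch {
	const void *ctx;              /* identifies the acquiring context */
	struct sketch_spinlock *held; /* group lock currently held, or NULL */
	unsigned int switches;        /* how often a group lock was taken */
};

static inline void sketch_batch_start(struct sketch_batch *b, const void *ctx)
{
	b->ctx = ctx;
	b->held = NULL;
	b->switches = 0;
}

/* Take one mutex inside a batch, switching group locks only when needed. */
static inline bool sketch_batch_trylock(struct sketch_batch *b,
					struct sketch_mutex *m)
{
	if (b->held != m->group_lock) {
		if (b->held)
			sketch_spin_unlock(b->held);
		sketch_spin_lock(m->group_lock);
		b->held = m->group_lock;
		b->switches++;
	}
	if (m->owner)
		return false; /* real code would wait or back off here */
	m->owner = b->ctx;
	return true;
}

static inline void sketch_batch_end(struct sketch_batch *b)
{
	if (b->held)
		sketch_spin_unlock(b->held);
	b->held = NULL;
}
```

With per-process groups, a submission that locks many private buffers plus one shared display-server surface would take two spinlocks in total, regardless of how many private buffers it touches.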
how will the spinlock/batch-lock approach interact with the normal
ww_mutex_lock path (which does require the atomics/ordered stores
we're trying to avoid)?
We can use the same code with some extra
if (use_ww_ctx) in the common locking and unlocking path.
Note that the "use_ww_ctx" parameter is defined at compile-time so the
ordinary mutex path (binary) shouldn't change at all after optimization
but I need to verify that, of course.
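
For readers unfamiliar with the pattern, the following is a small illustration of why a compile-time-constant "use_ww_ctx" argument should cost the plain mutex path nothing. It is a single-threaded toy, not the mutex code: the sketch_* names are hypothetical, there is no real synchronization, and __attribute__((always_inline)) is a GCC/Clang extension standing in for the kernel's __always_inline. Because each wrapper passes a literal true or false into a force-inlined helper, the optimizer can drop the untaken branch from each wrapper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical acquire context: just counts the locks taken through it. */
struct sketch_ctx {
	unsigned int acquired;
};

/* Hypothetical lock; NOT thread-safe, it only illustrates the branching. */
struct sketch_lock {
	bool locked;
	struct sketch_ctx *ctx; /* set only on the ww path */
};

/*
 * Common path. use_ww_ctx is a constant at every call site, so after
 * inlining the compiler can remove the "if (use_ww_ctx)" block entirely
 * from the plain-mutex wrapper.
 */
static inline __attribute__((always_inline)) bool
sketch_lock_common(struct sketch_lock *l, struct sketch_ctx *ctx,
		   const bool use_ww_ctx)
{
	if (l->locked)
		return false;
	l->locked = true;

	if (use_ww_ctx) {
		l->ctx = ctx;
		ctx->acquired++;
	}
	return true;
}

/* Plain mutex flavour: the ww-specific code is compiled out. */
static inline bool sketch_mutex_trylock(struct sketch_lock *l)
{
	return sketch_lock_common(l, NULL, false);
}

/* ww flavour: same common code with the extra bookkeeping kept. */
static inline bool sketch_ww_mutex_trylock(struct sketch_lock *l,
					   struct sketch_ctx *ctx)
{
	return sketch_lock_common(l, ctx, true);
}
```

Whether the generated code for the ordinary path really is unchanged would still have to be checked by comparing the object code before and after, as noted above.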
What you can't do with such a change is to lock / unlock a ww_mutex
using the standard mutex API, like mutex_lock(&ww_mutex->base), but I
guess that would be OK?
If we can't mix them I'm kinda leaning towards a
ww_batch_mutex/ww_batch_acquire_ctx, but exactly matching api
otherwise. We probably do need the new batch_start/end api, since
ww_acquire_done isn't quite the right place ...
I'll see if I get time to put together an RFC.
Yeah I think there's definitely some use for batched ww locks, where
parallelism is generally low, or at least the ratio between "time
spent acquiring locks" and "time spent doing stuff while holding
locks" small enough to not make the reduced parallelism while
acquiring an issue.
Yes. At least it's worth bringing up for discussion. The reduced
parallelism shouldn't be an issue if per-process batch-groups are used,
or if, as for vmwgfx, the command submission itself is serialized due to
a single FIFO.