Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-07-09 Thread Daniel Vetter
On Thu, Jul 09, 2020 at 08:29:21AM +0100, Daniel Stone wrote:
> Hi,
> Jumping in after a couple of weeks where I've paged most everything
> out of my brain ...
> 
> On Fri, 19 Jun 2020 at 10:43, Daniel Vetter  wrote:
> > On Fri, Jun 19, 2020 at 10:13:35AM +0100, Chris Wilson wrote:
> > > > The proposed patches might very well encode the wrong contract, that's
> > > > all up for discussion. But fundamentally questioning that we need one
> > > > is missing what upstream is all about.
> > >
> > > Then I have not clearly communicated, as my opinion is not that
> > > validation is worthless, but that the implementation is enshrining a
> > > global property on a low level primitive that prevents it from being
> > > used elsewhere. And I want to replace completion [chains] with fences, and
> > > bio with fences, and closures with fences, and what other equivalencies
> > > there are in the kernel. The fence is as central a locking construct as
> > > struct completion and deserves to be a foundational primitive provided
> > > by kernel/ used throughout all drivers for discrete problem domains.
> > >
> > > This is a narrowing of dma_fence, whereby adding
> > >   struct lockdep_map *dma_fence::wait_map
> > > and annotating linkage allows you to continue to specify that all
> > > dma_fence used for a particular purpose must follow common rules,
> > > without restricting the primitive for uses outside of this scope.
> >
> > Somewhere else in this thread I had discussions with Jason Gunthorpe about
> > this topic. It might maybe change somewhat depending upon exact rules, but
> > his take is very much "I don't want dma_fence in rdma". Or pretty close to
> > that at least.
> >
> > Similar discussions with habanalabs, they're using dma_fence internally
> > without any of the uapi. Discussion there has also now concluded that it's
> > best if they remove them, and simply switch over to a wait_queue or
> > completion like every other driver does.
> >
> > The next round of the patches already has a paragraph to at least
> > somewhat limit how non-gpu drivers use dma_fence. And I guess actual
> > consensus might be pointing even more strongly at dma_fence being solely
> > something for gpus and closely related subsystems (maybe media) for syncing
> > dma-buf access.
> >
> > So dma_fence as a general replacement for completion chains I think just
> > won't happen.
> >
> > What might make sense is if e.g. the lockdep annotations could be reused,
> > at least in design, for wait_queue or completion or anything else
> > really. I do think that has a fair chance compared to the automagic
> > cross-release annotations approach, which relied way too heavily on
> > guessing where barriers are. My experience from just a bit of playing
> > around with these patches here and discussing them with other driver
> > maintainers is that accurately deciding where critical sections start and
> > end is a job for humans only. And if you get it wrong, you will have a
> > false positive.
> >
> > And you're indeed correct that if we'd do annotations for completions and
> > wait queues, then that would need to have a class per semantically
> > equivalent user, like we have lockdep classes for mutexes, not just one
> > overall.
> >
> > But dma_fence otoh is something very specific, which comes with very
> > specific rules attached - it's not a generic wait_queue at all. Originally
> > it did start out as one even, but it is a very specialized wait_queue.
> >
> > So there's imo two cases:
> >
> > - Your completion is entirely orthogonal to dma_fences, and can never ever
> >   block a dma_fence. Don't use dma_fence for this, and no problem. It's
> >   just another wait_queue somewhere.
> >
> > - Your completion can eventually, maybe through lots of convolutions and
> >   dependencies, block a dma_fence. In that case full dma_fence rules apply,
> >   and the only thing you can do with a custom annotation is make the rules
> >   even stricter. E.g. if a sub-timeline in the scheduler isn't allowed to
> >   take certain scheduler locks. But the userspace-visible/published fences
> >   do take them, maybe as part of command submission or retirement.
> >   Entirely hypothetical, no idea whether any driver actually needs this.
> 
> I don't claim to understand the implementation of i915's scheduler and
> GEM handling, and it seems like there's some public context missing
> here. But to me, the above is a good statement of what I (and a lot of
> other userspace) have been relying on - that dma-fence is a very
> tightly scoped thing which is very predictable but in extremis.
> 
> It would be great to have something like this enshrined in dma-fence
> documentation, visible to both kernel and external users. The
> properties we've so far been assuming for the graphics pipeline -
> covering production & execution of vertex/fragment workloads on the
> GPU, framebuffer display, and to the extent this is necessary
> involving compute - are something like this:
> 
> A single 

Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-07-09 Thread Daniel Stone
Hi,
Jumping in after a couple of weeks where I've paged most everything
out of my brain ...

On Fri, 19 Jun 2020 at 10:43, Daniel Vetter  wrote:
> On Fri, Jun 19, 2020 at 10:13:35AM +0100, Chris Wilson wrote:
> > > The proposed patches might very well encode the wrong contract, that's
> > > all up for discussion. But fundamentally questioning that we need one
> > > is missing what upstream is all about.
> >
> > Then I have not clearly communicated, as my opinion is not that
> > validation is worthless, but that the implementation is enshrining a
> > global property on a low level primitive that prevents it from being
> > used elsewhere. And I want to replace completion [chains] with fences, and
> > bio with fences, and closures with fences, and what other equivalencies
> > there are in the kernel. The fence is as central a locking construct as
> > struct completion and deserves to be a foundational primitive provided
> > by kernel/ used throughout all drivers for discrete problem domains.
> >
> > This is a narrowing of dma_fence, whereby adding
> >   struct lockdep_map *dma_fence::wait_map
> > and annotating linkage allows you to continue to specify that all
> > dma_fence used for a particular purpose must follow common rules,
> > without restricting the primitive for uses outside of this scope.
>
> Somewhere else in this thread I had discussions with Jason Gunthorpe about
> this topic. It might maybe change somewhat depending upon exact rules, but
> his take is very much "I don't want dma_fence in rdma". Or pretty close to
> that at least.
>
> Similar discussions with habanalabs, they're using dma_fence internally
> without any of the uapi. Discussion there has also now concluded that it's
> best if they remove them, and simply switch over to a wait_queue or
> completion like every other driver does.
>
> The next round of the patches already has a paragraph to at least
> somewhat limit how non-gpu drivers use dma_fence. And I guess actual
> consensus might be pointing even more strongly at dma_fence being solely
> something for gpus and closely related subsystems (maybe media) for syncing
> dma-buf access.
>
> So dma_fence as a general replacement for completion chains I think just
> won't happen.
>
> What might make sense is if e.g. the lockdep annotations could be reused,
> at least in design, for wait_queue or completion or anything else
> really. I do think that has a fair chance compared to the automagic
> cross-release annotations approach, which relied way too heavily on
> guessing where barriers are. My experience from just a bit of playing
> around with these patches here and discussing them with other driver
> maintainers is that accurately deciding where critical sections start and
> end is a job for humans only. And if you get it wrong, you will have a
> false positive.
>
> And you're indeed correct that if we'd do annotations for completions and
> wait queues, then that would need to have a class per semantically
> equivalent user, like we have lockdep classes for mutexes, not just one
> overall.
>
> But dma_fence otoh is something very specific, which comes with very
> specific rules attached - it's not a generic wait_queue at all. Originally
> it did start out as one even, but it is a very specialized wait_queue.
>
> So there's imo two cases:
>
> - Your completion is entirely orthogonal to dma_fences, and can never ever
>   block a dma_fence. Don't use dma_fence for this, and no problem. It's
>   just another wait_queue somewhere.
>
> - Your completion can eventually, maybe through lots of convolutions and
>   dependencies, block a dma_fence. In that case full dma_fence rules apply,
>   and the only thing you can do with a custom annotation is make the rules
>   even stricter. E.g. if a sub-timeline in the scheduler isn't allowed to
>   take certain scheduler locks. But the userspace-visible/published fences
>   do take them, maybe as part of command submission or retirement.
>   Entirely hypothetical, no idea whether any driver actually needs this.

I don't claim to understand the implementation of i915's scheduler and
GEM handling, and it seems like there's some public context missing
here. But to me, the above is a good statement of what I (and a lot of
other userspace) have been relying on - that dma-fence is a very
tightly scoped thing which is very predictable but in extremis.

It would be great to have something like this enshrined in dma-fence
documentation, visible to both kernel and external users. The
properties we've so far been assuming for the graphics pipeline -
covering production & execution of vertex/fragment workloads on the
GPU, framebuffer display, and to the extent this is necessary
involving compute - are something like this:

A single dma-fence with no dependencies represents (the tail of) a
unit of work, which has been all but committed to the hardware. Once
committed to the hardware, this work will complete (successfully or in
error) in bounded time. The unit of work referred to 

Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-22 Thread Daniel Vetter
On Fri, Jun 19, 2020 at 3:12 PM Chris Wilson  wrote:
>
> Quoting Daniel Vetter (2020-06-19 10:43:09)
> > On Fri, Jun 19, 2020 at 10:13:35AM +0100, Chris Wilson wrote:
> > > Quoting Daniel Vetter (2020-06-19 09:51:59)
> > > > On Fri, Jun 19, 2020 at 10:25 AM Chris Wilson 
> > > >  wrote:
> > > > > Forcing a generic primitive to always be part of the same global map 
> > > > > is
> > > > > horrible.
> > > >
> > > > And no concrete example or reason why that's not possible.
> > > > Because frankly it's not horrible, this is what upstream is all about:
> > > > Shared concepts, shared contracts, shared code.
> > > >
> > > > The proposed patches might very well encode the wrong contract, that's
> > > > all up for discussion. But fundamentally questioning that we need one
> > > > is missing what upstream is all about.
> > >
> > > Then I have not clearly communicated, as my opinion is not that
> > > validation is worthless, but that the implementation is enshrining a
> > > global property on a low level primitive that prevents it from being
> > > used elsewhere. And I want to replace completion [chains] with fences, and
> > > bio with fences, and closures with fences, and what other equivalencies
> > > there are in the kernel. The fence is as central a locking construct as
> > > struct completion and deserves to be a foundational primitive provided
> > > by kernel/ used throughout all drivers for discrete problem domains.
> > >
> > > This is a narrowing of dma_fence, whereby adding
> > >   struct lockdep_map *dma_fence::wait_map
> > > and annotating linkage allows you to continue to specify that all
> > > dma_fence used for a particular purpose must follow common rules,
> > > without restricting the primitive for uses outside of this scope.
> >
> > Somewhere else in this thread I had discussions with Jason Gunthorpe about
> > this topic. It might maybe change somewhat depending upon exact rules, but
> > his take is very much "I don't want dma_fence in rdma". Or pretty close to
> > that at least.
> >
> > Similar discussions with habanalabs, they're using dma_fence internally
> > without any of the uapi. Discussion there has also now concluded that it's
> > best if they remove them, and simply switch over to a wait_queue or
> > completion like every other driver does.
> >
> > The next round of the patches already has a paragraph to at least
> > somewhat limit how non-gpu drivers use dma_fence. And I guess actual
> > consensus might be pointing even more strongly at dma_fence being solely
> > something for gpus and closely related subsystems (maybe media) for syncing
> > dma-buf access.
> >
> > So dma_fence as a general replacement for completion chains I think just
> > won't happen.
>
> That is sad. I cannot comprehend going back to pure completions after a
> taste of fence scheduling. And we are not even close to fully utilising
> them, as not all the async cpu [allocation!] tasks are fully tracked by
> fences yet and are still stuck in a FIFO workqueue.
>
> > What might make sense is if e.g. the lockdep annotations could be reused,
> > at least in design, for wait_queue or completion or anything else
> > really. I do think that has a fair chance compared to the automagic
> > cross-release annotations approach, which relied way too heavily on
> > guessing where barriers are. My experience from just a bit of playing
> > around with these patches here and discussing them with other driver
> > maintainers is that accurately deciding where critical sections start and
> > end is a job for humans only. And if you get it wrong, you will have a
> > false positive.
> >
> > And you're indeed correct that if we'd do annotations for completions and
> > wait queues, then that would need to have a class per semantically
> > equivalent user, like we have lockdep classes for mutexes, not just one
> > overall.
> >
> > But dma_fence otoh is something very specific, which comes with very
> > specific rules attached - it's not a generic wait_queue at all. Originally
> > it did start out as one even, but it is a very specialized wait_queue.
> >
> > So there's imo two cases:
> >
> > - Your completion is entirely orthogonal to dma_fences, and can never ever
> >   block a dma_fence. Don't use dma_fence for this, and no problem. It's
> >   just another wait_queue somewhere.
> >
> > - Your completion can eventually, maybe through lots of convolutions and
> >   dependencies, block a dma_fence. In that case full dma_fence rules apply,
> >   and the only thing you can do with a custom annotation is make the rules
> >   even stricter. E.g. if a sub-timeline in the scheduler isn't allowed to
> >   take certain scheduler locks. But the userspace-visible/published fences
> >   do take them, maybe as part of command submission or retirement.
> >   Entirely hypothetical, no idea whether any driver actually needs this.
>
> I think we are faced with this very real problem.
>
> The papering we have today over userptr is so very thin, and if you
> squint you can 

Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-19 Thread Chris Wilson
Quoting Daniel Vetter (2020-06-19 10:43:09)
> On Fri, Jun 19, 2020 at 10:13:35AM +0100, Chris Wilson wrote:
> > Quoting Daniel Vetter (2020-06-19 09:51:59)
> > > On Fri, Jun 19, 2020 at 10:25 AM Chris Wilson  
> > > wrote:
> > > > Forcing a generic primitive to always be part of the same global map is
> > > > horrible.
> > > 
> > > And no concrete example or reason why that's not possible.
> > > Because frankly it's not horrible, this is what upstream is all about:
> > > Shared concepts, shared contracts, shared code.
> > > 
> > > The proposed patches might very well encode the wrong contract, that's
> > > all up for discussion. But fundamentally questioning that we need one
> > > is missing what upstream is all about.
> > 
> > Then I have not clearly communicated, as my opinion is not that
> > validation is worthless, but that the implementation is enshrining a
> > global property on a low level primitive that prevents it from being
> > used elsewhere. And I want to replace completion [chains] with fences, and
> > bio with fences, and closures with fences, and what other equivalencies
> > there are in the kernel. The fence is as central a locking construct as
> > struct completion and deserves to be a foundational primitive provided
> > by kernel/ used throughout all drivers for discrete problem domains.
> > 
> > This is a narrowing of dma_fence, whereby adding
> >   struct lockdep_map *dma_fence::wait_map
> > and annotating linkage allows you to continue to specify that all
> > dma_fence used for a particular purpose must follow common rules,
> > without restricting the primitive for uses outside of this scope.
> 
> Somewhere else in this thread I had discussions with Jason Gunthorpe about
> this topic. It might maybe change somewhat depending upon exact rules, but
> his take is very much "I don't want dma_fence in rdma". Or pretty close to
> that at least.
> 
> Similar discussions with habanalabs, they're using dma_fence internally
> without any of the uapi. Discussion there has also now concluded that it's
> best if they remove them, and simply switch over to a wait_queue or
> completion like every other driver does.
> 
> The next round of the patches already has a paragraph to at least
> somewhat limit how non-gpu drivers use dma_fence. And I guess actual
> consensus might be pointing even more strongly at dma_fence being solely
> something for gpus and closely related subsystems (maybe media) for syncing
> dma-buf access.
>
> So dma_fence as a general replacement for completion chains I think just
> won't happen.

That is sad. I cannot comprehend going back to pure completions after a
taste of fence scheduling. And we are not even close to fully utilising
them, as not all the async cpu [allocation!] tasks are fully tracked by
fences yet and are still stuck in a FIFO workqueue.
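
To make "fully tracked by fences" concrete: wrapping one of those async cpu
tasks in a fence looks roughly like the sketch below. Purely illustrative,
every driver-side name here is invented; only the dma_fence and workqueue
calls are the real API.

  #include <linux/dma-fence.h>
  #include <linux/spinlock.h>
  #include <linux/workqueue.h>

  /* An async CPU task (e.g. an allocation step) wrapped in a fence, so that
   * later work can depend on it instead of it only being ordered by a FIFO
   * workqueue. */
  struct cpu_task_fence {
          struct dma_fence base;
          spinlock_t lock;                /* protects the embedded fence */
          struct work_struct work;        /* the actual CPU-side task */
  };

  static const char *cpu_task_driver_name(struct dma_fence *f)
  {
          return "example-driver";
  }

  static const char *cpu_task_timeline_name(struct dma_fence *f)
  {
          return "cpu-task";
  }

  static const struct dma_fence_ops cpu_task_fence_ops = {
          .get_driver_name = cpu_task_driver_name,
          .get_timeline_name = cpu_task_timeline_name,
  };

  static void cpu_task_func(struct work_struct *work)
  {
          struct cpu_task_fence *f =
                  container_of(work, struct cpu_task_fence, work);

          /* ... do the allocation / CPU work here ... */

          dma_fence_signal(&f->base);     /* the dependency is now resolved */
          dma_fence_put(&f->base);        /* drop the worker's reference */
  }

  static struct dma_fence *cpu_task_queue(struct workqueue_struct *wq,
                                          struct cpu_task_fence *f,
                                          u64 context, u64 seqno)
  {
          spin_lock_init(&f->lock);
          dma_fence_init(&f->base, &cpu_task_fence_ops, &f->lock,
                         context, seqno);
          INIT_WORK(&f->work, cpu_task_func);

          dma_fence_get(&f->base);        /* reference held by the worker */
          queue_work(wq, &f->work);

          return &f->base;                /* callers keep the initial reference */
  }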

> What might make sense is if e.g. the lockdep annotations could be reused,
> at least in design, for wait_queue or completion or anything else
> really. I do think that has a fair chance compared to the automagic
> cross-release annotations approach, which relied way too heavily on
> guessing where barriers are. My experience from just a bit of playing
> around with these patches here and discussing them with other driver
> maintainers is that accurately deciding where critical sections start and
> end is a job for humans only. And if you get it wrong, you will have a
> false positive.
> 
> And you're indeed correct that if we'd do annotations for completions and
> wait queues, then that would need to have a class per semantically
> equivalent user, like we have lockdep classes for mutexes, not just one
> overall.
> 
> But dma_fence otoh is something very specific, which comes with very
> specific rules attached - it's not a generic wait_queue at all. Originally
> it did start out as one even, but it is a very specialized wait_queue.
> 
> So there's imo two cases:
> 
> - Your completion is entirely orthogonal to dma_fences, and can never ever
>   block a dma_fence. Don't use dma_fence for this, and no problem. It's
>   just another wait_queue somewhere.
>
> - Your completion can eventually, maybe through lots of convolutions and
>   dependencies, block a dma_fence. In that case full dma_fence rules apply,
>   and the only thing you can do with a custom annotation is make the rules
>   even stricter. E.g. if a sub-timeline in the scheduler isn't allowed to
>   take certain scheduler locks. But the userspace-visible/published fences
>   do take them, maybe as part of command submission or retirement.
>   Entirely hypothetical, no idea whether any driver actually needs this.

I think we are faced with this very real problem.

The papering we have today over userptr is so very thin, and if you
squint you can already see it is coupled into the completion signal. It
just happens to be on the other side of the fence.

The next batch of priority inversions involves integrating the async cpu
tasks into the scheduler, and have full dependency 

Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-19 Thread Daniel Vetter
On Fri, Jun 19, 2020 at 10:13:35AM +0100, Chris Wilson wrote:
> Quoting Daniel Vetter (2020-06-19 09:51:59)
> > On Fri, Jun 19, 2020 at 10:25 AM Chris Wilson  
> > wrote:
> > > Forcing a generic primitive to always be part of the same global map is
> > > horrible.
> > 
> > And no concrete example or reason why that's not possible.
> > Because frankly it's not horrible, this is what upstream is all about:
> > Shared concepts, shared contracts, shared code.
> > 
> > The proposed patches might very well encode the wrong contract, that's
> > all up for discussion. But fundamentally questioning that we need one
> > is missing what upstream is all about.
> 
> Then I have not clearly communicated, as my opinion is not that
> validation is worthless, but that the implementation is enshrining a
> global property on a low level primitive that prevents it from being
> used elsewhere. And I want to replace completion [chains] with fences, and
> bio with fences, and closures with fences, and what other equivalencies
> there are in the kernel. The fence is as central a locking construct as
> struct completion and deserves to be a foundational primitive provided
> by kernel/ used throughout all drivers for discrete problem domains.
> 
> This is a narrowing of dma_fence, whereby adding
>   struct lockdep_map *dma_fence::wait_map
> and annotating linkage allows you to continue to specify that all
> dma_fence used for a particular purpose must follow common rules,
> without restricting the primitive for uses outside of this scope.

Somewhere else in this thread I had discussions with Jason Gunthorpe about
this topic. It might maybe change somewhat depending upon exact rules, but
his take is very much "I don't want dma_fence in rdma". Or pretty close to
that at least.

Similar discussions with habanalabs, they're using dma_fence internally
without any of the uapi. Discussion there has also now concluded that it's
best if they remove them, and simply switch over to a wait_queue or
completion like every other driver does.

The next round of the patches already has a paragraph to at least
somewhat limit how non-gpu drivers use dma_fence. And I guess actual
consensus might be pointing even more strongly at dma_fence being solely
something for gpus and closely related subsystems (maybe media) for syncing
dma-buf access.

So dma_fence as a general replacement for completion chains I think just
won't happen.

What might make sense is if e.g. the lockdep annotations could be reused,
at least in design, for wait_queue or completion or anything else
really. I do think that has a fair chance compared to the automagic
cross-release annotations approach, which relied way too heavily on
guessing where barriers are. My experience from just a bit of playing
around with these patches here and discussing them with other driver
maintainers is that accurately deciding where critical sections start and
end is a job for humans only. And if you get it wrong, you will have a
false positive.

And you're indeed correct that if we'd do annotations for completions and
wait queues, then that would need to have a class per semantically
equivalent user, like we have lockdep classes for mutexes, not just one
overall.

But dma_fence otoh is something very specific, which comes with very
specific rules attached - it's not a generic wait_queue at all. Originally
it did start out as one even, but it is a very specialized wait_queue.

So there's imo two cases:

- Your completion is entirely orthogonal to dma_fences, and can never ever
  block a dma_fence. Don't use dma_fence for this, and no problem. It's
  just another wait_queue somewhere.

- Your completion can eventually, maybe through lots of convolutions and
  dependencies, block a dma_fence. In that case full dma_fence rules apply,
  and the only thing you can do with a custom annotation is make the rules
  even stricter. E.g. if a sub-timeline in the scheduler isn't allowed to
  take certain scheduler locks. But the userspace-visible/published fences
  do take them, maybe as part of command submission or retirement.
  Entirely hypothetical, no idea whether any driver actually needs this.
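
To spell out the deadlock shape the second case is about (this is the same
example as in the patch's commit message, just with the concrete calls
written out; the mutex and the fences are placeholders):

        /* Thread A: the path that eventually signals the published fence. */
        mutex_lock(&A);
        /* ... */
        mutex_unlock(&A);

        dma_fence_signal(fence);

        /* Thread B: waits for that fence with A held.  B blocks on the
         * fence, and the fence can only be signalled once thread A gets
         * the mutex that B is holding: deadlock.  This is exactly the
         * pattern the annotations are meant to catch. */
        mutex_lock(&A);
        dma_fence_wait(fence, false);
        mutex_unlock(&A);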

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-19 Thread Chris Wilson
Quoting Daniel Vetter (2020-06-19 09:51:59)
> On Fri, Jun 19, 2020 at 10:25 AM Chris Wilson  
> wrote:
> > Forcing a generic primitive to always be part of the same global map is
> > horrible.
> 
> And no concrete example or reason why that's not possible.
> Because frankly it's not horrible, this is what upstream is all about:
> Shared concepts, shared contracts, shared code.
> 
> The proposed patches might very well encode the wrong contract, that's
> all up for discussion. But fundamentally questioning that we need one
> is missing what upstream is all about.

Then I have not clearly communicated, as my opinion is not that
validation is worthless, but that the implementation is enshrining a
global property on a low level primitive that prevents it from being
used elsewhere. And I want to replace completion [chains] with fences, and
bio with fences, and closures with fences, and what other equivalencies
there are in the kernel. The fence is as central a locking construct as
struct completion and deserves to be a foundational primitive provided
by kernel/ used throughout all drivers for discrete problem domains.

This is a narrowing of dma_fence, whereby adding
  struct lockdep_map *dma_fence::wait_map
and annotating linkage allows you to continue to specify that all
dma_fence used for a particular purpose must follow common rules,
without restricting the primitive for uses outside of this scope.
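
Roughly, the shape of that is sketched below. To be clear this is only an
illustration: the per-fence wait_map member does not exist today and the
class name is invented; only lockdep's own helpers and dma_fence_wait()
are real.

  #include <linux/dma-fence.h>
  #include <linux/lockdep.h>

  /* One lockdep map per semantically equivalent group of fences, rather
   * than a single map shared by every dma_fence in the system. */
  static struct lockdep_map timeline_fence_wait_map =
          STATIC_LOCKDEP_MAP_INIT("timeline-fence", &timeline_fence_wait_map);

  static long annotated_fence_wait(struct dma_fence *fence, bool intr)
  {
          /* Hypothetical: fence->wait_map would be assigned at fence
           * creation (defaulting to one shared map to keep today's global
           * behaviour) and names the class to validate the wait against. */
          struct lockdep_map *map = &timeline_fence_wait_map;

          lock_map_acquire(map);
          lock_map_release(map);

          return dma_fence_wait(fence, intr);
  }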
-Chris


Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-19 Thread Daniel Vetter
On Fri, Jun 19, 2020 at 10:25 AM Chris Wilson  wrote:
>
> Quoting Daniel Stone (2020-06-11 10:01:46)
> > Hi,
> >
> > On Thu, 11 Jun 2020 at 09:44, Dave Airlie  wrote:
> > > On Thu, 11 Jun 2020 at 18:01, Chris Wilson  
> > > wrote:
> > > > Introducing a global lockmap that cannot capture the rules correctly,
> > >
> > > Can you document the rules all drivers should be following then,
> > > because from here it looks to get refactored every version of i915,
> > > and it would be nice if we could all aim for the same set of things
> > > roughly. We've already had enough problems with amdgpu vs i915 vs
> > > everyone else with fences; if this stops that in the future then I'd
> > > rather we have that than just some unwritten, untestable rules per
> > > driver.
> >
> > As someone who has sunk a bunch of work into explicit-fencing
> > awareness in my compositor so I can never be blocked, I'd be
> > disappointed if the infrastructure was ultimately pointless because
> > the documented fencing rules were ¯\_(ツ)_/¯ or thereabouts. Lockdep
> > definitely isn't my area of expertise so I can't comment on the patch
> > per se, but having something to ensure we don't hit deadlocks sure
> > seems a lot better than nothing.
>
> This is doing dependency analysis on execution contexts which is a far
> cry from doing the fence dependency analysis, and so has to actively
> ignore the cycles that must exist on the dma side, and also the cycles
> that prevent entering execution contexts on the CPU. It has to actively
> ignore scheduler execution contexts, for lockdep cries, and so we do not
> get analysis of the locking contexts along that path. This would be
> solvable along the lines of extending lockdep ala lockdep_dma_enter().

drm/scheduler is annotated; that found some issues which are rather
improbable to hit in practice. But from the quick chat I've had with König
and others I think he agrees that they're real, at least in the theoretical
sense. You should probably consider playing the lottery if you hit one in
practice though :-)

> Had i915's execution flow been marked up, it should have found the
> dubious wait for external fences inside the dead GPU recovery, and
> probably found a few more things to complain about with the reset locking.
> [Note we already do the same annotations for wait-vs-reset, but not
> reset-vs-execution.]

I know it splats, that's why the tdr annotation patch comes with a
spec proposal for lifting the wait busting we do in i915 to the
dma_fence level. I included that because amdgpu has the same problem
on modern hw. Apparently their planned fix (because they've hit this
bug in testing) was to push some shared lock down into their
atomic_commit_tail function and use that in gpu reset, so they don't seem
that interested in extending dma_fence.

For i915 it's just gen2/3 display, and cross-driver dma-buf/fence
usage for those is nil and won't change. Pragmatic solution imo would
be to just not annotate gpu reset on these platforms, and rely on
our wait busting plus igt tests to make sure it keeps working as-is.
The point of the explicit annotations for the signalling side is very
much that it can be rolled out gradually, and entirely left out for
old legacy paths that aren't worth fixing.
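
For reference, annotating a signalling path looks roughly like the sketch
below. This assumes the begin/end helpers behave as described in the patch
(begin returns a nesting cookie that end takes back); the fence and lock
names and the hw details are made up.

        bool cookie;

        /* out_fence has just been published (dma_resv, sync_file, syncobj,
         * ...), so everything from here on is signalling-critical. */
        cookie = dma_fence_begin_signalling();

        /* Waiting on other fences inside the section is fine, as long as
         * no other lock is held across the wait. */
        dma_fence_wait(in_fence, false);

        spin_lock(&ring->lock);         /* recorded as taken while signalling */
        /* ... queue the job to the hardware ... */
        spin_unlock(&ring->lock);

        dma_fence_signal(out_fence);
        dma_fence_end_signalling(cookie);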

> Determination of which waits are legal and which are not is entirely ad
> hoc, for there is no status change tracking in the dependency analysis
> [that is once an execution context is linked to a published fence, again
> integral to lockdep.] Consider if the completion chain in atomic is
> swapped out for the morally equivalent fences along intertwined timelines,
> and so it does a bunch of dma_fence_wait() instead. Why are those waits
> legal despite them being after we have committed to fulfilling the out
> fence? [Why are the waits on and for the GPU legal, since they equally
> block execution flow?]

No need to consider, it's already real and resulted in some pretty
splats until I got the recursion handling right.

> Forcing a generic primitive to always be part of the same global map is
> horrible. You forgo being able to use the primitive for unrelated tasks,
> lose the ability to name particular contexts to gain more informative
> dependency cycle reports from having the explicit linkage. You can add
> wait_map tracking without loss of generality [in less than 10 lines],
> and you can still enforce that all fences used for a common purpose
> follow the same rules [the simplest way being to default to the singular
> wait_map]. But it's the explicitly named execution contexts that are the
> biggest boon to reading the code and reading the lockdep warns.

So one thing that's maybe not clear here: This doesn't track the DAG
of dependencies. Doesn't even try, I'm still faithfully assuming
drivers get that part right. Which is a gap and maybe we should fix
this, but not the goal here.

All this does is validate fences against anything else that might be
going on in the system. E.g. your recursion example for atomic is
handled by just 

Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-19 Thread Chris Wilson
Quoting Daniel Stone (2020-06-11 10:01:46)
> Hi,
> 
> On Thu, 11 Jun 2020 at 09:44, Dave Airlie  wrote:
> > On Thu, 11 Jun 2020 at 18:01, Chris Wilson  wrote:
> > > Introducing a global lockmap that cannot capture the rules correctly,
> >
> > Can you document the rules all drivers should be following then,
> > because from here it looks to get refactored every version of i915,
> > and it would be nice if we could all aim for the same set of things
> > roughly. We've already had enough problems with amdgpu vs i915 vs
> > everyone else with fences; if this stops that in the future then I'd
> > rather we have that than just some unwritten, untestable rules per
> > driver.
> 
> As someone who has sunk a bunch of work into explicit-fencing
> awareness in my compositor so I can never be blocked, I'd be
> disappointed if the infrastructure was ultimately pointless because
> the documented fencing rules were ¯\_(ツ)_/¯ or thereabouts. Lockdep
> definitely isn't my area of expertise so I can't comment on the patch
> per se, but having something to ensure we don't hit deadlocks sure
> seems a lot better than nothing.

This is doing dependency analysis on execution contexts which is a far
cry from doing the fence dependency analysis, and so has to actively
ignore the cycles that must exist on the dma side, and also the cycles
that prevent entering execution contexts on the CPU. It has to actively
ignore scheduler execution contexts, for lockdep cries, and so we do not
get analysis of the locking contexts along that path. This would be
solvable along the lines of extending lockdep ala lockdep_dma_enter().

Had i915's execution flow been marked up, it should have found the
dubious wait for external fences inside the dead GPU recovery, and
probably found a few more things to complain about with the reset locking.
[Note we already do the same annotations for wait-vs-reset, but not
reset-vs-execution.]

Determination of which waits are legal and which are not is entirely ad
hoc, for there is no status change tracking in the dependency analysis
[that is once an execution context is linked to a published fence, again
integral to lockdep.] Consider if the completion chain in atomic is
swapped out for the morally equivalent fences along intertwined timelines,
and so it does a bunch of dma_fence_wait() instead. Why are those waits
legal despite them being after we have committed to fulfilling the out
fence? [Why are the waits on and for the GPU legal, since they equally
block execution flow?]

Forcing a generic primitive to always be part of the same global map is
horrible. You forgo being able to use the primitive for unrelated tasks,
lose the ability to name particular contexts to gain more informative
dependency cycle reports from having the explicit linkage. You can add
wait_map tracking without loss of generality [in less than 10 lines],
and you can still enforce that all fences used for a common purpose
follow the same rules [the simplest way being to default to the singular
wait_map]. But it's the explicitly named execution contexts that are the
biggest boon to reading the code and reading the lockdep warns.

This is a bunch of ad hoc tracking for a very narrow purpose applied
globally, with loss of information.
-Chris


Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-11 Thread Daniel Vetter
On Thu, Jun 11, 2020 at 4:29 PM Tvrtko Ursulin
 wrote:
>
>
> On 11/06/2020 12:29, Daniel Vetter wrote:
> > On Thu, Jun 11, 2020 at 12:36 PM Tvrtko Ursulin
> >  wrote:
> >> On 10/06/2020 16:17, Daniel Vetter wrote:
> >>> On Wed, Jun 10, 2020 at 4:22 PM Tvrtko Ursulin
> >>>  wrote:
> 
> 
>  On 04/06/2020 09:12, Daniel Vetter wrote:
> > Design is similar to the lockdep annotations for workers, but with
> > some twists:
> >
> > - We use a read-lock for the execution/worker/completion side, so that
> >  this explicit annotation can be more liberally sprinkled around.
> >  With read locks lockdep isn't going to complain if the read-side
> >  isn't nested the same way under all circumstances, so ABBA deadlocks
> >  are ok. Which they are, since this is an annotation only.
> >
> > - We're using non-recursive lockdep read lock mode, since in recursive
> >  read lock mode lockdep does not catch read side hazards. And we
> >  _very_ much want read side hazards to be caught. For full details of
> >  this limitation see
> >
> >  commit e91498589746065e3ae95d9a00b068e525eec34f
> >  Author: Peter Zijlstra 
> >  Date:   Wed Aug 23 13:13:11 2017 +0200
> >
> >  locking/lockdep/selftests: Add mixed read-write ABBA tests
> >
> > - To allow nesting of the read-side explicit annotations we explicitly
> >  keep track of the nesting. lock_is_held() allows us to do that.
> >
> > - The wait-side annotation is a write lock, and entirely done within
> >  dma_fence_wait() for everyone by default.
> >
> > - To be able to freely annotate helper functions I want to make it ok
> >  to call dma_fence_begin/end_signalling from soft/hardirq context.
> >  First attempt was using the hardirq locking context for the write
> >  side in lockdep, but this forces all normal spinlocks nested within
> >  dma_fence_begin/end_signalling to be spinlocks. That's bollocks.
> >
> >  The approach now is to simply check in_atomic(), and for these cases
> >  entirely rely on the might_sleep() check in dma_fence_wait(). That
> >  will catch any wrong nesting against spinlocks from soft/hardirq
> >  contexts.
> >
> > The idea here is that every code path that's critical for eventually
> > signalling a dma_fence should be annotated with
> > dma_fence_begin/end_signalling. The annotation ideally starts right
> > after a dma_fence is published (added to a dma_resv, exposed as a
> > sync_file fd, attached to a drm_syncobj fd, or anything else that
> > makes the dma_fence visible to other kernel threads), up to and
> > including the dma_fence_wait(). Examples are irq handlers, the
> > scheduler rt threads, the tail of execbuf (after the corresponding
> > fences are visible), any workers that end up signalling dma_fences and
> > really anything else. Not annotated should be code paths that only
> > complete fences opportunistically as the gpu progresses, like e.g.
> > shrinker/eviction code.
> >
> > The main class of deadlocks this is supposed to catch is:
> >
> > Thread A:
> >
> > mutex_lock(A);
> > mutex_unlock(A);
> >
> > dma_fence_signal();
> >
> > Thread B:
> >
> > mutex_lock(A);
> > dma_fence_wait();
> > mutex_unlock(A);
> >
> > Thread B is blocked on A signalling the fence, but A never gets around
> > to that because it cannot acquire the lock A.
> >
> > Note that dma_fence_wait() is allowed to be nested within
> > dma_fence_begin/end_signalling sections. To allow this to happen the
> > read lock needs to be upgraded to a write lock, which means that if any
> > other lock is acquired between the dma_fence_begin_signalling() call and
> > the call to dma_fence_wait(), and is still held, this will result in an
> > immediate lockdep complaint. The only other option would be to not
> > annotate such calls, defeating the point. Therefore these annotations
> > cannot be sprinkled over the code entirely mindlessly, to avoid false
> > positives.
> >
> > v2: handle soft/hardirq ctx better against write side and don't forget
> > EXPORT_SYMBOL, drivers can't use this otherwise.
> >
> > v3: Kerneldoc.
> >
> > v4: Some spelling fixes from Mika
> >
> > Cc: Mika Kuoppala 
> > Cc: Thomas Hellstrom 
> > Cc: linux-me...@vger.kernel.org
> > Cc: linaro-mm-...@lists.linaro.org
> > Cc: linux-r...@vger.kernel.org
> > Cc: amd-gfx@lists.freedesktop.org
> > Cc: intel-...@lists.freedesktop.org
> > Cc: Chris Wilson 
> > Cc: Maarten Lankhorst 
> > Cc: Christian König 
> > Signed-off-by: Daniel Vetter 
> > ---
> > 

Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-11 Thread Tvrtko Ursulin

On 11/06/2020 12:29, Daniel Vetter wrote:
> On Thu, Jun 11, 2020 at 12:36 PM Tvrtko Ursulin
>  wrote:
>> On 10/06/2020 16:17, Daniel Vetter wrote:
>>> On Wed, Jun 10, 2020 at 4:22 PM Tvrtko Ursulin
>>>  wrote:


 On 04/06/2020 09:12, Daniel Vetter wrote:
> Design is similar to the lockdep annotations for workers, but with
> some twists:
>
> - We use a read-lock for the execution/worker/completion side, so that
>  this explicit annotation can be more liberally sprinkled around.
>  With read locks lockdep isn't going to complain if the read-side
>  isn't nested the same way under all circumstances, so ABBA deadlocks
>  are ok. Which they are, since this is an annotation only.
>
> - We're using non-recursive lockdep read lock mode, since in recursive
>  read lock mode lockdep does not catch read side hazards. And we
>  _very_ much want read side hazards to be caught. For full details of
>  this limitation see
>
>  commit e91498589746065e3ae95d9a00b068e525eec34f
>  Author: Peter Zijlstra 
>  Date:   Wed Aug 23 13:13:11 2017 +0200
>
>  locking/lockdep/selftests: Add mixed read-write ABBA tests
>
> - To allow nesting of the read-side explicit annotations we explicitly
>  keep track of the nesting. lock_is_held() allows us to do that.
>
> - The wait-side annotation is a write lock, and entirely done within
>  dma_fence_wait() for everyone by default.
>
> - To be able to freely annotate helper functions I want to make it ok
>  to call dma_fence_begin/end_signalling from soft/hardirq context.
>  First attempt was using the hardirq locking context for the write
>  side in lockdep, but this forces all normal spinlocks nested within
>  dma_fence_begin/end_signalling to be spinlocks. That's bollocks.
>
>  The approach now is to simply check in_atomic(), and for these cases
>  entirely rely on the might_sleep() check in dma_fence_wait(). That
>  will catch any wrong nesting against spinlocks from soft/hardirq
>  contexts.
>
> The idea here is that every code path that's critical for eventually
> signalling a dma_fence should be annotated with
> dma_fence_begin/end_signalling. The annotation ideally starts right
> after a dma_fence is published (added to a dma_resv, exposed as a
> sync_file fd, attached to a drm_syncobj fd, or anything else that
> makes the dma_fence visible to other kernel threads), up to and
> including the dma_fence_wait(). Examples are irq handlers, the
> scheduler rt threads, the tail of execbuf (after the corresponding
> fences are visible), any workers that end up signalling dma_fences and
> really anything else. Not annotated should be code paths that only
> complete fences opportunistically as the gpu progresses, like e.g.
> shrinker/eviction code.
>
> The main class of deadlocks this is supposed to catch is:
>
> Thread A:
>
> mutex_lock(A);
> mutex_unlock(A);
>
> dma_fence_signal();
>
> Thread B:
>
> mutex_lock(A);
> dma_fence_wait();
> mutex_unlock(A);
>
> Thread B is blocked on A signalling the fence, but A never gets around
> to that because it cannot acquire the lock A.
>
> Note that dma_fence_wait() is allowed to be nested within
> dma_fence_begin/end_signalling sections. To allow this to happen the
> read lock needs to be upgraded to a write lock, which means that if any
> other lock is acquired between the dma_fence_begin_signalling() call and
> the call to dma_fence_wait(), and is still held, this will result in an
> immediate lockdep complaint. The only other option would be to not
> annotate such calls, defeating the point. Therefore these annotations
> cannot be sprinkled over the code entirely mindlessly, to avoid false
> positives.
>
> v2: handle soft/hardirq ctx better against write side and don't forget
> EXPORT_SYMBOL, drivers can't use this otherwise.
>
> v3: Kerneldoc.
>
> v4: Some spelling fixes from Mika
>
> Cc: Mika Kuoppala 
> Cc: Thomas Hellstrom 
> Cc: linux-me...@vger.kernel.org
> Cc: linaro-mm-...@lists.linaro.org
> Cc: linux-r...@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: intel-...@lists.freedesktop.org
> Cc: Chris Wilson 
> Cc: Maarten Lankhorst 
> Cc: Christian König 
> Signed-off-by: Daniel Vetter 
> ---
> Documentation/driver-api/dma-buf.rst |  12 +-
> drivers/dma-buf/dma-fence.c  | 161 +++
> include/linux/dma-fence.h|  12 ++
> 3 files changed, 182 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/driver-api/dma-buf.rst 
> 

Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-11 Thread Daniel Vetter
On Thu, Jun 11, 2020 at 12:36 PM Tvrtko Ursulin
 wrote:
>
>
> On 10/06/2020 16:17, Daniel Vetter wrote:
> > On Wed, Jun 10, 2020 at 4:22 PM Tvrtko Ursulin
> >  wrote:
> >>
> >>
> >> On 04/06/2020 09:12, Daniel Vetter wrote:
> >>> Design is similar to the lockdep annotations for workers, but with
> >>> some twists:
> >>>
> >>> - We use a read-lock for the execution/worker/completion side, so that
> >>> this explicit annotation can be more liberally sprinkled around.
> >>> With read locks lockdep isn't going to complain if the read-side
> >>> isn't nested the same way under all circumstances, so ABBA deadlocks
> >>> are ok. Which they are, since this is an annotation only.
> >>>
> >>> - We're using non-recursive lockdep read lock mode, since in recursive
> >>> read lock mode lockdep does not catch read side hazards. And we
> >>> _very_ much want read side hazards to be caught. For full details of
> >>> this limitation see
> >>>
> >>> commit e91498589746065e3ae95d9a00b068e525eec34f
> >>> Author: Peter Zijlstra 
> >>> Date:   Wed Aug 23 13:13:11 2017 +0200
> >>>
> >>> locking/lockdep/selftests: Add mixed read-write ABBA tests
> >>>
> >>> - To allow nesting of the read-side explicit annotations we explicitly
> >>> keep track of the nesting. lock_is_held() allows us to do that.
> >>>
> >>> - The wait-side annotation is a write lock, and entirely done within
> >>> dma_fence_wait() for everyone by default.
> >>>
> >>> - To be able to freely annotate helper functions I want to make it ok
> >>> to call dma_fence_begin/end_signalling from soft/hardirq context.
> >>> First attempt was using the hardirq locking context for the write
> >>> side in lockdep, but this forces all normal spinlocks nested within
> >>> dma_fence_begin/end_signalling to be spinlocks. That's bollocks.
> >>>
> >>> The approach now is to simply check in_atomic(), and for these cases
> >>> entirely rely on the might_sleep() check in dma_fence_wait(). That
> >>> will catch any wrong nesting against spinlocks from soft/hardirq
> >>> contexts.
> >>>
> >>> The idea here is that every code path that's critical for eventually
> >>> signalling a dma_fence should be annotated with
> >>> dma_fence_begin/end_signalling. The annotation ideally starts right
> >>> after a dma_fence is published (added to a dma_resv, exposed as a
> >>> sync_file fd, attached to a drm_syncobj fd, or anything else that
> >>> makes the dma_fence visible to other kernel threads), up to and
> >>> including the dma_fence_wait(). Examples are irq handlers, the
> >>> scheduler rt threads, the tail of execbuf (after the corresponding
> >>> fences are visible), any workers that end up signalling dma_fences and
> >>> really anything else. Not annotated should be code paths that only
> >>> complete fences opportunistically as the gpu progresses, like e.g.
> >>> shrinker/eviction code.
> >>>
> >>> The main class of deadlocks this is supposed to catch is:
> >>>
> >>> Thread A:
> >>>
> >>>mutex_lock(A);
> >>>mutex_unlock(A);
> >>>
> >>>dma_fence_signal();
> >>>
> >>> Thread B:
> >>>
> >>>mutex_lock(A);
> >>>dma_fence_wait();
> >>>mutex_unlock(A);
> >>>
> >>> Thread B is blocked on A signalling the fence, but A never gets around
> >>> to that because it cannot acquire the lock A.
> >>>
> >>> Note that dma_fence_wait() is allowed to be nested within
> >>> dma_fence_begin/end_signalling sections. To allow this to happen the
> >>> read lock needs to be upgraded to a write lock, which means that if any
> >>> other lock is acquired between the dma_fence_begin_signalling() call and
> >>> the call to dma_fence_wait(), and is still held, this will result in an
> >>> immediate lockdep complaint. The only other option would be to not
> >>> annotate such calls, defeating the point. Therefore these annotations
> >>> cannot be sprinkled over the code entirely mindlessly, to avoid false
> >>> positives.
> >>>
> >>> v2: handle soft/hardirq ctx better against write side and don't forget
> >>> EXPORT_SYMBOL, drivers can't use this otherwise.
> >>>
> >>> v3: Kerneldoc.
> >>>
> >>> v4: Some spelling fixes from Mika
> >>>
> >>> Cc: Mika Kuoppala 
> >>> Cc: Thomas Hellstrom 
> >>> Cc: linux-me...@vger.kernel.org
> >>> Cc: linaro-mm-...@lists.linaro.org
> >>> Cc: linux-r...@vger.kernel.org
> >>> Cc: amd-gfx@lists.freedesktop.org
> >>> Cc: intel-...@lists.freedesktop.org
> >>> Cc: Chris Wilson 
> >>> Cc: Maarten Lankhorst 
> >>> Cc: Christian König 
> >>> Signed-off-by: Daniel Vetter 
> >>> ---
> >>>Documentation/driver-api/dma-buf.rst |  12 +-
> >>>drivers/dma-buf/dma-fence.c  | 161 +++
> >>>include/linux/dma-fence.h|  12 ++
> >>>3 files changed, 182 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/Documentation/driver-api/dma-buf.rst 
> >>> b/Documentation/driver-api/dma-buf.rst
> >>> index 63dec76d1d8d..05d856131140 100644
> 

Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-11 Thread Tvrtko Ursulin

On 10/06/2020 16:17, Daniel Vetter wrote:
> On Wed, Jun 10, 2020 at 4:22 PM Tvrtko Ursulin
>  wrote:
>>
>>
>> On 04/06/2020 09:12, Daniel Vetter wrote:
>>> Design is similar to the lockdep annotations for workers, but with
>>> some twists:
>>>
>>> - We use a read-lock for the execution/worker/completion side, so that
>>> this explicit annotation can be more liberally sprinkled around.
>>> With read locks lockdep isn't going to complain if the read-side
>>> isn't nested the same way under all circumstances, so ABBA deadlocks
>>> are ok. Which they are, since this is an annotation only.
>>>
>>> - We're using non-recursive lockdep read lock mode, since in recursive
>>> read lock mode lockdep does not catch read side hazards. And we
>>> _very_ much want read side hazards to be caught. For full details of
>>> this limitation see
>>>
>>> commit e91498589746065e3ae95d9a00b068e525eec34f
>>> Author: Peter Zijlstra 
>>> Date:   Wed Aug 23 13:13:11 2017 +0200
>>>
>>> locking/lockdep/selftests: Add mixed read-write ABBA tests
>>>
>>> - To allow nesting of the read-side explicit annotations we explicitly
>>> keep track of the nesting. lock_is_held() allows us to do that.
>>>
>>> - The wait-side annotation is a write lock, and entirely done within
>>> dma_fence_wait() for everyone by default.
>>>
>>> - To be able to freely annotate helper functions I want to make it ok
>>> to call dma_fence_begin/end_signalling from soft/hardirq context.
>>> First attempt was using the hardirq locking context for the write
>>> side in lockdep, but this forces all normal spinlocks nested within
>>> dma_fence_begin/end_signalling to be spinlocks. That's bollocks.
>>>
>>> The approach now is to simply check in_atomic(), and for these cases
>>> entirely rely on the might_sleep() check in dma_fence_wait(). That
>>> will catch any wrong nesting against spinlocks from soft/hardirq
>>> contexts.
>>>
>>> The idea here is that every code path that's critical for eventually
>>> signalling a dma_fence should be annotated with
>>> dma_fence_begin/end_signalling. The annotation ideally starts right
>>> after a dma_fence is published (added to a dma_resv, exposed as a
>>> sync_file fd, attached to a drm_syncobj fd, or anything else that
>>> makes the dma_fence visible to other kernel threads), up to and
>>> including the dma_fence_wait(). Examples are irq handlers, the
>>> scheduler rt threads, the tail of execbuf (after the corresponding
>>> fences are visible), any workers that end up signalling dma_fences and
>>> really anything else. Not annotated should be code paths that only
>>> complete fences opportunistically as the gpu progresses, like e.g.
>>> shrinker/eviction code.
>>>
>>> The main class of deadlocks this is supposed to catch is:
>>>
>>> Thread A:
>>>
>>>mutex_lock(A);
>>>mutex_unlock(A);
>>>
>>>dma_fence_signal();
>>>
>>> Thread B:
>>>
>>>mutex_lock(A);
>>>dma_fence_wait();
>>>mutex_unlock(A);
>>>
>>> Thread B is blocked on A signalling the fence, but A never gets around
>>> to that because it cannot acquire the lock A.
>>>
>>> Note that dma_fence_wait() is allowed to be nested within
>>> dma_fence_begin/end_signalling sections. To allow this to happen the
>>> read lock needs to be upgraded to a write lock, which means that if any
>>> other lock is acquired between the dma_fence_begin_signalling() call and
>>> the call to dma_fence_wait(), and is still held, this will result in an
>>> immediate lockdep complaint. The only other option would be to not
>>> annotate such calls, defeating the point. Therefore these annotations
>>> cannot be sprinkled over the code entirely mindlessly, to avoid false
>>> positives.
>>>
>>> v2: handle soft/hardirq ctx better against write side and don't forget
>>> EXPORT_SYMBOL, drivers can't use this otherwise.
>>>
>>> v3: Kerneldoc.
>>>
>>> v4: Some spelling fixes from Mika
>>>
>>> Cc: Mika Kuoppala 
>>> Cc: Thomas Hellstrom 
>>> Cc: linux-me...@vger.kernel.org
>>> Cc: linaro-mm-...@lists.linaro.org
>>> Cc: linux-r...@vger.kernel.org
>>> Cc: amd-gfx@lists.freedesktop.org
>>> Cc: intel-...@lists.freedesktop.org
>>> Cc: Chris Wilson 
>>> Cc: Maarten Lankhorst 
>>> Cc: Christian König 
>>> Signed-off-by: Daniel Vetter 
>>> ---
>>>Documentation/driver-api/dma-buf.rst |  12 +-
>>>drivers/dma-buf/dma-fence.c  | 161 +++
>>>include/linux/dma-fence.h|  12 ++
>>>3 files changed, 182 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/Documentation/driver-api/dma-buf.rst 
>>> b/Documentation/driver-api/dma-buf.rst
>>> index 63dec76d1d8d..05d856131140 100644
>>> --- a/Documentation/driver-api/dma-buf.rst
>>> +++ b/Documentation/driver-api/dma-buf.rst
>>> @@ -100,11 +100,11 @@ CPU Access to DMA Buffer Objects
>>>.. kernel-doc:: drivers/dma-buf/dma-buf.c
>>>   :doc: cpu access
>>>
>>> -Fence Poll Support
>>> -~~

Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-11 Thread Daniel Stone
Hi,

On Thu, 11 Jun 2020 at 09:44, Dave Airlie  wrote:
> On Thu, 11 Jun 2020 at 18:01, Chris Wilson  wrote:
> > Introducing a global lockmap that cannot capture the rules correctly,
>
> Can you document the rules all drivers should be following then,
> because from here it looks to get refactored every version of i915,
> and it would be nice if we could all aim for the same set of things
> roughly. We've already had enough problems with amdgpu vs i915 vs
> everyone else with fences; if this stops that in the future then I'd
> rather we have that than just some unwritten, untestable rules per
> driver.

As someone who has sunk a bunch of work into explicit-fencing
awareness in my compositor so I can never be blocked, I'd be
disappointed if the infrastructure was ultimately pointless because
the documented fencing rules were ¯\_(ツ)_/¯ or thereabouts. Lockdep
definitely isn't my area of expertise so I can't comment on the patch
per se, but having something to ensure we don't hit deadlocks sure
seems a lot better than nothing.

Cheers,
Daniel


Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-10 Thread Daniel Vetter
On Wed, Jun 10, 2020 at 4:22 PM Tvrtko Ursulin
 wrote:
>
>
> On 04/06/2020 09:12, Daniel Vetter wrote:
> > Design is similar to the lockdep annotations for workers, but with
> > some twists:
> >
> > - We use a read-lock for the execution/worker/completion side, so that
> >this explicit annotation can be more liberally sprinkled around.
> >With read locks lockdep isn't going to complain if the read-side
> >isn't nested the same way under all circumstances, so ABBA deadlocks
> >are ok. Which they are, since this is an annotation only.
> >
> > - We're using non-recursive lockdep read lock mode, since in recursive
> >read lock mode lockdep does not catch read side hazards. And we
> >_very_ much want read side hazards to be caught. For full details of
> >this limitation see
> >
> >commit e91498589746065e3ae95d9a00b068e525eec34f
> >Author: Peter Zijlstra 
> >Date:   Wed Aug 23 13:13:11 2017 +0200
> >
> >locking/lockdep/selftests: Add mixed read-write ABBA tests
> >
> > - To allow nesting of the read-side explicit annotations we explicitly
> >keep track of the nesting. lock_is_held() allows us to do that.
> >
> > - The wait-side annotation is a write lock, and entirely done within
> >dma_fence_wait() for everyone by default.
> >
> > - To be able to freely annotate helper functions I want to make it ok
> >to call dma_fence_begin/end_signalling from soft/hardirq context.
> >First attempt was using the hardirq locking context for the write
> >side in lockdep, but this forces all normal spinlocks nested within
> >    dma_fence_begin/end_signalling to be spinlocks. That's bollocks.
> >
> >    The approach now is to simply check in_atomic(), and for these cases
> >entirely rely on the might_sleep() check in dma_fence_wait(). That
> >will catch any wrong nesting against spinlocks from soft/hardirq
> >contexts.
> >
> > The idea here is that every code path that's critical for eventually
> > signalling a dma_fence should be annotated with
> > dma_fence_begin/end_signalling. The annotation ideally starts right
> > after a dma_fence is published (added to a dma_resv, exposed as a
> > sync_file fd, attached to a drm_syncobj fd, or anything else that
> > makes the dma_fence visible to other kernel threads), up to and
> > including the dma_fence_wait(). Examples are irq handlers, the
> > scheduler rt threads, the tail of execbuf (after the corresponding
> > fences are visible), any workers that end up signalling dma_fences and
> > really anything else. Not annotated should be code paths that only
> > complete fences opportunistically as the gpu progresses, like e.g.
> > shrinker/eviction code.
> >
> > The main class of deadlocks this is supposed to catch is:
> >
> > Thread A:
> >
> >   mutex_lock(A);
> >   mutex_unlock(A);
> >
> >   dma_fence_signal();
> >
> > Thread B:
> >
> >   mutex_lock(A);
> >   dma_fence_wait();
> >   mutex_unlock(A);
> >
> > Thread B is blocked on A signalling the fence, but A never gets around
> > to that because it cannot acquire the lock A.
> >
> > Note that dma_fence_wait() is allowed to be nested within
> > dma_fence_begin/end_signalling sections. To allow this to happen the
> > read lock needs to be upgraded to a write lock, which means that if any
> > other lock is acquired between the dma_fence_begin_signalling() call and
> > the call to dma_fence_wait(), and is still held, this will result in an
> > immediate lockdep complaint. The only other option would be to not
> > annotate such calls, defeating the point. Therefore these annotations
> > cannot be sprinkled over the code entirely mindlessly, to avoid false
> > positives.
> >
> > v2: handle soft/hardirq ctx better against write side and don't forget
> > EXPORT_SYMBOL, drivers can't use this otherwise.
> >
> > v3: Kerneldoc.
> >
> > v4: Some spelling fixes from Mika
> >
> > Cc: Mika Kuoppala 
> > Cc: Thomas Hellstrom 
> > Cc: linux-me...@vger.kernel.org
> > Cc: linaro-mm-...@lists.linaro.org
> > Cc: linux-r...@vger.kernel.org
> > Cc: amd-gfx@lists.freedesktop.org
> > Cc: intel-...@lists.freedesktop.org
> > Cc: Chris Wilson 
> > Cc: Maarten Lankhorst 
> > Cc: Christian König 
> > Signed-off-by: Daniel Vetter 
> > ---
> >   Documentation/driver-api/dma-buf.rst |  12 +-
> >   drivers/dma-buf/dma-fence.c          | 161 +++
> >   include/linux/dma-fence.h            |  12 ++
> >   3 files changed, 182 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/driver-api/dma-buf.rst 
> > b/Documentation/driver-api/dma-buf.rst
> > index 63dec76d1d8d..05d856131140 100644
> > --- a/Documentation/driver-api/dma-buf.rst
> > +++ b/Documentation/driver-api/dma-buf.rst
> > @@ -100,11 +100,11 @@ CPU Access to DMA Buffer Objects
> >   .. kernel-doc:: drivers/dma-buf/dma-buf.c
> >  :doc: cpu access
> >
> > -Fence Poll Support
> > -~~~~~~~~~~~~~~~~~~
> > +Implicit Fence Poll Support
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> >   .. 
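
To make the annotation placement described in the commit message above
concrete, here is a minimal sketch of a driver-side completion path using the
dma_fence_begin_signalling()/dma_fence_end_signalling() helpers this patch
adds; the my_job structure and the function names are invented for
illustration only:

  #include <linux/kernel.h>
  #include <linux/dma-fence.h>
  #include <linux/workqueue.h>

  /* Hypothetical driver state: one fence published per submitted job. */
  struct my_job {
          struct work_struct work;
          struct dma_fence *done_fence;   /* already visible to other threads */
  };

  static void my_job_complete_work(struct work_struct *work)
  {
          struct my_job *job = container_of(work, struct my_job, work);
          bool cookie;

          /*
           * Everything up to dma_fence_end_signalling() is signalling
           * critical: no waiting on fences that might depend on this one,
           * no allocations that can recurse into the shrinker, no locks
           * that waiters of this fence may be holding.
           */
          cookie = dma_fence_begin_signalling();

          /* ... write back results, ring doorbells, etc. ... */

          dma_fence_signal(job->done_fence);
          dma_fence_end_signalling(cookie);
  }

Any lock taken between the begin/end calls is recorded by lockdep as nested
inside the global fence-signalling annotation, which is what later lets it
connect this path to waiters elsewhere.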

Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-10 Thread Tvrtko Ursulin


On 04/06/2020 09:12, Daniel Vetter wrote:

Design is similar to the lockdep annotations for workers, but with
some twists:

- We use a read-lock for the execution/worker/completion side, so that
   this explicit annotation can be more liberally sprinkled around.
   With read locks lockdep isn't going to complain if the read-side
   isn't nested the same way under all circumstances, so ABBA deadlocks
   are ok. Which they are, since this is an annotation only.

- We're using non-recursive lockdep read lock mode, since in recursive
   read lock mode lockdep does not catch read side hazards. And we
   _very_ much want read side hazards to be caught. For full details of
   this limitation see

   commit e91498589746065e3ae95d9a00b068e525eec34f
   Author: Peter Zijlstra 
   Date:   Wed Aug 23 13:13:11 2017 +0200

   locking/lockdep/selftests: Add mixed read-write ABBA tests

- To allow nesting of the read-side explicit annotations we explicitly
   keep track of the nesting. lock_is_held() allows us to do that.

- The wait-side annotation is a write lock, and entirely done within
   dma_fence_wait() for everyone by default.

- To be able to freely annotate helper functions I want to make it ok
   to call dma_fence_begin/end_signalling from soft/hardirq context.
   First attempt was using the hardirq locking context for the write
   side in lockdep, but this forces all normal spinlocks nested within
   dma_fence_begin/end_signalling to be spinlocks. That's bollocks.

   The approach now is to simply check in_atomic(), and for these cases
   entirely rely on the might_sleep() check in dma_fence_wait(). That
   will catch any wrong nesting against spinlocks from soft/hardirq
   contexts.

The idea here is that every code path that's critical for eventually
signalling a dma_fence should be annotated with
dma_fence_begin/end_signalling. The annotation ideally starts right
after a dma_fence is published (added to a dma_resv, exposed as a
sync_file fd, attached to a drm_syncobj fd, or anything else that
makes the dma_fence visible to other kernel threads), up to and
including the dma_fence_wait(). Examples are irq handlers, the
scheduler rt threads, the tail of execbuf (after the corresponding
fences are visible), any workers that end up signalling dma_fences and
really anything else. Not annotated should be code paths that only
complete fences opportunistically as the gpu progresses, like e.g.
shrinker/eviction code.

The main class of deadlocks this is supposed to catch is:

Thread A:

mutex_lock(A);
mutex_unlock(A);

dma_fence_signal();

Thread B:

mutex_lock(A);
dma_fence_wait();
mutex_unlock(A);

Thread B is blocked on A signalling the fence, but A never gets around
to that because it cannot acquire the lock A.
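
Spelled out as compilable driver code, that example looks roughly like the
sketch below; obj_lock and the two function names are invented, and the
begin/end_signalling calls are the helpers added by this patch:

  #include <linux/dma-fence.h>
  #include <linux/mutex.h>

  static DEFINE_MUTEX(obj_lock);

  /* Thread A: the path that must eventually signal the fence. */
  static void thread_a_signalling_path(struct dma_fence *fence)
  {
          bool cookie = dma_fence_begin_signalling();

          mutex_lock(&obj_lock);          /* stuck if thread B got here first */
          /* ... */
          mutex_unlock(&obj_lock);

          dma_fence_signal(fence);
          dma_fence_end_signalling(cookie);
  }

  /* Thread B: waits for the fence while holding the same lock. */
  static void thread_b_waiting_path(struct dma_fence *fence)
  {
          mutex_lock(&obj_lock);
          dma_fence_wait(fence, false);   /* never returns: A can't signal */
          mutex_unlock(&obj_lock);
  }

With the annotations in place lockdep sees obj_lock taken inside a signalling
section in thread A and dma_fence_wait() called with obj_lock held in thread
B, so it can report the cycle without the two paths ever having to race at
runtime.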

Note that dma_fence_wait() is allowed to be nested within
dma_fence_begin/end_signalling sections. To allow this to happen the
read lock needs to be upgraded to a write lock, which means that if any
other lock is acquired between the dma_fence_begin_signalling() call and
the call to dma_fence_wait(), and is still held, this will result in an
immediate lockdep complaint. The only other option would be to not
annotate such calls, defeating the point. Therefore these annotations
cannot be sprinkled over the code entirely mindlessly, to avoid false
positives.
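
In other words, something like the following sketch is fine, while keeping
any lock held across the inner wait is not; it reuses the invented obj_lock
from the sketch above, and prereq_fence/my_fence are likewise made-up names:

  /* Allowed nesting: wait on a prerequisite inside a signalling section. */
  static void submit_tail(struct dma_fence *prereq_fence,
                          struct dma_fence *my_fence)
  {
          bool cookie = dma_fence_begin_signalling();

          /* ok: no lock taken since begin_signalling() is still held here */
          dma_fence_wait(prereq_fence, false);

          mutex_lock(&obj_lock);          /* ok: taken and released entirely */
          mutex_unlock(&obj_lock);        /* after the wait, nothing spans it */

          dma_fence_signal(my_fence);
          dma_fence_end_signalling(cookie);
  }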

v2: handle soft/hardirq ctx better against write side and don't forget
EXPORT_SYMBOL, drivers can't use this otherwise.

v3: Kerneldoc.

v4: Some spelling fixes from Mika

Cc: Mika Kuoppala 
Cc: Thomas Hellstrom 
Cc: linux-me...@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
Cc: linux-r...@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-...@lists.freedesktop.org
Cc: Chris Wilson 
Cc: Maarten Lankhorst 
Cc: Christian König 
Signed-off-by: Daniel Vetter 
---
  Documentation/driver-api/dma-buf.rst |  12 +-
  drivers/dma-buf/dma-fence.c          | 161 +++
  include/linux/dma-fence.h            |  12 ++
  3 files changed, 182 insertions(+), 3 deletions(-)

diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index 63dec76d1d8d..05d856131140 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -100,11 +100,11 @@ CPU Access to DMA Buffer Objects
 .. kernel-doc:: drivers/dma-buf/dma-buf.c
    :doc: cpu access
 
-Fence Poll Support
-~~~~~~~~~~~~~~~~~~
+Implicit Fence Poll Support
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. kernel-doc:: drivers/dma-buf/dma-buf.c
-   :doc: fence polling
+   :doc: implicit fence polling
 
 Kernel Functions and Structures Reference
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -133,6 +133,12 @@ DMA Fences
 .. kernel-doc:: drivers/dma-buf/dma-fence.c
    :doc: DMA fences overview
 
+DMA Fence Signalling Annotations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: drivers/dma-buf/dma-fence.c
+   :doc: fence signalling annotation
+
 DMA Fences Functions Reference
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-04 Thread Daniel Vetter
On Thu, Jun 4, 2020 at 11:27 AM Chris Wilson  wrote:
>
> Quoting Daniel Vetter (2020-06-04 10:21:46)
> > On Thu, Jun 4, 2020 at 10:57 AM Thomas Hellström (Intel)
> >  wrote:
> > >
> > >
> > > On 6/4/20 10:12 AM, Daniel Vetter wrote:
> > > ...
> > > > Thread A:
> > > >
> > > >   mutex_lock(A);
> > > >   mutex_unlock(A);
> > > >
> > > >   dma_fence_signal();
> > > >
> > > > Thread B:
> > > >
> > > >   mutex_lock(A);
> > > >   dma_fence_wait();
> > > >   mutex_unlock(A);
> > > >
> > > > Thread B is blocked on A signalling the fence, but A never gets around
> > > > to that because it cannot acquire the lock A.
> > > >
> > > > Note that dma_fence_wait() is allowed to be nested within
> > > > dma_fence_begin/end_signalling sections. To allow this to happen the
> > > > read lock needs to be upgraded to a write lock, which means that if any
> > > > other lock is acquired between the dma_fence_begin_signalling() call and
> > > > the call to dma_fence_wait(), and is still held, this will result in an
> > > > immediate lockdep complaint. The only other option would be to not
> > > > annotate such calls, defeating the point. Therefore these annotations
> > > > cannot be sprinkled over the code entirely mindlessly, to avoid false
> > > > positives.
> > >
> > > Just realized, isn't that example actually a true positive, or at least
> > > a great candidate for a true positive, since if another thread reenters
> > > that signaling path, it will block on that mutex, and the fence would
> > > never be signaled unless there is another signaling path?
> >
> > Not sure I understand fully, but I think the answer is "it's complicated".
>
> See cd8084f91c02 ("locking/lockdep: Apply crossrelease to completions")
>
> dma_fence usage here is nothing but another name for a completion.

Quoting from my previous cover letter:

"I've dragged my feet for years on this, hoping that cross-release lockdep
would do this for us, but well that never really happened unfortunately.
So here we are."

I discussed this with Peter; cross-release not getting in is pretty
final, it seems. The trouble is false positives without explicit
begin/end annotations reviewed by humans - in my experience from just
these few examples you can't guess this stuff by computers, you need
real brains thinking about all the edge cases, and where exactly the
critical section starts and ends. Without that you're just going to
drown in a sea of false positives and yuck.

So yeah I had hopes for cross-release too, unfortunately that was
entirely in vain and a distraction.
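
For comparison, the same shape of problem written with a plain struct
completion, which is what cross-release was meant to catch automatically,
would look like the sketch below; the names are invented, the completion and
mutex calls are stock kernel API:

  #include <linux/completion.h>
  #include <linux/mutex.h>

  static DEFINE_MUTEX(obj_lock);
  static DECLARE_COMPLETION(job_done);

  /* Completer: analogous to the fence-signalling side. */
  static void completer(void)
  {
          mutex_lock(&obj_lock);
          /* ... */
          mutex_unlock(&obj_lock);

          complete(&job_done);
  }

  /* Waiter: holds the same lock across the wait. */
  static void waiter(void)
  {
          mutex_lock(&obj_lock);
          wait_for_completion(&job_done); /* deadlocks if waiter runs first */
          mutex_unlock(&obj_lock);
  }

Plain lockdep is silent here because a completion is not a lock; this patch
closes that gap for dma_fence specifically by making the signalling critical
section an explicit, human-placed annotation instead of trying to infer it.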

Now I guess it would be nice if there were a per-class
completion_begin/end annotation for the more generic problem. But then
most people also don't have a cross-driver completion API contract
like dma_fence has, with some of the most ridiculous over-the-top
constraints on what's possible and what's not possible on each side of
the cross-release. We do have a bit of an outsized benefit (in pain
reduction) vs cost ratio here.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch