[Mesa-dev] NOTICE! Renaming Mesa "master" branch to "main" tomorrow

2021-05-04 Thread Jordan Justen
We plan to proceed with the branch rename mid-day tomorrow, May 5 (US
Pacific time).

If all goes well, this MR will merge,

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10474

and the master branch will be locked from further changes.

The main branch will then be created, and all open merge requests will
be retargeted to the main branch. This will change the "update time" of
all open MRs, but they should remain in the same order, since the oldest
ones will be updated first.

Additional related tasks are being tracked in:

https://gitlab.freedesktop.org/mesa/mesa/-/issues/4501

-Jordan
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Jason Ekstrand
On Tue, May 4, 2021 at 12:16 PM Marek Olšák wrote:
>
> I see some mentions of XNACK and recoverable page faults. Note that all 
> gaming AMD hw that has userspace queues doesn't have XNACK, so there is no 
> overhead in compute units. My understanding is that recoverable page faults 
> are still supported without XNACK, but instead of the compute unit replaying 
> the faulting instruction, the L1 cache does that. Anyway, the point is that 
> XNACK is totally irrelevant here.
>
> Marek
>
> On Tue., May 4, 2021, 08:48 Christian König, 
>  wrote:
>>
>> Am 04.05.21 um 13:13 schrieb Daniel Vetter:
>> > On Tue, May 4, 2021 at 12:53 PM Christian König
>> >  wrote:
>> >> Am 04.05.21 um 11:47 schrieb Daniel Vetter:
>> >>> [SNIP]
>>  Yeah, it just takes to long for the preemption to complete to be really
>>  useful for the feature we are discussing here.
>> 
>>  As I said when the kernel requests to preempt a queue we can easily 
>>  expect a
>>  timeout of ~100ms until that comes back. For compute that is even in the
>>  multiple seconds range.
>> >>> 100ms for preempting an idle request sounds like broken hw to me. Of
>> >>> course preemting something that actually runs takes a while, that's
>> >>> nothing new. But it's also not the thing we're talking about here. Is 
>> >>> this
>> >>> 100ms actual numbers from hw for an actual idle ringbuffer?
>> >> Well 100ms is just an example of the scheduler granularity. Let me
>> >> explain in a wider context.
>> >>
>> >> The hardware can have X queues mapped at the same time and every Y time
>> >> interval the hardware scheduler checks if those queues have changed and
>> >> only if they have changed the necessary steps to reload them are started.
>> >>
>> >> Multiple queues can be rendering at the same time, so you can have X as
>> >> a high priority queue active and just waiting for a signal to start and
>> >> the client rendering one frame after another and a third background
>> >> compute task mining bitcoins for you.
>> >>
>> >> As long as everything is static this is perfectly performant. Adding a
>> >> queue to the list of active queues is also relatively simple, but taking
>> >> one down requires you to wait until we are sure the hardware has seen
>> >> the change and reloaded the queues.
>> >>
>> >> Think of it as an RCU grace period. This is simply not something which
>> >> is made to be used constantly, but rather just at process termination.
>> > Uh ... that indeed sounds rather broken.
>>
>> Well I wouldn't call it broken. It's just not made for the use case we
>> are trying to abuse it for.
>>
>> > Otoh it's just a dma_fence that'd we'd inject as this unload-fence.
>>
>> Yeah, exactly that's why it isn't much of a problem for process
>> termination or freeing memory.
>>
>> > So by and large everyone should already be able to cope with it taking a
>> > bit longer. So from a design pov I don't see a huge problem, but I
>> > guess you guys wont be happy since it means on amd hw there will be
>> > random unsightly stalls in desktop linux usage.
>> >
>>  The "preemption" feature is really called suspend and made just for the 
>>  case
>>  when we want to put a process to sleep or need to forcefully kill it for
>>  misbehavior or stuff like that. It is not meant to be used in normal
>>  operation.
>> 
>>  If we only attach it on ->move then yeah maybe a last resort 
>>  possibility to
>>  do it this way, but I think in that case we could rather stick with 
>>  kernel
>>  submissions.
>> >>> Well this is a hybrid userspace ring + kernel augmeted submit mode, so 
>> >>> you
>> >>> can keep dma-fences working. Because the dma-fence stuff wont work with
>> >>> pure userspace submit, I think that conclusion is rather solid. Once more
>> >>> even after this long thread here.
>> >> When assisted with unload fences, then yes. Problem is that I can't see
>> >> how we could implement those performant currently.
>> > Is there really no way to fix fw here? Like if process start/teardown
>> > takes 100ms, that's going to suck no matter what.
>>
>> As I said adding the queue is unproblematic and teardown just results in
>> a bit more waiting to free things up.
>>
>> Problematic is more overcommit swapping and OOM situations which need to
>> wait for the hw scheduler to come back and tell us that the queue is now
>> unmapped.
>>
>> > Also, if userspace lies to us and keeps pushing crap into the ring
>> > after it's supposed to be idle: Userspace is already allowed to waste
>> > gpu time. If you're too worried about this set a fairly aggressive
>> > preempt timeout on the unload fence, and kill the context if it takes
>> > longer than what preempting an idle ring should take (because that
>> > would indicate broken/evil userspace).
>>  I think you have the wrong expectation here. It is perfectly valid and
>>  expected for userspace to keep writing commands into the ring buffer.

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Marek Olšák
I see some mentions of XNACK and recoverable page faults. Note that all
gaming AMD hw that has userspace queues doesn't have XNACK, so there is no
overhead in compute units. My understanding is that recoverable page faults
are still supported without XNACK, but instead of the compute unit
replaying the faulting instruction, the L1 cache does that. Anyway, the
point is that XNACK is totally irrelevant here.

Marek

On Tue., May 4, 2021, 08:48 Christian König <ckoenig.leichtzumer...@gmail.com> wrote:

> Am 04.05.21 um 13:13 schrieb Daniel Vetter:
> > On Tue, May 4, 2021 at 12:53 PM Christian König
> >  wrote:
> >> Am 04.05.21 um 11:47 schrieb Daniel Vetter:
> >>> [SNIP]
>  Yeah, it just takes to long for the preemption to complete to be
> really
>  useful for the feature we are discussing here.
> 
>  As I said when the kernel requests to preempt a queue we can easily
> expect a
>  timeout of ~100ms until that comes back. For compute that is even in
> the
>  multiple seconds range.
> >>> 100ms for preempting an idle request sounds like broken hw to me. Of
> >>> course preemting something that actually runs takes a while, that's
> >>> nothing new. But it's also not the thing we're talking about here. Is
> this
> >>> 100ms actual numbers from hw for an actual idle ringbuffer?
> >> Well 100ms is just an example of the scheduler granularity. Let me
> >> explain in a wider context.
> >>
> >> The hardware can have X queues mapped at the same time and every Y time
> >> interval the hardware scheduler checks if those queues have changed and
> >> only if they have changed the necessary steps to reload them are
> started.
> >>
> >> Multiple queues can be rendering at the same time, so you can have X as
> >> a high priority queue active and just waiting for a signal to start and
> >> the client rendering one frame after another and a third background
> >> compute task mining bitcoins for you.
> >>
> >> As long as everything is static this is perfectly performant. Adding a
> >> queue to the list of active queues is also relatively simple, but taking
> >> one down requires you to wait until we are sure the hardware has seen
> >> the change and reloaded the queues.
> >>
> >> Think of it as an RCU grace period. This is simply not something which
> >> is made to be used constantly, but rather just at process termination.
> > Uh ... that indeed sounds rather broken.
>
> Well I wouldn't call it broken. It's just not made for the use case we
> are trying to abuse it for.
>
> > Otoh it's just a dma_fence that'd we'd inject as this unload-fence.
>
> Yeah, exactly that's why it isn't much of a problem for process
> termination or freeing memory.
>
> > So by and large everyone should already be able to cope with it taking a
> > bit longer. So from a design pov I don't see a huge problem, but I
> > guess you guys wont be happy since it means on amd hw there will be
> > random unsightly stalls in desktop linux usage.
> >
>  The "preemption" feature is really called suspend and made just for
> the case
>  when we want to put a process to sleep or need to forcefully kill it
> for
>  misbehavior or stuff like that. It is not meant to be used in normal
>  operation.
> 
>  If we only attach it on ->move then yeah maybe a last resort
> possibility to
>  do it this way, but I think in that case we could rather stick with
> kernel
>  submissions.
> >>> Well this is a hybrid userspace ring + kernel augmeted submit mode, so
> you
> >>> can keep dma-fences working. Because the dma-fence stuff wont work with
> >>> pure userspace submit, I think that conclusion is rather solid. Once
> more
> >>> even after this long thread here.
> >> When assisted with unload fences, then yes. Problem is that I can't see
> >> how we could implement those performant currently.
> > Is there really no way to fix fw here? Like if process start/teardown
> > takes 100ms, that's going to suck no matter what.
>
> As I said adding the queue is unproblematic and teardown just results in
> a bit more waiting to free things up.
>
> Problematic is more overcommit swapping and OOM situations which need to
> wait for the hw scheduler to come back and tell us that the queue is now
> unmapped.
>
> > Also, if userspace lies to us and keeps pushing crap into the ring
> > after it's supposed to be idle: Userspace is already allowed to waste
> > gpu time. If you're too worried about this set a fairly aggressive
> > preempt timeout on the unload fence, and kill the context if it takes
> > longer than what preempting an idle ring should take (because that
> > would indicate broken/evil userspace).
>  I think you have the wrong expectation here. It is perfectly valid and
>  expected for userspace to keep writing commands into the ring buffer.
> 
>  After all when one frame is completed they want to immediately start
>  rendering the next one.
> >>> Sure, for the true userspace direct 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Daniel Vetter
On Tue, May 04, 2021 at 02:48:35PM +0200, Christian König wrote:
> Am 04.05.21 um 13:13 schrieb Daniel Vetter:
> > On Tue, May 4, 2021 at 12:53 PM Christian König
> >  wrote:
> > > Am 04.05.21 um 11:47 schrieb Daniel Vetter:
> > > > [SNIP]
> > > > > Yeah, it just takes to long for the preemption to complete to be 
> > > > > really
> > > > > useful for the feature we are discussing here.
> > > > > 
> > > > > As I said when the kernel requests to preempt a queue we can easily 
> > > > > expect a
> > > > > timeout of ~100ms until that comes back. For compute that is even in 
> > > > > the
> > > > > multiple seconds range.
> > > > 100ms for preempting an idle request sounds like broken hw to me. Of
> > > > course preemting something that actually runs takes a while, that's
> > > > nothing new. But it's also not the thing we're talking about here. Is 
> > > > this
> > > > 100ms actual numbers from hw for an actual idle ringbuffer?
> > > Well 100ms is just an example of the scheduler granularity. Let me
> > > explain in a wider context.
> > > 
> > > The hardware can have X queues mapped at the same time and every Y time
> > > interval the hardware scheduler checks if those queues have changed and
> > > only if they have changed the necessary steps to reload them are started.
> > > 
> > > Multiple queues can be rendering at the same time, so you can have X as
> > > a high priority queue active and just waiting for a signal to start and
> > > the client rendering one frame after another and a third background
> > > compute task mining bitcoins for you.
> > > 
> > > As long as everything is static this is perfectly performant. Adding a
> > > queue to the list of active queues is also relatively simple, but taking
> > > one down requires you to wait until we are sure the hardware has seen
> > > the change and reloaded the queues.
> > > 
> > > Think of it as an RCU grace period. This is simply not something which
> > > is made to be used constantly, but rather just at process termination.
> > Uh ... that indeed sounds rather broken.
> 
> Well I wouldn't call it broken. It's just not made for the use case we are
> trying to abuse it for.
> 
> > Otoh it's just a dma_fence that'd we'd inject as this unload-fence.
> 
> Yeah, exactly that's why it isn't much of a problem for process termination
> or freeing memory.

Ok so your hw really hates the unload fence. On ours the various queues
are a bit more explicit, so largely unload/preempt is the same as context
switch and pretty quick. Afaik at least.

Still baffled that you can't fix this in fw, but oh well. Judging from how
fast our fw team moves I'm not surprised :-/

Anyway so next plan: Make this work exactly like hmm:
1. wait for the user fence as a dma-fence fake thing, tdr makes this safe
2. remove pte
3. do synchronous tlb flush

Tada, no more 100ms stall in your buffer move callbacks. And feel free to
pack up 2&3 into an async worker or something if it takes too long and
treating it as a bo move dma_fence is better. Also that way you might be
able to batch up the tlb flushing if it's too damn expensive, by
collecting them all under a single dma_fence (and starting a new tlb flush
cycle every time ->enable_signalling gets called).
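
A rough pseudo-code sketch of that move path (all helper names here, such
as wait_user_fence_with_tdr, clear_ptes and tlb_flush_sync, are made-up
placeholders, not existing amdgpu/TTM functions):

/* Hypothetical sketch of the hmm-style move sequence above. */
static int move_bo_hmm_style(struct my_bo *bo)
{
	int ret;

	/* 1. Wait for the userspace fence wrapped as a fake dma-fence; a
	 *    tdr timeout makes this safe against stuck userspace. */
	ret = wait_user_fence_with_tdr(bo->user_fence, TDR_TIMEOUT_MS);
	if (ret)
		return ret;

	/* 2. Remove the PTEs for the buffer. */
	clear_ptes(bo->vm, bo->gpu_addr, bo->size);

	/* 3. Synchronous TLB flush. Steps 2&3 could instead run in an
	 *    async worker behind a bo-move dma_fence, batching several
	 *    flushes under one fence and starting a new flush cycle from
	 *    its ->enable_signaling callback. */
	tlb_flush_sync(bo->vm);

	return 0;
}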

As long as you nack any gpu faults and don't try to fill them for these
legacy contexts that support dma-fence there's no harm in using the hw
facilities.

Ofc if you're now telling me your synchronous tlb flush is also 100ms,
then maybe just throw the hw out the window, and accept that the
millisecond anything evicts anything (good luck with userptr) the screen
freezes for a bit.

> > So by and large everyone should already be able to cope with it taking a
> > bit longer. So from a design pov I don't see a huge problem, but I
> > guess you guys wont be happy since it means on amd hw there will be
> > random unsightly stalls in desktop linux usage.
> > 
> > > > > The "preemption" feature is really called suspend and made just for 
> > > > > the case
> > > > > when we want to put a process to sleep or need to forcefully kill it 
> > > > > for
> > > > > misbehavior or stuff like that. It is not meant to be used in normal
> > > > > operation.
> > > > > 
> > > > > If we only attach it on ->move then yeah maybe a last resort 
> > > > > possibility to
> > > > > do it this way, but I think in that case we could rather stick with 
> > > > > kernel
> > > > > submissions.
> > > > Well this is a hybrid userspace ring + kernel augmeted submit mode, so 
> > > > you
> > > > can keep dma-fences working. Because the dma-fence stuff wont work with
> > > > pure userspace submit, I think that conclusion is rather solid. Once 
> > > > more
> > > > even after this long thread here.
> > > When assisted with unload fences, then yes. Problem is that I can't see
> > > how we could implement those performant currently.
> > Is there really no way to fix fw here? Like if process start/teardown
> > takes 100ms, that's going to suck no matter what.
> 
> As I said 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Christian König

Am 04.05.21 um 13:13 schrieb Daniel Vetter:

On Tue, May 4, 2021 at 12:53 PM Christian König wrote:

Am 04.05.21 um 11:47 schrieb Daniel Vetter:

[SNIP]

Yeah, it just takes to long for the preemption to complete to be really
useful for the feature we are discussing here.

As I said when the kernel requests to preempt a queue we can easily expect a
timeout of ~100ms until that comes back. For compute that is even in the
multiple seconds range.

100ms for preempting an idle request sounds like broken hw to me. Of
course preemting something that actually runs takes a while, that's
nothing new. But it's also not the thing we're talking about here. Is this
100ms actual numbers from hw for an actual idle ringbuffer?

Well 100ms is just an example of the scheduler granularity. Let me
explain in a wider context.

The hardware can have X queues mapped at the same time and every Y time
interval the hardware scheduler checks if those queues have changed and
only if they have changed the necessary steps to reload them are started.

Multiple queues can be rendering at the same time, so you can have X as
a high priority queue active and just waiting for a signal to start and
the client rendering one frame after another and a third background
compute task mining bitcoins for you.

As long as everything is static this is perfectly performant. Adding a
queue to the list of active queues is also relatively simple, but taking
one down requires you to wait until we are sure the hardware has seen
the change and reloaded the queues.

Think of it as an RCU grace period. This is simply not something which
is made to be used constantly, but rather just at process termination.

Uh ... that indeed sounds rather broken.


Well I wouldn't call it broken. It's just not made for the use case we 
are trying to abuse it for.



Otoh it's just a dma_fence that'd we'd inject as this unload-fence.


Yeah, exactly that's why it isn't much of a problem for process 
termination or freeing memory.



So by and large everyone should already be able to cope with it taking a
bit longer. So from a design pov I don't see a huge problem, but I
guess you guys wont be happy since it means on amd hw there will be
random unsightly stalls in desktop linux usage.


The "preemption" feature is really called suspend and made just for the case
when we want to put a process to sleep or need to forcefully kill it for
misbehavior or stuff like that. It is not meant to be used in normal
operation.

If we only attach it on ->move then yeah maybe a last resort possibility to
do it this way, but I think in that case we could rather stick with kernel
submissions.

Well this is a hybrid userspace ring + kernel augmeted submit mode, so you
can keep dma-fences working. Because the dma-fence stuff wont work with
pure userspace submit, I think that conclusion is rather solid. Once more
even after this long thread here.

When assisted with unload fences, then yes. Problem is that I can't see
how we could implement those performant currently.

Is there really no way to fix fw here? Like if process start/teardown
takes 100ms, that's going to suck no matter what.


As I said adding the queue is unproblematic and teardown just results in 
a bit more waiting to free things up.


More problematic are overcommit, swapping and OOM situations, which need
to wait for the hw scheduler to come back and tell us that the queue is
now unmapped.



Also, if userspace lies to us and keeps pushing crap into the ring
after it's supposed to be idle: Userspace is already allowed to waste
gpu time. If you're too worried about this set a fairly aggressive
preempt timeout on the unload fence, and kill the context if it takes
longer than what preempting an idle ring should take (because that
would indicate broken/evil userspace).

I think you have the wrong expectation here. It is perfectly valid and
expected for userspace to keep writing commands into the ring buffer.

After all when one frame is completed they want to immediately start
rendering the next one.

Sure, for the true userspace direct submit model. But with that you don't
get dma-fence, which means this gpu will not work for 3d accel on any
current linux desktop.

I'm not sure of that. I've looked a bit into how we could add user
fences to dma_resv objects and that isn't that hard after all.

I think as a proof of concept it's fine, but as an actual solution ...
pls no. Two reasons:
- implicit sync is bad


Well can't disagree with that :) But I think we can't avoid supporting it.


- this doesn't fix anything for explicit sync using dma_fence in terms
of sync_file or drm_syncobj.


Exactly.

Whether we do implicit sync or explicit sync is orthogonal to the problem
that sync must be made reliable somehow.


So when we sync and time out, the waiter should just continue, but whoever
failed to signal will be punished.


But since this isn't solved on Windows I don't see how we can solve it 
on Linux either.



So if we go with the route of 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Daniel Vetter
On Tue, May 4, 2021 at 12:53 PM Christian König wrote:
>
> Am 04.05.21 um 11:47 schrieb Daniel Vetter:
> > [SNIP]
> >> Yeah, it just takes to long for the preemption to complete to be really
> >> useful for the feature we are discussing here.
> >>
> >> As I said when the kernel requests to preempt a queue we can easily expect 
> >> a
> >> timeout of ~100ms until that comes back. For compute that is even in the
> >> multiple seconds range.
> > 100ms for preempting an idle request sounds like broken hw to me. Of
> > course preemting something that actually runs takes a while, that's
> > nothing new. But it's also not the thing we're talking about here. Is this
> > 100ms actual numbers from hw for an actual idle ringbuffer?
>
> Well 100ms is just an example of the scheduler granularity. Let me
> explain in a wider context.
>
> The hardware can have X queues mapped at the same time and every Y time
> interval the hardware scheduler checks if those queues have changed and
> only if they have changed the necessary steps to reload them are started.
>
> Multiple queues can be rendering at the same time, so you can have X as
> a high priority queue active and just waiting for a signal to start and
> the client rendering one frame after another and a third background
> compute task mining bitcoins for you.
>
> As long as everything is static this is perfectly performant. Adding a
> queue to the list of active queues is also relatively simple, but taking
> one down requires you to wait until we are sure the hardware has seen
> the change and reloaded the queues.
>
> Think of it as an RCU grace period. This is simply not something which
> is made to be used constantly, but rather just at process termination.

Uh ... that indeed sounds rather broken.

Otoh it's just a dma_fence that we'd inject as this unload-fence. So
by and large everyone should already be able to cope with it taking a
bit longer. So from a design pov I don't see a huge problem, but I
guess you guys won't be happy since it means on amd hw there will be
random unsightly stalls in desktop linux usage.

> >> The "preemption" feature is really called suspend and made just for the 
> >> case
> >> when we want to put a process to sleep or need to forcefully kill it for
> >> misbehavior or stuff like that. It is not meant to be used in normal
> >> operation.
> >>
> >> If we only attach it on ->move then yeah maybe a last resort possibility to
> >> do it this way, but I think in that case we could rather stick with kernel
> >> submissions.
> > Well this is a hybrid userspace ring + kernel augmeted submit mode, so you
> > can keep dma-fences working. Because the dma-fence stuff wont work with
> > pure userspace submit, I think that conclusion is rather solid. Once more
> > even after this long thread here.
>
> When assisted with unload fences, then yes. Problem is that I can't see
> how we could implement those performant currently.

Is there really no way to fix fw here? Like if process start/teardown
takes 100ms, that's going to suck no matter what.

> >>> Also, if userspace lies to us and keeps pushing crap into the ring
> >>> after it's supposed to be idle: Userspace is already allowed to waste
> >>> gpu time. If you're too worried about this set a fairly aggressive
> >>> preempt timeout on the unload fence, and kill the context if it takes
> >>> longer than what preempting an idle ring should take (because that
> >>> would indicate broken/evil userspace).
> >> I think you have the wrong expectation here. It is perfectly valid and
> >> expected for userspace to keep writing commands into the ring buffer.
> >>
> >> After all when one frame is completed they want to immediately start
> >> rendering the next one.
> > Sure, for the true userspace direct submit model. But with that you don't
> > get dma-fence, which means this gpu will not work for 3d accel on any
> > current linux desktop.
>
> I'm not sure of that. I've looked a bit into how we could add user
> fences to dma_resv objects and that isn't that hard after all.

I think as a proof of concept it's fine, but as an actual solution ...
pls no. Two reasons:
- implicit sync is bad
- this doesn't fix anything for explicit sync using dma_fence in terms
of sync_file or drm_syncobj.

So if we go with the route of papering over this in the kernel, then
it'll be a ton more work than just hacking something into dma_resv.

> > Which sucks, hence some hybrid model of using the userspace ring and
> > kernel augmented submit is needed. Which was my idea.
>
> Yeah, I think when our firmware folks would really remove the kernel
> queue and we still don't have

Yeah, I think the kernel queue can be removed. But the price is that you
need reasonably fast preemption of idle contexts.

I really can't understand how this can take multiple ms, something
feels very broken in the design of the fw (since obviously the hw can
preempt an idle context to another one pretty fast, or you'd render
any multi-client desktop as a 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Christian König

Am 04.05.21 um 11:47 schrieb Daniel Vetter:

[SNIP]

Yeah, it just takes to long for the preemption to complete to be really
useful for the feature we are discussing here.

As I said when the kernel requests to preempt a queue we can easily expect a
timeout of ~100ms until that comes back. For compute that is even in the
multiple seconds range.

100ms for preempting an idle request sounds like broken hw to me. Of
course preemting something that actually runs takes a while, that's
nothing new. But it's also not the thing we're talking about here. Is this
100ms actual numbers from hw for an actual idle ringbuffer?


Well 100ms is just an example of the scheduler granularity. Let me 
explain in a wider context.


The hardware can have X queues mapped at the same time, and every Y time
interval the hardware scheduler checks whether those queues have changed;
only if they have changed are the necessary steps to reload them started.


Multiple queues can be rendering at the same time, so among those X you can
have a high priority queue that is active and just waiting for a signal to
start, the client rendering one frame after another, and a third background
compute task mining bitcoins for you.


As long as everything is static this is perfectly performant. Adding a 
queue to the list of active queues is also relatively simple, but taking 
one down requires you to wait until we are sure the hardware has seen 
the change and reloaded the queues.


Think of it as an RCU grace period. This is simply not something which 
is made to be used constantly, but rather just at process termination.
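
A hypothetical pseudo-code sketch of that asymmetry; the structs and
helpers below are invented placeholders, not the real firmware interface:

/* Adding a queue is cheap, removing one behaves like an RCU grace period. */
static void map_user_queue(struct hw_sched *sched, struct user_queue *q)
{
	/* The hw scheduler picks the new queue up on its next pass over
	 * the mapped-queue list (every Y interval). */
	list_add_tail(&q->node, &sched->mapped_queues);
}

static void unmap_user_queue(struct hw_sched *sched, struct user_queue *q)
{
	list_del(&q->node);

	/* Like synchronize_rcu(): wait until the hw scheduler has seen the
	 * change and reloaded its queues (~100ms, multiple seconds for
	 * compute), which is why this is only meant for process
	 * termination. */
	wait_for_hw_sched_ack(sched, q);
}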



The "preemption" feature is really called suspend and made just for the case
when we want to put a process to sleep or need to forcefully kill it for
misbehavior or stuff like that. It is not meant to be used in normal
operation.

If we only attach it on ->move then yeah maybe a last resort possibility to
do it this way, but I think in that case we could rather stick with kernel
submissions.

Well this is a hybrid userspace ring + kernel augmeted submit mode, so you
can keep dma-fences working. Because the dma-fence stuff wont work with
pure userspace submit, I think that conclusion is rather solid. Once more
even after this long thread here.


When assisted with unload fences, then yes. The problem is that I can't see
how we could implement those performantly at the moment.



Also, if userspace lies to us and keeps pushing crap into the ring
after it's supposed to be idle: Userspace is already allowed to waste
gpu time. If you're too worried about this set a fairly aggressive
preempt timeout on the unload fence, and kill the context if it takes
longer than what preempting an idle ring should take (because that
would indicate broken/evil userspace).

I think you have the wrong expectation here. It is perfectly valid and
expected for userspace to keep writing commands into the ring buffer.

After all when one frame is completed they want to immediately start
rendering the next one.

Sure, for the true userspace direct submit model. But with that you don't
get dma-fence, which means this gpu will not work for 3d accel on any
current linux desktop.


I'm not sure of that. I've looked a bit into how we could add user 
fences to dma_resv objects and that isn't that hard after all.



Which sucks, hence some hybrid model of using the userspace ring and
kernel augmented submit is needed. Which was my idea.


Yeah, I think when our firmware folks would really remove the kernel 
queue and we still don't have





[SNIP]
Can't find that of hand either, but see the amdgpu_noretry module option.

It basically tells the hardware if retry page faults should be supported or
not because this whole TLB shutdown thing when they are supported is
extremely costly.

Hm so synchronous tlb shootdown is a lot more costly when you allow
retrying of page faults?


Partially correct, yes.

See, when you have retry page faults enabled and unmap something, you need
to make sure that everybody who could have potentially translated that page
and still holds a TLB entry is either invalidated, or you wait until the
access has completed.


And since every CU could be using that memory location, this takes ages to
complete compared to the normal invalidation, where you just invalidate the
L1/L2 and are done.


In addition to that, the recovery adds some extra overhead to every memory
access, so even without a fault you are quite a bit slower if this is
enabled.



That sounds bad, because for full hmm mode you need to be able to retry
pagefaults. Well at least the PASID/ATS/IOMMU side will do that, and might just
hang your gpu for a long time while it's waiting for the va->pa lookup
response to return. So retrying lookups shouldn't be any different really.

And you also need fairly fast synchronous tlb shootdown for hmm. So if
your hw has a problem with both together that sounds bad.


Completely agree. And since it was my job to validate the implementation 
on Vega10 I was also the first one to 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Daniel Vetter
On Tue, May 04, 2021 at 11:14:06AM +0200, Christian König wrote:
> Am 04.05.21 um 10:27 schrieb Daniel Vetter:
> > On Tue, May 4, 2021 at 10:09 AM Christian König
> >  wrote:
> > > Am 04.05.21 um 09:32 schrieb Daniel Vetter:
> > > > On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:
> > > > > Unfortunately as I pointed out to Daniel as well this won't work 100%
> > > > > reliable either.
> > > > You're claiming this, but there's no clear reason why really, and you
> > > > did't reply to my last mail on that sub-thread, so I really don't get
> > > > where exactly you're seeing a problem.
> > > Yeah, it's rather hard to explain without pointing out how the hardware
> > > works in detail.
> > > 
> > > > > See the signal on the ring buffer needs to be protected by 
> > > > > manipulation from
> > > > > userspace so that we can guarantee that the hardware really has 
> > > > > finished
> > > > > executing when it fires.
> > > > Nope you don't. Userspace is already allowed to submit all kinds of 
> > > > random
> > > > garbage, the only thing the kernel has to guarnatee is:
> > > > - the dma-fence DAG stays a DAG
> > > > - dma-fence completes in finite time
> > > > 
> > > > Everything else is not the kernel's problem, and if userspace mixes 
> > > > stuff
> > > > up like manipulates the seqno, that's ok. It can do that kind of garbage
> > > > already.
> > > > 
> > > > > Protecting memory by immediate page table updates is a good first 
> > > > > step, but
> > > > > unfortunately not sufficient (and we would need to restructure large 
> > > > > parts
> > > > > of the driver to make this happen).
> > > > This is why you need the unload-fence on top, because indeed you can't
> > > > just rely on the fences created from the userspace ring, those are
> > > > unreliable for memory management.
> > > And exactly that's the problem! We can't provide a reliable unload-fence
> > > and the user fences are unreliable for that.
> > > 
> > > I've talked this through lengthy with our hardware/firmware guy last
> > > Thursday but couldn't find a solution either.
> > > 
> > > We can have a preemption fence for the kernel which says: Hey this queue
> > > was scheduled away you can touch it's hardware descriptor, control
> > > registers, page tables, TLB, memory, GWS, GDS, OA etc etc etc... again.
> > > But that one is only triggered on preemption and then we have the same
> > > ordering problems once more.
> > > 
> > > Or we can have a end of operation fence for userspace which says: Hey
> > > this queue has finished it's batch of execution, but this one is
> > > manipulable from userspace in both finish to early (very very bad for
> > > invalidations and memory management) or finish to late/never (deadlock
> > > prone but fixable by timeout).
> > > 
> > > What we could do is to use the preemption fence to emulate the unload
> > > fence, e.g. something like:
> > > 1. Preempt the queue in fixed intervals (let's say 100ms).
> > > 2. While preempted check if we have reached the checkpoint in question
> > > by looking at the hardware descriptor.
> > > 3. If we have reached the checkpoint signal the unload fence.
> > > 4. If we haven't reached the checkpoint resume the queue again.
> > > 
> > > The problem is that this might introduce a maximum of 100ms delay before
> > > signaling the unload fence and preempt/resume has such a hefty overhead
> > > that we waste a horrible amount of time on it.
> > So your hw can preempt? That's good enough.
> > 
> > The unload fence is just
> > 1. wait for all dma_fence that are based on the userspace ring. This
> > is unreliable, but we don't care because tdr will make it reliable.
> > And once tdr shot down a context we'll force-unload and thrash it
> > completely, which solves the problem.
> > 2. preempt the context, which /should/ now be stuck waiting for more
> > commands to be stuffed into the ringbuffer. Which means your
> > preemption is hopefully fast enough to not matter. If your hw takes
> > forever to preempt an idle ring, I can't help you :-)
> 
> Yeah, it just takes to long for the preemption to complete to be really
> useful for the feature we are discussing here.
> 
> As I said when the kernel requests to preempt a queue we can easily expect a
> timeout of ~100ms until that comes back. For compute that is even in the
> multiple seconds range.

100ms for preempting an idle request sounds like broken hw to me. Of
course preempting something that actually runs takes a while, that's
nothing new. But it's also not the thing we're talking about here. Is this
100ms an actual number from hw for an actual idle ringbuffer?

> The "preemption" feature is really called suspend and made just for the case
> when we want to put a process to sleep or need to forcefully kill it for
> misbehavior or stuff like that. It is not meant to be used in normal
> operation.
> 
> If we only attach it on ->move then yeah maybe a last resort possibility to
> do it this way, but I think in that case we 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Christian König

Am 04.05.21 um 10:27 schrieb Daniel Vetter:

On Tue, May 4, 2021 at 10:09 AM Christian König wrote:

Am 04.05.21 um 09:32 schrieb Daniel Vetter:

On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:

Unfortunately as I pointed out to Daniel as well this won't work 100%
reliable either.

You're claiming this, but there's no clear reason why really, and you
did't reply to my last mail on that sub-thread, so I really don't get
where exactly you're seeing a problem.

Yeah, it's rather hard to explain without pointing out how the hardware
works in detail.


See the signal on the ring buffer needs to be protected by manipulation from
userspace so that we can guarantee that the hardware really has finished
executing when it fires.

Nope you don't. Userspace is already allowed to submit all kinds of random
garbage, the only thing the kernel has to guarnatee is:
- the dma-fence DAG stays a DAG
- dma-fence completes in finite time

Everything else is not the kernel's problem, and if userspace mixes stuff
up like manipulates the seqno, that's ok. It can do that kind of garbage
already.


Protecting memory by immediate page table updates is a good first step, but
unfortunately not sufficient (and we would need to restructure large parts
of the driver to make this happen).

This is why you need the unload-fence on top, because indeed you can't
just rely on the fences created from the userspace ring, those are
unreliable for memory management.

And exactly that's the problem! We can't provide a reliable unload-fence
and the user fences are unreliable for that.

I've talked this through lengthy with our hardware/firmware guy last
Thursday but couldn't find a solution either.

We can have a preemption fence for the kernel which says: Hey this queue
was scheduled away you can touch it's hardware descriptor, control
registers, page tables, TLB, memory, GWS, GDS, OA etc etc etc... again.
But that one is only triggered on preemption and then we have the same
ordering problems once more.

Or we can have a end of operation fence for userspace which says: Hey
this queue has finished it's batch of execution, but this one is
manipulable from userspace in both finish to early (very very bad for
invalidations and memory management) or finish to late/never (deadlock
prone but fixable by timeout).

What we could do is to use the preemption fence to emulate the unload
fence, e.g. something like:
1. Preempt the queue in fixed intervals (let's say 100ms).
2. While preempted check if we have reached the checkpoint in question
by looking at the hardware descriptor.
3. If we have reached the checkpoint signal the unload fence.
4. If we haven't reached the checkpoint resume the queue again.

The problem is that this might introduce a maximum of 100ms delay before
signaling the unload fence and preempt/resume has such a hefty overhead
that we waste a horrible amount of time on it.

So your hw can preempt? That's good enough.

The unload fence is just
1. wait for all dma_fence that are based on the userspace ring. This
is unreliable, but we don't care because tdr will make it reliable.
And once tdr shot down a context we'll force-unload and thrash it
completely, which solves the problem.
2. preempt the context, which /should/ now be stuck waiting for more
commands to be stuffed into the ringbuffer. Which means your
preemption is hopefully fast enough to not matter. If your hw takes
forever to preempt an idle ring, I can't help you :-)


Yeah, it just takes too long for the preemption to complete to be really
useful for the feature we are discussing here.


As I said when the kernel requests to preempt a queue we can easily 
expect a timeout of ~100ms until that comes back. For compute that is 
even in the multiple seconds range.


The "preemption" feature is really called suspend and made just for the 
case when we want to put a process to sleep or need to forcefully kill 
it for misbehavior or stuff like that. It is not meant to be used in 
normal operation.


If we only attach it on ->move then yeah maybe a last resort possibility 
to do it this way, but I think in that case we could rather stick with 
kernel submissions.



Also, if userspace lies to us and keeps pushing crap into the ring
after it's supposed to be idle: Userspace is already allowed to waste
gpu time. If you're too worried about this set a fairly aggressive
preempt timeout on the unload fence, and kill the context if it takes
longer than what preempting an idle ring should take (because that
would indicate broken/evil userspace).


I think you have the wrong expectation here. It is perfectly valid and 
expected for userspace to keep writing commands into the ring buffer.


After all when one frame is completed they want to immediately start 
rendering the next one.



Again, I'm not seeing the problem. Except if your hw is really
completely busted to the point where it can't even support userspace
ringbuffers properly and with sufficient performance :-P

Of 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Daniel Vetter
On Tue, May 4, 2021 at 10:09 AM Christian König wrote:
>
> Am 04.05.21 um 09:32 schrieb Daniel Vetter:
> > On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:
> >> Unfortunately as I pointed out to Daniel as well this won't work 100%
> >> reliable either.
> > You're claiming this, but there's no clear reason why really, and you
> > did't reply to my last mail on that sub-thread, so I really don't get
> > where exactly you're seeing a problem.
>
> Yeah, it's rather hard to explain without pointing out how the hardware
> works in detail.
>
> >> See the signal on the ring buffer needs to be protected by manipulation 
> >> from
> >> userspace so that we can guarantee that the hardware really has finished
> >> executing when it fires.
> > Nope you don't. Userspace is already allowed to submit all kinds of random
> > garbage, the only thing the kernel has to guarnatee is:
> > - the dma-fence DAG stays a DAG
> > - dma-fence completes in finite time
> >
> > Everything else is not the kernel's problem, and if userspace mixes stuff
> > up like manipulates the seqno, that's ok. It can do that kind of garbage
> > already.
> >
> >> Protecting memory by immediate page table updates is a good first step, but
> >> unfortunately not sufficient (and we would need to restructure large parts
> >> of the driver to make this happen).
> > This is why you need the unload-fence on top, because indeed you can't
> > just rely on the fences created from the userspace ring, those are
> > unreliable for memory management.
>
> And exactly that's the problem! We can't provide a reliable unload-fence
> and the user fences are unreliable for that.
>
> I've talked this through lengthy with our hardware/firmware guy last
> Thursday but couldn't find a solution either.
>
> We can have a preemption fence for the kernel which says: Hey this queue
> was scheduled away you can touch it's hardware descriptor, control
> registers, page tables, TLB, memory, GWS, GDS, OA etc etc etc... again.
> But that one is only triggered on preemption and then we have the same
> ordering problems once more.
>
> Or we can have a end of operation fence for userspace which says: Hey
> this queue has finished it's batch of execution, but this one is
> manipulable from userspace in both finish to early (very very bad for
> invalidations and memory management) or finish to late/never (deadlock
> prone but fixable by timeout).
>
> What we could do is to use the preemption fence to emulate the unload
> fence, e.g. something like:
> 1. Preempt the queue in fixed intervals (let's say 100ms).
> 2. While preempted check if we have reached the checkpoint in question
> by looking at the hardware descriptor.
> 3. If we have reached the checkpoint signal the unload fence.
> 4. If we haven't reached the checkpoint resume the queue again.
>
> The problem is that this might introduce a maximum of 100ms delay before
> signaling the unload fence and preempt/resume has such a hefty overhead
> that we waste a horrible amount of time on it.

So your hw can preempt? That's good enough.

The unload fence is just
1. wait for all dma_fence that are based on the userspace ring. This
is unreliable, but we don't care because tdr will make it reliable.
And once tdr shot down a context we'll force-unload and thrash it
completely, which solves the problem.
2. preempt the context, which /should/ now be stuck waiting for more
commands to be stuffed into the ringbuffer. Which means your
preemption is hopefully fast enough to not matter. If your hw takes
forever to preempt an idle ring, I can't help you :-)

Also, if userspace lies to us and keeps pushing crap into the ring
after it's supposed to be idle: Userspace is already allowed to waste
gpu time. If you're too worried about this set a fairly aggressive
preempt timeout on the unload fence, and kill the context if it takes
longer than what preempting an idle ring should take (because that
would indicate broken/evil userspace).
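
A hypothetical pseudo-code sketch of that unload-fence recipe; the helper
names are invented placeholders:

/* Sketch of the two-step unload fence described above. */
static int make_ctx_unloadable(struct my_ctx *ctx)
{
	/* 1. Wait for all dma_fences based on the userspace ring. This is
	 *    unreliable on its own, but tdr makes it reliable: once tdr
	 *    shoots a context down it gets force-unloaded and thrashed. */
	wait_userspace_ring_fences(ctx);

	/* 2. Preempt the context, which should now just be idling and
	 *    waiting for more ring commands, so preemption is cheap. Use
	 *    an aggressive timeout and kill the context if preemption
	 *    takes longer than an idle ring should need, since that would
	 *    indicate broken/evil userspace. */
	return preempt_ctx_or_kill(ctx, IDLE_PREEMPT_TIMEOUT_MS);
}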

Again, I'm not seeing the problem. Except if your hw is really
completely busted to the point where it can't even support userspace
ringbuffers properly and with sufficient performance :-P

Of course if you issue the preempt context request before the
userspace fences have finished (or tdr cleaned up the mess) like you
do in your proposal, then it will be ridiculously expensive and/or
won't work. So just don't do that.

> > btw I thought some more, and I think it's probably best if we only attach
> > the unload-fence in the ->move(_notify) callbacks. Kinda like we already
> > do for async copy jobs. So the overall buffer move sequence would be:
> >
> > 1. wait for (untrusted for kernel, but necessary for userspace
> > correctness) fake dma-fence that rely on the userspace ring
> >
> > 2. unload ctx
> >
> > 3. copy buffer
> >
> > Ofc 2&3 would be done async behind a dma_fence.
> >
> >> On older hardware we often had the situation that for reliable invalidation
> >> we need the guarantee that every previous 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Christian König

Am 04.05.21 um 09:32 schrieb Daniel Vetter:

On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:

Unfortunately as I pointed out to Daniel as well this won't work 100%
reliable either.

You're claiming this, but there's no clear reason why really, and you
did't reply to my last mail on that sub-thread, so I really don't get
where exactly you're seeing a problem.


Yeah, it's rather hard to explain without pointing out how the hardware 
works in detail.



See the signal on the ring buffer needs to be protected by manipulation from
userspace so that we can guarantee that the hardware really has finished
executing when it fires.

Nope you don't. Userspace is already allowed to submit all kinds of random
garbage, the only thing the kernel has to guarnatee is:
- the dma-fence DAG stays a DAG
- dma-fence completes in finite time

Everything else is not the kernel's problem, and if userspace mixes stuff
up like manipulates the seqno, that's ok. It can do that kind of garbage
already.


Protecting memory by immediate page table updates is a good first step, but
unfortunately not sufficient (and we would need to restructure large parts
of the driver to make this happen).

This is why you need the unload-fence on top, because indeed you can't
just rely on the fences created from the userspace ring, those are
unreliable for memory management.


And exactly that's the problem! We can't provide a reliable unload-fence 
and the user fences are unreliable for that.


I've talked this through at length with our hardware/firmware guy last
Thursday but couldn't find a solution either.


We can have a preemption fence for the kernel which says: Hey, this queue
was scheduled away, you can touch its hardware descriptor, control
registers, page tables, TLB, memory, GWS, GDS, OA etc. again.
But that one is only triggered on preemption, and then we have the same
ordering problems once more.


Or we can have an end-of-operation fence for userspace which says: Hey,
this queue has finished its batch of execution. But this one is
manipulable from userspace, both to finish too early (very, very bad for
invalidations and memory management) and to finish too late/never
(deadlock prone, but fixable by timeout).


What we could do is to use the preemption fence to emulate the unload 
fence, e.g. something like:

1. Preempt the queue in fixed intervals (let's say 100ms).
2. While preempted check if we have reached the checkpoint in question 
by looking at the hardware descriptor.

3. If we have reached the checkpoint signal the unload fence.
4. If we haven't reached the checkpoint resume the queue again.

The problem is that this might introduce a maximum of 100ms delay before 
signaling the unload fence and preempt/resume has such a hefty overhead 
that we waste a horrible amount of time on it.
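
As a hypothetical pseudo-code sketch (all helpers below are invented
placeholders), that emulation would be roughly:

/* Emulate an unload fence via periodic preemption, as in steps 1-4 above. */
static void emulate_unload_fence(struct user_queue *q, u64 checkpoint,
				 struct dma_fence *unload_fence)
{
	for (;;) {
		/* 1. Preempt (suspend) the queue at a fixed interval. */
		msleep(100);
		suspend_queue(q);

		/* 2. While preempted, check the hardware descriptor to see
		 *    whether the checkpoint has been reached. */
		if (hw_descriptor_read_pointer(q) >= checkpoint) {
			/* 3. Reached: signal the unload fence; the queue is
			 *    assumed to stay suspended so e.g. a buffer move
			 *    can proceed. */
			dma_fence_signal(unload_fence);
			return;
		}

		/* 4. Not reached yet: resume the queue and retry. This is
		 *    where the up-to-100ms latency and the costly
		 *    suspend/resume overhead come from. */
		resume_queue(q);
	}
}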




btw I thought some more, and I think it's probably best if we only attach
the unload-fence in the ->move(_notify) callbacks. Kinda like we already
do for async copy jobs. So the overall buffer move sequence would be:

1. wait for (untrusted for kernel, but necessary for userspace
correctness) fake dma-fence that rely on the userspace ring

2. unload ctx

3. copy buffer

Ofc 2&3 would be done async behind a dma_fence.


On older hardware we often had the situation that for reliable invalidation
we need the guarantee that every previous operation has finished executing.
It's not so much of a problem when the next operation has already started,
since then we had the opportunity to do things in between the last and the
next operation. Just see cache invalidation and VM switching for example.

If you have gpu page faults you generally have synchronous tlb
invalidation,


Please tell that to our hardware engineers :)

We have two modes of operation, see the whole XNACK on/off discussion on 
the amdgfx mailing list.



so this also shouldn't be a big problem. Combined with the
unload fence at least. If you don't have synchronous tlb invalidate it
gets a bit more nasty and you need to force a preemption to a kernel
context which has the required flushes across all the caches. Slightly
nasty, but the exact same thing would be required for handling page faults
anyway with the direct userspace submit model.

Again I'm not seeing a problem.


Additional to that it doesn't really buy us anything, e.g. there is not much
advantage to this. Writing the ring buffer in userspace and then ringing in
the kernel has the same overhead as doing everything in the kernel in the
first place.

It gets you dma-fence backwards compat without having to rewrite the
entire userspace ecosystem. Also since you have the hw already designed
for ringbuffer in userspace it would be silly to copy that through the cs
ioctl, that's just overhead.

Also I thought the problem you're having is that all the kernel ringbuf
stuff is going away, so the old cs ioctl wont work anymore for sure?


We still have a bit more time for this. As I learned from our firmware 
engineer 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Daniel Vetter
On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:
> Unfortunately as I pointed out to Daniel as well this won't work 100%
> reliable either.

You're claiming this, but there's no clear reason why really, and you
didn't reply to my last mail on that sub-thread, so I really don't get
where exactly you're seeing a problem.

> See the signal on the ring buffer needs to be protected by manipulation from
> userspace so that we can guarantee that the hardware really has finished
> executing when it fires.

Nope you don't. Userspace is already allowed to submit all kinds of random
garbage, the only thing the kernel has to guarantee is:
- the dma-fence DAG stays a DAG
- dma-fence completes in finite time

Everything else is not the kernel's problem, and if userspace mixes stuff
up like manipulates the seqno, that's ok. It can do that kind of garbage
already.

> Protecting memory by immediate page table updates is a good first step, but
> unfortunately not sufficient (and we would need to restructure large parts
> of the driver to make this happen).

This is why you need the unload-fence on top, because indeed you can't
just rely on the fences created from the userspace ring, those are
unreliable for memory management.

btw I thought some more, and I think it's probably best if we only attach
the unload-fence in the ->move(_notify) callbacks. Kinda like we already
do for async copy jobs. So the overall buffer move sequence would be:

1. wait for (untrusted for kernel, but necessary for userspace
correctness) fake dma-fence that rely on the userspace ring

2. unload ctx

3. copy buffer

Ofc 2&3 would be done async behind a dma_fence.
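
A hypothetical pseudo-code sketch of that sequence; the helper names are
invented placeholders:

/* Sketch of the ->move(_notify) sequence above. */
static struct dma_fence *bo_move_notify(struct my_bo *bo)
{
	struct dma_fence *move_fence;

	/* 1. Wait for the fake dma-fence built on the userspace ring:
	 *    untrusted from the kernel's point of view (tdr covers that),
	 *    but needed for userspace correctness. */
	wait_userspace_ring_fences(bo);

	/* 2. + 3. Unload the context and copy the buffer, both queued
	 *    asynchronously behind a single bo-move dma_fence. */
	move_fence = alloc_bo_move_fence(bo);
	queue_async_unload_and_copy(bo, move_fence);

	return move_fence;
}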

> On older hardware we often had the situation that for reliable invalidation
> we need the guarantee that every previous operation has finished executing.
> It's not so much of a problem when the next operation has already started,
> since then we had the opportunity to do things in between the last and the
> next operation. Just see cache invalidation and VM switching for example.

If you have gpu page faults you generally have synchronous tlb
invalidation, so this also shouldn't be a big problem. Combined with the
unload fence at least. If you don't have synchronous tlb invalidate it
gets a bit more nasty and you need to force a preemption to a kernel
context which has the required flushes across all the caches. Slightly
nasty, but the exact same thing would be required for handling page faults
anyway with the direct userspace submit model.

Again I'm not seeing a problem.

> Additional to that it doesn't really buy us anything, e.g. there is not much
> advantage to this. Writing the ring buffer in userspace and then ringing in
> the kernel has the same overhead as doing everything in the kernel in the
> first place.

It gets you dma-fence backwards compat without having to rewrite the
entire userspace ecosystem. Also since you have the hw already designed
for ringbuffer in userspace it would be silly to copy that through the cs
ioctl, that's just overhead.

Also I thought the problem you're having is that all the kernel ringbuf
stuff is going away, so the old cs ioctl won't work anymore for sure?

Maybe also pick up that other subthread which ended with my last reply.

Cheers, Daniel


> 
> Christian.
> 
> Am 04.05.21 um 05:11 schrieb Marek Olšák:
> > Proposal for a new CS ioctl, kernel pseudo code:
> > 
> > lock(_lock);
> > serial = get_next_serial(dev);
> > add_wait_command(ring, serial - 1);
> > add_exec_cmdbuf(ring, user_cmdbuf);
> > add_signal_command(ring, serial);
> > *ring->doorbell = FIRE;
> > unlock(_lock);
> > 
> > See? Just like userspace submit, but in the kernel without
> > concurrency/preemption. Is this now safe enough for dma_fence?
> > 
> > Marek
> > 
> > On Mon, May 3, 2021 at 4:36 PM Marek Olšák wrote:
> > 
> > What about direct submit from the kernel where the process still
> > has write access to the GPU ring buffer but doesn't use it? I
> > think that solves your preemption example, but leaves a potential
> > backdoor for a process to overwrite the signal commands, which
> > shouldn't be a problem since we are OK with timeouts.
> > 
> > Marek
> > 
> > On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand <ja...@jlekstrand.net> wrote:
> > 
> > On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen <b...@basnieuwenhuizen.nl> wrote:
> > >
> > > On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand <ja...@jlekstrand.net> wrote:
> > > >
> > > > Sorry for the top-post but there's no good thing to reply
> > to here...
> > > >
> > > > One of the things pointed out to me recently by Daniel
> > Vetter that I
> > > > didn't fully understand before is that dma_buf has a very
> > subtle
> > > > second requirement beyond finite time completion:  Nothing
> > required
> 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Christian König
Unfortunately, as I pointed out to Daniel as well, this won't work 100%
reliably either.


See, the signal on the ring buffer needs to be protected against
manipulation from userspace so that we can guarantee that the hardware
really has finished executing when it fires.


Protecting memory by immediate page table updates is a good first step, 
but unfortunately not sufficient (and we would need to restructure large 
parts of the driver to make this happen).


On older hardware we often had the situation that for reliable 
invalidation we need the guarantee that every previous operation has 
finished executing. It's not so much of a problem when the next 
operation has already started, since then we had the opportunity to do 
things in between the last and the next operation. Just see cache 
invalidation and VM switching for example.


In addition to that it doesn't really buy us anything, e.g. there is not
much of an advantage to this. Writing the ring buffer in userspace and then
ringing the doorbell in the kernel has the same overhead as doing everything
in the kernel in the first place.


Christian.

Am 04.05.21 um 05:11 schrieb Marek Olšák:

Proposal for a new CS ioctl, kernel pseudo code:

lock(_lock);
serial = get_next_serial(dev);
add_wait_command(ring, serial - 1);
add_exec_cmdbuf(ring, user_cmdbuf);
add_signal_command(ring, serial);
*ring->doorbell = FIRE;
unlock(_lock);

See? Just like userspace submit, but in the kernel without 
concurrency/preemption. Is this now safe enough for dma_fence?
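
A hypothetical C-style sketch of how such a serialized in-kernel submit
could back a dma_fence; struct my_dev, create_serial_fence and
ring_doorbell are invented placeholder names, and the serial written by
the signal command is what would eventually signal the fence:

/* Sketch only; mirrors the pseudo code above. */
static struct dma_fence *submit_cs(struct my_dev *dev, void *user_cmdbuf)
{
	struct dma_fence *fence;
	u64 serial;

	mutex_lock(&dev->cs_lock);
	serial = get_next_serial(dev);
	add_wait_command(dev->ring, serial - 1);  /* wait for the previous job */
	add_exec_cmdbuf(dev->ring, user_cmdbuf);  /* execute the user commands */
	add_signal_command(dev->ring, serial);    /* hw writes the serial when done */
	fence = create_serial_fence(dev, serial); /* signals once hw reaches serial */
	ring_doorbell(dev->ring);                 /* *ring->doorbell = FIRE */
	mutex_unlock(&dev->cs_lock);

	return fence;
}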


Marek

On Mon, May 3, 2021 at 4:36 PM Marek Olšák wrote:


What about direct submit from the kernel where the process still
has write access to the GPU ring buffer but doesn't use it? I
think that solves your preemption example, but leaves a potential
backdoor for a process to overwrite the signal commands, which
shouldn't be a problem since we are OK with timeouts.

Marek

On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand <ja...@jlekstrand.net> wrote:

On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen <b...@basnieuwenhuizen.nl> wrote:
>
> On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand <ja...@jlekstrand.net> wrote:
> >
> > Sorry for the top-post but there's no good thing to reply
to here...
> >
> > One of the things pointed out to me recently by Daniel
Vetter that I
> > didn't fully understand before is that dma_buf has a very
subtle
> > second requirement beyond finite time completion:  Nothing
required
> > for signaling a dma-fence can allocate memory. Why? 
Because the act
> > of allocating memory may wait on your dma-fence.  This, as
it turns
> > out, is a massively more strict requirement than finite time
> > completion and, I think, throws out all of the proposals
we have so
> > far.
> >
> > Take, for instance, Marek's proposal for userspace
involvement with
> > dma-fence by asking the kernel for a next serial and the
kernel
> > trusting userspace to signal it.  That doesn't work at all if
> > allocating memory to trigger a dma-fence can blow up. 
There's simply
> > no way for the kernel to trust userspace to not do
ANYTHING which
> > might allocate memory.  I don't even think there's a way
userspace can
> > trust itself there.  It also blows up my plan of moving
the fences to
> > transition boundaries.
> >
> > Not sure where that leaves us.
>
> Honestly the more I look at things I think
userspace-signalable fences
> with a timeout sound like they are a valid solution for
these issues.
> Especially since (as has been mentioned countless times in
this email
> thread) userspace already has a lot of ways to cause
timeouts and or
> GPU hangs through GPU work already.
>
> Adding a timeout on the signaling side of a dma_fence would
ensure:
>
> - The dma_fence signals in finite time
> -  If the timeout case does not allocate memory then memory
allocation
> is not a blocker for signaling.
>
> Of course you lose the full dependency graph and we need to
make sure
> garbage collection of fences works correctly when we have
cycles.
> However, the latter sounds very doable and the first sounds
like it is
> to some extent inevitable.
>
> I feel like I'm missing some requirement here given that we
> immediately went to much more complicated things but can't
find it.
> Thoughts?

Timeouts are sufficient to protect the kernel but they make
the fences
unpredictable and unreliable