Re: [Intel-gfx] [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-26 Thread Daniel Vetter
On Wed, May 26, 2021 at 3:32 PM Christian König
 wrote:
>
> Am 25.05.21 um 17:23 schrieb Daniel Vetter:
> > On Tue, May 25, 2021 at 5:05 PM Christian König
> >  wrote:
> >> Hi Daniel,
> >>
> >> Am 25.05.21 um 15:05 schrieb Daniel Vetter:
> >>> Hi Christian,
> >>>
> >>> On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:
>  Am 21.05.21 um 20:31 schrieb Daniel Vetter:
>  This works by adding the fence of the last eviction DMA operation to BOs
>  when their backing store is newly allocated. That's what the
>  ttm_bo_add_move_fence() function you stumbled over is good for: 
>  https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo.c#L692
> 
>  Now the problem is it is possible that the application is terminated 
>  before
>  it can complete it's command submission. But since resource management 
>  only
>  waits for the shared fences when there are some there is a chance that we
>  free up memory while it is still in use.
> >>> Hm where is this code? Would help in my audit that I wanted to do this
> >>> week? If you look at most other places like
> >>> drm_gem_fence_array_add_implicit() I mentioned earlier, then we don't
> >>> treat the shared fences special and always also include the exclusive one.
> >> See amdgpu_gem_object_close():
> >>
> >> ...
> >>   fence = dma_resv_get_excl(bo->tbo.base.resv);
> >>   if (fence) {
> >>   amdgpu_bo_fence(bo, fence, true);
> >>   fence = NULL;
> >>   }
> >> ...
> >>
> >> We explicitly added that because resource management of some other
> >> driver was going totally bananas without that.
> >>
> >> But I'm not sure which one that was. Maybe dig a bit in the git and
> >> mailing history of that.
> > Hm I looked and it's
> >
> > commit 82c416b13cb7d22b96ec0888b296a48dff8a09eb
> > Author: Christian König 
> > Date:   Thu Mar 12 12:03:34 2020 +0100
> >
> > drm/amdgpu: fix and cleanup amdgpu_gem_object_close v4
> >
> > That sounded more like amdgpu itself needing this, not another driver?
>
> No, that patch was just a follow up moving the functionality around.

That patch added the "add exclusive fence to shared slots before
amdgpu_vm_clear_freed() call", which I thought was at least part of
your fix.

> > But looking at amdgpu_vm_bo_update_mapping() it seems to pick the
> > right fencing mode for gpu pte clearing, so I'm really not sure what
> > the bug was that you worked around here?The implementation boils down
> > to amdgpu_sync_resv() which syncs for the exclusive fence, always. And
> > there's nothing else that I could find in public history at least, no
> > references to bug reports or anything. I think you need to dig
> > internally, because as-is I'm not seeing the problem here.
> >
> > Or am I missing something here?
>
> See the code here for example:
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/nouveau/nouveau_fence.c#L361
>
> Nouveau assumes that when a shared fence is present it doesn't need to
> wait for the exclusive one because the shared are always supposed to
> finish after the exclusive one.
>
> But for page table unmap fences that isn't true and we ran into a really
> nasty and hard to reproduce bug because of this.
>
> I think it would be much more defensive if we could say that we always
> wait for the exclusive fence and fix the use case in nouveau and double
> check if somebody else does stuff like that as well.

Yeah most other drivers do the defensive thing here. I think it would
be good to standardize on that. I'll add that to my list and do more
auditing.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-26 Thread Christian König

Am 25.05.21 um 17:23 schrieb Daniel Vetter:

On Tue, May 25, 2021 at 5:05 PM Christian König
 wrote:

Hi Daniel,

Am 25.05.21 um 15:05 schrieb Daniel Vetter:

Hi Christian,

On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:

Am 21.05.21 um 20:31 schrieb Daniel Vetter:
This works by adding the fence of the last eviction DMA operation to BOs
when their backing store is newly allocated. That's what the
ttm_bo_add_move_fence() function you stumbled over is good for: 
https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo.c#L692

Now the problem is it is possible that the application is terminated before
it can complete it's command submission. But since resource management only
waits for the shared fences when there are some there is a chance that we
free up memory while it is still in use.

Hm where is this code? Would help in my audit that I wanted to do this
week? If you look at most other places like
drm_gem_fence_array_add_implicit() I mentioned earlier, then we don't
treat the shared fences special and always also include the exclusive one.

See amdgpu_gem_object_close():

...
  fence = dma_resv_get_excl(bo->tbo.base.resv);
  if (fence) {
  amdgpu_bo_fence(bo, fence, true);
  fence = NULL;
  }
...

We explicitly added that because resource management of some other
driver was going totally bananas without that.

But I'm not sure which one that was. Maybe dig a bit in the git and
mailing history of that.

Hm I looked and it's

commit 82c416b13cb7d22b96ec0888b296a48dff8a09eb
Author: Christian König 
Date:   Thu Mar 12 12:03:34 2020 +0100

drm/amdgpu: fix and cleanup amdgpu_gem_object_close v4

That sounded more like amdgpu itself needing this, not another driver?


No, that patch was just a follow up moving the functionality around.


But looking at amdgpu_vm_bo_update_mapping() it seems to pick the
right fencing mode for gpu pte clearing, so I'm really not sure what
the bug was that you worked around here?The implementation boils down
to amdgpu_sync_resv() which syncs for the exclusive fence, always. And
there's nothing else that I could find in public history at least, no
references to bug reports or anything. I think you need to dig
internally, because as-is I'm not seeing the problem here.

Or am I missing something here?


See the code here for example: 
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/nouveau/nouveau_fence.c#L361


Nouveau assumes that when a shared fence is present it doesn't need to 
wait for the exclusive one because the shared are always supposed to 
finish after the exclusive one.


But for page table unmap fences that isn't true and we ran into a really 
nasty and hard to reproduce bug because of this.


I think it would be much more defensive if we could say that we always 
wait for the exclusive fence and fix the use case in nouveau and double 
check if somebody else does stuff like that as well.


Christian.


-Daniel


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-25 Thread Daniel Vetter
On Tue, May 25, 2021 at 5:05 PM Christian König
 wrote:
>
> Hi Daniel,
>
> Am 25.05.21 um 15:05 schrieb Daniel Vetter:
> > Hi Christian,
> >
> > On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:
> >> Am 21.05.21 um 20:31 schrieb Daniel Vetter:
> >> This works by adding the fence of the last eviction DMA operation to BOs
> >> when their backing store is newly allocated. That's what the
> >> ttm_bo_add_move_fence() function you stumbled over is good for: 
> >> https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo.c#L692
> >>
> >> Now the problem is it is possible that the application is terminated before
> >> it can complete it's command submission. But since resource management only
> >> waits for the shared fences when there are some there is a chance that we
> >> free up memory while it is still in use.
> > Hm where is this code? Would help in my audit that I wanted to do this
> > week? If you look at most other places like
> > drm_gem_fence_array_add_implicit() I mentioned earlier, then we don't
> > treat the shared fences special and always also include the exclusive one.
>
> See amdgpu_gem_object_close():
>
> ...
>  fence = dma_resv_get_excl(bo->tbo.base.resv);
>  if (fence) {
>  amdgpu_bo_fence(bo, fence, true);
>  fence = NULL;
>  }
> ...
>
> We explicitly added that because resource management of some other
> driver was going totally bananas without that.
>
> But I'm not sure which one that was. Maybe dig a bit in the git and
> mailing history of that.

Hm I looked and it's

commit 82c416b13cb7d22b96ec0888b296a48dff8a09eb
Author: Christian König 
Date:   Thu Mar 12 12:03:34 2020 +0100

   drm/amdgpu: fix and cleanup amdgpu_gem_object_close v4

That sounded more like amdgpu itself needing this, not another driver?

But looking at amdgpu_vm_bo_update_mapping() it seems to pick the
right fencing mode for gpu pte clearing, so I'm really not sure what
the bug was that you worked around here?The implementation boils down
to amdgpu_sync_resv() which syncs for the exclusive fence, always. And
there's nothing else that I could find in public history at least, no
references to bug reports or anything. I think you need to dig
internally, because as-is I'm not seeing the problem here.

Or am I missing something here?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-25 Thread Daniel Vetter
Hi Christian,

On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:
> Am 21.05.21 um 20:31 schrieb Daniel Vetter:
> > [SNIP]
> > > We could provide an IOCTL for the BO to change the flag.
> > That's not the semantics we need.
> > 
> > > But could we first figure out the semantics we want to use here?
> > > 
> > > Cause I'm pretty sure we don't actually need those changes at all and as
> > > said before I'm certainly NAKing things which break existing use cases.
> > Please read how other drivers do this and at least _try_ to understand
> > it. I'm really loosing my patience here with you NAKing patches you're
> > not even understanding (or did you actually read and fully understand
> > the entire story I typed up here, and your NAK is on the entire
> > thing?). There's not much useful conversation to be had with that
> > approach. And with drivers I mean kernel + userspace here.
> 
> Well to be honest I did fully read that, but I was just to emotionally
> attached to answer more appropriately in that moment.
> 
> And I'm sorry that I react emotional on that, but it is really frustrating
> that I'm not able to convince you that we have a major problem which affects
> all drivers and not just amdgpu.
> 
> Regarding the reason why I'm NAKing this particular patch, you are breaking
> existing uAPI for RADV with that. And as a maintainer of the driver I have
> simply no other choice than saying halt, stop we can't do it like this.
> 
> I'm perfectly aware that I've some holes in the understanding of how ANV or
> other Vulkan/OpenGL stacks work. But you should probably also admit that you
> have some holes how amdgpu works or otherwise I can't imagine why you
> suggest a patch which simply breaks RADV.
> 
> I mean we are working together for years now and I think you know me pretty
> well, do you really think I scream bloody hell we can't do this without a
> good reason?
> 
> So let's stop throwing halve backed solutions at each other and discuss what
> we can do to solve the different problems we are both seeing here.

Well this was meant to be a goal post/semantics discussion starter. Yes
the patch breaks performance (but not correctness) for amdgpu, but it also
contains my suggestion for how to fix that issue (in text form at least).

Plus a plan how to roll it out so that anyone who cares doesn't hit the
perf issues this patch can cause.

Also the overall series is really meant as a subsystem wide assessment of
the status quo. Aside from a correctness issue Lucas spotted in my panfrost
patches no substantial input from others on this yet unfortunately. I need
to poke more people I think.

Anyway since the plan as a text didn't stick I'm typing up now something
more detailed in form of amdgpu patches. Maybe Bas can do the radv
conversion too.

It won't be complete by far either (I'm not working for amd after all
...), I'll leave out the opengl/media side entirely. But if this works for
radv is should be a useful blueprint for gl/media too (with some changes
in the interfaces, but not really the exposed semantics).

> > That's the other frustration part: You're trying to fix this purely in
> > the kernel. This is exactly one of these issues why we require open
> > source userspace, so that we can fix the issues correctly across the
> > entire stack. And meanwhile you're steadfastily refusing to even look
> > at that the userspace side of the picture.
> 
> Well I do fully understand the userspace side of the picture for the AMD
> stack. I just don't think we should give userspace that much control over
> the fences in the dma_resv object without untangling them from resource
> management.
> 
> And RADV is exercising exclusive sync for amdgpu already. You can do
> submission to both the GFX, Compute and SDMA queues in Vulkan and those
> currently won't over-synchronize.
> 
> When you then send a texture generated by multiple engines to the Compositor
> the kernel will correctly inserts waits for all submissions of the other
> process.
> 
> So this already works for RADV and completely without the IOCTL Jason
> proposed. IIRC we also have unit tests which exercised that feature for the
> video decoding use case long before RADV even existed.

Yeah multiple engines on the producer side works fine with the current
scheme you have (if we ignore for now that the way amdgpu uses dma_resv is
different from other drivers by not setting the exclusive slots for
producers).

Where it breaks down is you have overlapping reads once a frame is
generated, on either side. E.g.

- compositors read the buffer as consumer
- but also producer reads the buffer again (maybe reference frame for
  media, or maybe for some post-processing like motion blurr or whatever).

Then your current scheme really badly oversyncs, while other drivers
don't have this issue. Fixing this for amdgpu will have quite a big impact
on how dma_resv will be used by amdgpu, and that's why I think before
we've looked at this it doesn't make sense to

Re: [Intel-gfx] [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-22 Thread Christian König

Am 21.05.21 um 20:31 schrieb Daniel Vetter:

[SNIP]

We could provide an IOCTL for the BO to change the flag.

That's not the semantics we need.


But could we first figure out the semantics we want to use here?

Cause I'm pretty sure we don't actually need those changes at all and as
said before I'm certainly NAKing things which break existing use cases.

Please read how other drivers do this and at least _try_ to understand
it. I'm really loosing my patience here with you NAKing patches you're
not even understanding (or did you actually read and fully understand
the entire story I typed up here, and your NAK is on the entire
thing?). There's not much useful conversation to be had with that
approach. And with drivers I mean kernel + userspace here.


Well to be honest I did fully read that, but I was just to emotionally 
attached to answer more appropriately in that moment.


And I'm sorry that I react emotional on that, but it is really 
frustrating that I'm not able to convince you that we have a major 
problem which affects all drivers and not just amdgpu.


Regarding the reason why I'm NAKing this particular patch, you are 
breaking existing uAPI for RADV with that. And as a maintainer of the 
driver I have simply no other choice than saying halt, stop we can't do 
it like this.


I'm perfectly aware that I've some holes in the understanding of how ANV 
or other Vulkan/OpenGL stacks work. But you should probably also admit 
that you have some holes how amdgpu works or otherwise I can't imagine 
why you suggest a patch which simply breaks RADV.


I mean we are working together for years now and I think you know me 
pretty well, do you really think I scream bloody hell we can't do this 
without a good reason?


So let's stop throwing halve backed solutions at each other and discuss 
what we can do to solve the different problems we are both seeing here.



That's the other frustration part: You're trying to fix this purely in
the kernel. This is exactly one of these issues why we require open
source userspace, so that we can fix the issues correctly across the
entire stack. And meanwhile you're steadfastily refusing to even look
at that the userspace side of the picture.


Well I do fully understand the userspace side of the picture for the AMD 
stack. I just don't think we should give userspace that much control 
over the fences in the dma_resv object without untangling them from 
resource management.


And RADV is exercising exclusive sync for amdgpu already. You can do 
submission to both the GFX, Compute and SDMA queues in Vulkan and those 
currently won't over-synchronize.


When you then send a texture generated by multiple engines to the 
Compositor the kernel will correctly inserts waits for all submissions 
of the other process.


So this already works for RADV and completely without the IOCTL Jason 
proposed. IIRC we also have unit tests which exercised that feature for 
the video decoding use case long before RADV even existed.


And yes I have to admit that I haven't thought about interaction with 
other drivers when I came up with this because the rules of that 
interaction wasn't clear to me at that time.



Also I thought through your tlb issue, why are you even putting these
tlb flush fences into the shard dma_resv slots? If you store them
somewhere else in the amdgpu private part, the oversync issues goes
away
- in your ttm bo move callback, you can just make your bo copy job
depend on them too (you have to anyway)
- even for p2p there's not an issue here, because you have the
->move_notify callback, and can then lift the tlb flush fences from
your private place to the shared slots so the exporter can see them.


Because adding a shared fence requires that this shared fence signals 
after the exclusive fence. And this is a perfect example to explain why 
this is so problematic and also why why we currently stumble over that 
only in amdgpu.


In TTM we have a feature which allows evictions to be pipelined and 
don't wait for the evicting DMA operation. Without that driver will 
stall waiting for their allocations to finish when we need to allocate 
memory.


For certain use cases this gives you a ~20% fps increase under memory 
pressure, so it is a really important feature.


This works by adding the fence of the last eviction DMA operation to BOs 
when their backing store is newly allocated. That's what the 
ttm_bo_add_move_fence() function you stumbled over is good for: 
https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo.c#L692


Now the problem is it is possible that the application is terminated 
before it can complete it's command submission. But since resource 
management only waits for the shared fences when there are some there is 
a chance that we free up memory while it is still in use.


Because of this we have some rather crude workarounds in amdgpu. For 
example IIRC we manual wait for any potential exclusive fence before 
freeing memory.


We cou

Re: [Intel-gfx] [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-21 Thread Daniel Vetter
On Fri, May 21, 2021 at 8:08 PM Christian König
 wrote:
>
> Am 21.05.21 um 17:16 schrieb Daniel Vetter:
> > On Fri, May 21, 2021 at 05:00:46PM +0200, Bas Nieuwenhuizen wrote:
> >> On Fri, May 21, 2021 at 4:37 PM Daniel Vetter  wrote:
> >>> On Fri, May 21, 2021 at 11:46:23AM +0200, Bas Nieuwenhuizen wrote:
>  On Fri, May 21, 2021 at 11:10 AM Daniel Vetter  
>  wrote:
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++--
> >   1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > index 88a24a0b5691..cc8426e1e8a8 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > @@ -617,8 +617,8 @@ static int amdgpu_cs_parser_bos(struct 
> > amdgpu_cs_parser *p,
> >  amdgpu_bo_list_for_each_entry(e, p->bo_list) {
> >  struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
> >
> > -   /* Make sure we use the exclusive slot for shared BOs */
> > -   if (bo->prime_shared_count)
> > +   /* Make sure we use the exclusive slot for all 
> > potentially shared BOs */
> > +   if (!(bo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID))
> >  e->tv.num_shared = 0;
>  I think it also makes sense to skip this with
>  AMDGPU_GEM_CREATE_EXPLICIT_SYNC? It can be shared but I don't think
>  anyone expects implicit sync to happen with those.
> >>> Ah yes, I missed this entirely. So the "no implicit flag" is already
> >>> there, and the _only_ thing that's missing really is a way to fish out the
> >>> implicit fences, and set them.
> >>>
> >>> https://lore.kernel.org/dri-devel/20210520190007.534046-1-ja...@jlekstrand.net/
> >>>
> >>> So I think all that's really needed in radv is not setting
> >>> RADEON_FLAG_IMPLICIT_SYNC for winsys buffers when Jason's dma-buf ioctl
> >>> are present (means you need to do some import/export and keep the fd
> >>> around for winsys buffers, but shouldn't be too bad), and then control the
> >>> implicit fences entirely explicitly like vk expects.
> >> That is the part I'm less sure about. This is a BO wide flag so we are
> >> also disabling implicit sync in the compositor. If the compositor does
> >> only do read stuff that is ok, as the inserted exclusive fence will
> >> work for that. But as I learned recently the app provided buffer may
> >> end up being written to by the X server which open a whole can of
> >> potential problems if implicit sync gets disabled between Xserver
> >> operations on the app provided buffer. Hence setting that on the WSI
> >> buffer is a whole new can of potential problems and hence I've said a
> >> submission based flag would be preferred.
> >>
> >> I can certainly try it out though.
> > Hm yeah that's the wrong flag. We need a flag on the drm_file which the
> > explicit userspace sets. And which is valid only for itself.
> >
> > There's a nice flags field when creating a ctx, but it's not validated and
> > there's already a comment that we have to filter out garbage priority, so
> > that's not use. I'll whip up something entirely untested just as a draft.
>
> We could provide an IOCTL for the BO to change the flag.

That's not the semantics we need.

> But could we first figure out the semantics we want to use here?
>
> Cause I'm pretty sure we don't actually need those changes at all and as
> said before I'm certainly NAKing things which break existing use cases.

Please read how other drivers do this and at least _try_ to understand
it. I'm really loosing my patience here with you NAKing patches you're
not even understanding (or did you actually read and fully understand
the entire story I typed up here, and your NAK is on the entire
thing?). There's not much useful conversation to be had with that
approach. And with drivers I mean kernel + userspace here.

That's the other frustration part: You're trying to fix this purely in
the kernel. This is exactly one of these issues why we require open
source userspace, so that we can fix the issues correctly across the
entire stack. And meanwhile you're steadfastily refusing to even look
at that the userspace side of the picture.

Also I thought through your tlb issue, why are you even putting these
tlb flush fences into the shard dma_resv slots? If you store them
somewhere else in the amdgpu private part, the oversync issues goes
away
- in your ttm bo move callback, you can just make your bo copy job
depend on them too (you have to anyway)
- even for p2p there's not an issue here, because you have the
->move_notify callback, and can then lift the tlb flush fences from
your private place to the shared slots so the exporter can see them.

The kernel move fences otoh are a bit more nasty to wring through the
p2p dma-buf interface. That one probably needs something new.
-Daniel

>
> Regard

Re: [Intel-gfx] [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-21 Thread Christian König

Am 21.05.21 um 17:16 schrieb Daniel Vetter:

On Fri, May 21, 2021 at 05:00:46PM +0200, Bas Nieuwenhuizen wrote:

On Fri, May 21, 2021 at 4:37 PM Daniel Vetter  wrote:

On Fri, May 21, 2021 at 11:46:23AM +0200, Bas Nieuwenhuizen wrote:

On Fri, May 21, 2021 at 11:10 AM Daniel Vetter  wrote:

---
  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 88a24a0b5691..cc8426e1e8a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -617,8 +617,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 amdgpu_bo_list_for_each_entry(e, p->bo_list) {
 struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);

-   /* Make sure we use the exclusive slot for shared BOs */
-   if (bo->prime_shared_count)
+   /* Make sure we use the exclusive slot for all potentially 
shared BOs */
+   if (!(bo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID))
 e->tv.num_shared = 0;

I think it also makes sense to skip this with
AMDGPU_GEM_CREATE_EXPLICIT_SYNC? It can be shared but I don't think
anyone expects implicit sync to happen with those.

Ah yes, I missed this entirely. So the "no implicit flag" is already
there, and the _only_ thing that's missing really is a way to fish out the
implicit fences, and set them.

https://lore.kernel.org/dri-devel/20210520190007.534046-1-ja...@jlekstrand.net/

So I think all that's really needed in radv is not setting
RADEON_FLAG_IMPLICIT_SYNC for winsys buffers when Jason's dma-buf ioctl
are present (means you need to do some import/export and keep the fd
around for winsys buffers, but shouldn't be too bad), and then control the
implicit fences entirely explicitly like vk expects.

That is the part I'm less sure about. This is a BO wide flag so we are
also disabling implicit sync in the compositor. If the compositor does
only do read stuff that is ok, as the inserted exclusive fence will
work for that. But as I learned recently the app provided buffer may
end up being written to by the X server which open a whole can of
potential problems if implicit sync gets disabled between Xserver
operations on the app provided buffer. Hence setting that on the WSI
buffer is a whole new can of potential problems and hence I've said a
submission based flag would be preferred.

I can certainly try it out though.

Hm yeah that's the wrong flag. We need a flag on the drm_file which the
explicit userspace sets. And which is valid only for itself.

There's a nice flags field when creating a ctx, but it's not validated and
there's already a comment that we have to filter out garbage priority, so
that's not use. I'll whip up something entirely untested just as a draft.


We could provide an IOCTL for the BO to change the flag.

But could we first figure out the semantics we want to use here?

Cause I'm pretty sure we don't actually need those changes at all and as 
said before I'm certainly NAKing things which break existing use cases.


Regards,
Christian.


-Daniel




Are you bored enough to type this up for radv? I'll give Jason's kernel
stuff another review meanwhile.
-Daniel


 e->bo_va = amdgpu_vm_bo_find(vm, bo);
 }
--
2.31.0


--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-21 Thread Daniel Vetter
On Fri, May 21, 2021 at 07:58:57AM -0700, Rob Clark wrote:
> On Fri, May 21, 2021 at 2:10 AM Daniel Vetter  wrote:
> >
> > - msm is mildly entertaining. It also supports MSM_SUBMIT_NO_IMPLICIT,
> >   but because it doesn't use the drm/scheduler it handles fences from
> >   the wrong context with a synchronous dma_fence_wait. See
> >   submit_fence_sync() leading to msm_gem_sync_object(). Investing into
> >   a scheduler might be a good idea.
> 
> Yeah, drm/scheduler is (along with a lot of other things) on the TODO
> list.  But this isn't quite as bad as it sounds because userspace uses
> a u_queue thread to call the submit ioctl rather than blocking the
> driver.  (It also offloads some other work from the driver thread,
> like submit merging to reduce # of ioctls.  Coincidentally that
> arrangement was a step towards preparing userspace for some
> hypothetical non-ioctl uapi ;-))

You're also holding a pile of locks, which I expect latest with
multi-engine buffer sharing will be pain. If you push this to the
scheduler then the locks aren't held. Or maybe I've misread the flow, it's
become all a bit a blurr after all these drivers :-)

> OTOH it would be good to move blocking until the system can free
> enough pages to repin bo's out of the ioctl path to better handle some
> memory pressure corner cases without having to be interruptable over a
> lot more of the submit path..  Running chrome+android on devices
> without a lot of memory is fun..

Uh that one needs the userspace thread. Or entirely different semantics of
your ioctl, because you're not allowed to allocate memory once any
dma_fence is visible. So offloading the entire pinning to a submit thread
is no-go.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-21 Thread Rob Clark
On Fri, May 21, 2021 at 2:10 AM Daniel Vetter  wrote:
>
> - msm is mildly entertaining. It also supports MSM_SUBMIT_NO_IMPLICIT,
>   but because it doesn't use the drm/scheduler it handles fences from
>   the wrong context with a synchronous dma_fence_wait. See
>   submit_fence_sync() leading to msm_gem_sync_object(). Investing into
>   a scheduler might be a good idea.

Yeah, drm/scheduler is (along with a lot of other things) on the TODO
list.  But this isn't quite as bad as it sounds because userspace uses
a u_queue thread to call the submit ioctl rather than blocking the
driver.  (It also offloads some other work from the driver thread,
like submit merging to reduce # of ioctls.  Coincidentally that
arrangement was a step towards preparing userspace for some
hypothetical non-ioctl uapi ;-))

OTOH it would be good to move blocking until the system can free
enough pages to repin bo's out of the ioctl path to better handle some
memory pressure corner cases without having to be interruptable over a
lot more of the submit path..  Running chrome+android on devices
without a lot of memory is fun..

BR,
-R
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx