Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Marek Olšák
On Wed., Apr. 28, 2021, 00:01 Jason Ekstrand,  wrote:

> On Tue, Apr 27, 2021 at 4:59 PM Marek Olšák  wrote:
> >
> > Jason, both memory-based signalling as well as interrupt-based
> signalling to the CPU would be supported by amdgpu. External devices don't
> need to support memory-based sync objects. The only limitation is that they
> can't convert amdgpu sync objects to dma_fence.
>
> Sure.  I'm not worried about the mechanism.  We just need a word that
> means "the new fence thing" and I've been throwing "memory fence"
> around for that.  Other mechanisms may work as well.
>
> > The sad thing is that "external -> amdgpu" dependencies are really
> "external <-> amdgpu" dependencies due to mutually-exclusive access
> required by non-explicitly-sync'd buffers, so amdgpu-amdgpu interop is the
> only interop that would initially work with those buffers. Explicitly
> sync'd buffers also won't work if other drivers convert explicit fences to
> dma_fence. Thus, both implicit sync and explicit sync might not work with
> other drivers at all. The only interop that would initially work is
> explicit fences with memory-based waiting and signalling on the external
> device to keep the kernel out of the picture.
>
> Yup.  This is where things get hard.  That said, I'm not quite ready
> to give up on memory/interrupt fences just yet.
>
> One thought that came to mind which might help would be if we added an
> extremely strict concept of memory ownership.  The idea would be that
> any given BO would be in one of two states at any given time:
>
>  1. legacy: dma_fences and implicit sync works as normal but it cannot
> be resident in any "modern" (direct submission, ULLS, whatever you
> want to call it) context
>
>  2. modern: In this mode they should not be used by any legacy
> context.  We can't strictly prevent this, unfortunately, but maybe we
> can say reading produces garbage and writes may be discarded.  In this
> mode, they can be bound to modern contexts.
>
> In theory, when in "modern" mode, you could bind the same buffer in
> multiple modern contexts at a time.  However, when that's the case, it
> makes ownership really tricky to track.  Therefore, we might want some
> sort of dma-buf create flag for "always modern" vs. "switchable" and
> only allow binding to one modern context at a time when it's
> switchable.
>
> If we did this, we may be able to move any dma_fence shenanigans to
> the ownership transition points.  We'd still need some sort of "wait
> for fence and transition" which has a timeout.  However, then we'd be
> fairly well guaranteed that the application (not just Mesa!) has
> really and truly decided it's done with the buffer and we wouldn't (I
> hope!) end up with the accidental edges in the dependency graph.
>
> Of course, I've not yet proven any of this correct so feel free to
> tell me why it won't work. :-)  It was just one of those "about to go
> to bed and had a thunk" type thoughts.
>

We'd like to keep userspace outside of Mesa drivers intact and working
except for interop where we don't have much choice. At the same time,
future hw may remove support for kernel queues, so we might not have much
choice there either, depending on what the hw interface will look like.

The idea is to have an ioctl for querying a timeline semaphore buffer
associated with a shared BO, and an ioctl for querying the next wait and
signal numbers (e.g. n and n+1) for that semaphore. Waiting for n would be
like a mutex lock and signaling n+1 would be like a mutex unlock. The next
process would use the same ioctl and get n+1 and n+2, and so on. There is a
potential deadlock, because one process can do lock A, lock B while another
does lock B, lock A; that can be prevented by having the ioctl return the
numbers for multiple buffers at once. This solution needs no changes to
userspace outside of Mesa drivers, and we'll also keep the BO wait ioctl
for GPU-CPU sync.
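To make the shape of that interface concrete, a purely illustrative sketch
(none of these struct names exist in amdgpu; they only mirror the idea above):

/*
 * Illustrative sketch only: amdgpu has no such ioctl, and every name
 * below is invented to mirror the idea described above.
 */
#include <stdint.h>

struct bo_sync_point {
	uint32_t bo_handle;     /* GEM handle of the shared BO */
	uint32_t pad;
	uint64_t sem_offset;    /* location of the BO's timeline semaphore */
	uint64_t wait_value;    /* n:   wait for this, like a mutex lock */
	uint64_t signal_value;  /* n+1: signal this, like a mutex unlock */
};

/*
 * The kernel hands out wait/signal points for the whole set of buffers
 * atomically. Because no two processes can be given interleaved orderings
 * for the same pair of buffers, the lock A/lock B vs. lock B/lock A
 * deadlock described above cannot happen.
 */
struct bo_sync_points_query {
	uint64_t points_ptr;    /* user pointer to struct bo_sync_point[] */
	uint32_t num_points;
	uint32_t pad;
};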

Marek


> --Jason
>
> P.S.  Daniel was 100% right when he said this discussion needs a glossary.
>
>
> > Marek
> >
> >
> > On Tue, Apr 27, 2021 at 3:41 PM Jason Ekstrand 
> wrote:
> >>
> >> Trying to figure out which e-mail in this mess is the right one to
> reply to
> >>
> >> On Tue, Apr 27, 2021 at 12:31 PM Lucas Stach 
> wrote:
> >> >
> >> > Hi,
> >> >
> >> > Am Dienstag, dem 27.04.2021 um 09:26 -0400 schrieb Marek Olšák:
> >> > > Ok. So that would only make the following use cases broken for now:
> >> > > - amd render -> external gpu
> >>
> >> Assuming said external GPU doesn't support memory fences.  If we do
> >> amdgpu and i915 at the same time, that covers basically most of the
> >> external GPU use-cases.  Of course, we'd want to convert nouveau as
> >> well for the rest.
> >>
> >> > > - amd video encode -> network device
> >> >
> >> > FWIW, "only" breaking amd render -> external gpu will make us pretty
> >> > unhappy, as we have some cases where we are combining an AMD APU with
> a
> >> > FPGA based graphics card. I can't go into 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Jason Ekstrand
On Tue, Apr 27, 2021 at 4:59 PM Marek Olšák  wrote:
>
> Jason, both memory-based signalling as well as interrupt-based signalling to 
> the CPU would be supported by amdgpu. External devices don't need to support 
> memory-based sync objects. The only limitation is that they can't convert 
> amdgpu sync objects to dma_fence.

Sure.  I'm not worried about the mechanism.  We just need a word that
means "the new fence thing" and I've been throwing "memory fence"
around for that.  Other mechanisms may work as well.

> The sad thing is that "external -> amdgpu" dependencies are really "external 
> <-> amdgpu" dependencies due to mutually-exclusive access required by 
> non-explicitly-sync'd buffers, so amdgpu-amdgpu interop is the only interop 
> that would initially work with those buffers. Explicitly sync'd buffers also 
> won't work if other drivers convert explicit fences to dma_fence. Thus, both 
> implicit sync and explicit sync might not work with other drivers at all. The 
> only interop that would initially work is explicit fences with memory-based 
> waiting and signalling on the external device to keep the kernel out of the 
> picture.

Yup.  This is where things get hard.  That said, I'm not quite ready
to give up on memory/interrupt fences just yet.

One thought that came to mind which might help would be if we added an
extremely strict concept of memory ownership.  The idea would be that
any given BO would be in one of two states at any given time:

 1. legacy: dma_fences and implicit sync work as normal, but the BO cannot
be resident in any "modern" (direct submission, ULLS, whatever you
want to call it) context

 2. modern: In this mode they should not be used by any legacy
context.  We can't strictly prevent this, unfortunately, but maybe we
can say reading produces garbage and writes may be discarded.  In this
mode, they can be bound to modern contexts.

In theory, when in "modern" mode, you could bind the same buffer in
multiple modern contexts at a time.  However, when that's the case, it
makes ownership really tricky to track.  Therefore, we might want some
sort of dma-buf create flag for "always modern" vs. "switchable" and
only allow binding to one modern context at a time when it's
switchable.

If we did this, we may be able to move any dma_fence shenanigans to
the ownership transition points.  We'd still need some sort of "wait
for fence and transition" which has a timeout.  However, then we'd be
fairly well guaranteed that the application (not just Mesa!) has
really and truly decided it's done with the buffer and we wouldn't (I
hope!) end up with the accidental edges in the dependency graph.
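To make that concrete, a minimal sketch of what such a two-state ownership
interface could look like; nothing like this exists in dma-buf today and
all names are hypothetical:

/*
 * Hypothetical sketch of the legacy/modern ownership idea above.
 * None of these names exist in the dma-buf API.
 */
#include <stdint.h>

enum buf_ownership_state {
	BUF_OWNER_LEGACY = 0,  /* dma_fence / implicit sync work as usual    */
	BUF_OWNER_MODERN = 1,  /* bound to direct-submission (ULLS) contexts */
};

struct buf_ownership_transition {
	uint32_t target_state;  /* enum buf_ownership_state */
	uint32_t flags;         /* e.g. a hypothetical "always modern" create
	                         * flag would reject transitions entirely    */
	/*
	 * "Wait for fence and transition": before flipping the state, the
	 * kernel waits for outstanding dma_fences (legacy -> modern) or for
	 * the modern context to go idle (modern -> legacy), bounded by this
	 * timeout so a hung context cannot block the transition forever.
	 */
	int64_t timeout_ns;
};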

Of course, I've not yet proven any of this correct so feel free to
tell me why it won't work. :-)  It was just one of those "about to go
to bed and had a thunk" type thoughts.

--Jason

P.S.  Daniel was 100% right when he said this discussion needs a glossary.


> Marek
>
>
> On Tue, Apr 27, 2021 at 3:41 PM Jason Ekstrand  wrote:
>>
>> Trying to figure out which e-mail in this mess is the right one to reply 
>> to
>>
>> On Tue, Apr 27, 2021 at 12:31 PM Lucas Stach  wrote:
>> >
>> > Hi,
>> >
>> > Am Dienstag, dem 27.04.2021 um 09:26 -0400 schrieb Marek Olšák:
>> > > Ok. So that would only make the following use cases broken for now:
>> > > - amd render -> external gpu
>>
>> Assuming said external GPU doesn't support memory fences.  If we do
>> amdgpu and i915 at the same time, that covers basically most of the
>> external GPU use-cases.  Of course, we'd want to convert nouveau as
>> well for the rest.
>>
>> > > - amd video encode -> network device
>> >
>> > FWIW, "only" breaking amd render -> external gpu will make us pretty
>> > unhappy, as we have some cases where we are combining an AMD APU with a
>> > FPGA based graphics card. I can't go into the specifics of this use-
>> > case too much but basically the AMD graphics is rendering content that
>> > gets composited on top of a live video pipeline running through the
>> > FPGA.
>>
>> I think it's worth taking a step back and asking what's being here
>> before we freak out too much.  If we do go this route, it doesn't mean
>> that your FPGA use-case can't work, it just means it won't work
>> out-of-the box anymore.  You'll have to separate execution and memory
>> dependencies inside your FPGA driver.  That's still not great but it's
>> not as bad as you maybe made it sound.
>>
>> > > What about the case when we get a buffer from an external device and
>> > > we're supposed to make it "busy" when we are using it, and the
>> > > external device wants to wait until we stop using it? Is it something
>> > > that can happen, thus turning "external -> amd" into "external <->
>> > > amd"?
>> >
>> > Zero-copy texture sampling from a video input certainly appreciates
>> > this very much. Trying to pass the render fence through the various
>> > layers of userspace to be able to tell when the video input can reuse a
>> > buffer is a great experience in yak 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Marek Olšák
Jason, both memory-based signalling as well as interrupt-based signalling
to the CPU would be supported by amdgpu. External devices don't need to
support memory-based sync objects. The only limitation is that they can't
convert amdgpu sync objects to dma_fence.

The sad thing is that "external -> amdgpu" dependencies are really
"external <-> amdgpu" dependencies due to mutually-exclusive access
required by non-explicitly-sync'd buffers, so amdgpu-amdgpu interop is the
only interop that would initially work with those buffers. Explicitly
sync'd buffers also won't work if other drivers convert explicit fences to
dma_fence. Thus, both implicit sync and explicit sync might not work with
other drivers at all. The only interop that would initially work is
explicit fences with memory-based waiting and signalling on the external
device to keep the kernel out of the picture.
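To illustrate what "memory-based waiting and signalling" means here, a
minimal userspace-style sketch (not amdgpu's actual mechanism; a real
implementation would use doorbells/interrupts rather than spinning):

/*
 * Minimal sketch of a "memory fence": a 64-bit monotonically increasing
 * value in memory visible to both devices. The producer writes the new
 * value when work completes; the consumer waits for it without any
 * kernel involvement, which is the point made above.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct memory_fence {
	_Atomic uint64_t value;   /* lives in memory shared between devices */
};

static inline void memfence_signal(struct memory_fence *f, uint64_t point)
{
	atomic_store_explicit(&f->value, point, memory_order_release);
}

static inline bool memfence_wait(struct memory_fence *f, uint64_t point,
				 uint64_t spin_limit)
{
	for (uint64_t i = 0; i < spin_limit; i++) {
		if (atomic_load_explicit(&f->value, memory_order_acquire) >= point)
			return true;
	}
	return false;   /* caller decides how to escalate (sleep, timeout, ...) */
}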

Marek


On Tue, Apr 27, 2021 at 3:41 PM Jason Ekstrand  wrote:

> Trying to figure out which e-mail in this mess is the right one to reply
> to
>
> On Tue, Apr 27, 2021 at 12:31 PM Lucas Stach 
> wrote:
> >
> > Hi,
> >
> > Am Dienstag, dem 27.04.2021 um 09:26 -0400 schrieb Marek Olšák:
> > > Ok. So that would only make the following use cases broken for now:
> > > - amd render -> external gpu
>
> Assuming said external GPU doesn't support memory fences.  If we do
> amdgpu and i915 at the same time, that covers basically most of the
> external GPU use-cases.  Of course, we'd want to convert nouveau as
> well for the rest.
>
> > > - amd video encode -> network device
> >
> > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > unhappy, as we have some cases where we are combining an AMD APU with a
> > FPGA based graphics card. I can't go into the specifics of this use-
> > case too much but basically the AMD graphics is rendering content that
> > gets composited on top of a live video pipeline running through the
> > FPGA.
>
> I think it's worth taking a step back and asking what's being here
> before we freak out too much.  If we do go this route, it doesn't mean
> that your FPGA use-case can't work, it just means it won't work
> out-of-the box anymore.  You'll have to separate execution and memory
> dependencies inside your FPGA driver.  That's still not great but it's
> not as bad as you maybe made it sound.
>
> > > What about the case when we get a buffer from an external device and
> > > we're supposed to make it "busy" when we are using it, and the
> > > external device wants to wait until we stop using it? Is it something
> > > that can happen, thus turning "external -> amd" into "external <->
> > > amd"?
> >
> > Zero-copy texture sampling from a video input certainly appreciates
> > this very much. Trying to pass the render fence through the various
> > layers of userspace to be able to tell when the video input can reuse a
> > buffer is a great experience in yak shaving. Allowing the video input
> > to reuse the buffer as soon as the read dma_fence from the GPU is
> > signaled is much more straight forward.
>
> Oh, it's definitely worse than that.  Every window system interaction
> is bi-directional.  The X server has to wait on the client before
> compositing from it and the client has to wait on X before re-using
> that back-buffer.  Of course, we can break that later dependency by
> doing a full CPU wait but that's going to mean either more latency or
> reserving more back buffers.  There's no good clean way to claim that
> any of this is one-directional.
>
> --Jason
>


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Jason Ekstrand
On Tue, Apr 27, 2021 at 1:38 PM Dave Airlie  wrote:
>
> On Tue, 27 Apr 2021 at 22:06, Christian König
>  wrote:
> >
> > Correct, we wouldn't have synchronization between device with and without 
> > user queues any more.
> >
> > That could only be a problem for A+I Laptops.
>
> Since I think you mentioned you'd only be enabling this on newer
> chipsets, won't it be a problem for A+A where one A is a generation
> behind the other?
>
> I'm not really liking where this is going btw, seems like a ill
> thought out concept, if AMD is really going down the road of designing
> hw that is currently Linux incompatible, you are going to have to
> accept a big part of the burden in bringing this support in to more
> than just amd drivers for upcoming generations of gpu.

In case my previous e-mail sounded too enthusiastic, I'm also pensive
about this direction.  I'm not sure I'm ready to totally give up on
all of Linux WSI just yet.  We definitely want to head towards memory
fences and direct submission but I'm not convinced that throwing out
all of interop is necessary.  It's certainly a very big hammer and we
should try to figure out something less destructive, if that's
possible.  (I don't know for sure that it is.)

--Jason


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Jason Ekstrand
Trying to figure out which e-mail in this mess is the right one to reply to

On Tue, Apr 27, 2021 at 12:31 PM Lucas Stach  wrote:
>
> Hi,
>
> Am Dienstag, dem 27.04.2021 um 09:26 -0400 schrieb Marek Olšák:
> > Ok. So that would only make the following use cases broken for now:
> > - amd render -> external gpu

Assuming said external GPU doesn't support memory fences.  If we do
amdgpu and i915 at the same time, that covers basically most of the
external GPU use-cases.  Of course, we'd want to convert nouveau as
well for the rest.

> > - amd video encode -> network device
>
> FWIW, "only" breaking amd render -> external gpu will make us pretty
> unhappy, as we have some cases where we are combining an AMD APU with a
> FPGA based graphics card. I can't go into the specifics of this use-
> case too much but basically the AMD graphics is rendering content that
> gets composited on top of a live video pipeline running through the
> FPGA.

I think it's worth taking a step back and asking what's actually being
broken here before we freak out too much.  If we do go this route, it
doesn't mean that your FPGA use-case can't work, it just means it won't
work out of the box anymore.  You'll have to separate execution and memory
dependencies inside your FPGA driver.  That's still not great but it's
not as bad as you maybe made it sound.

> > What about the case when we get a buffer from an external device and
> > we're supposed to make it "busy" when we are using it, and the
> > external device wants to wait until we stop using it? Is it something
> > that can happen, thus turning "external -> amd" into "external <->
> > amd"?
>
> Zero-copy texture sampling from a video input certainly appreciates
> this very much. Trying to pass the render fence through the various
> layers of userspace to be able to tell when the video input can reuse a
> buffer is a great experience in yak shaving. Allowing the video input
> to reuse the buffer as soon as the read dma_fence from the GPU is
> signaled is much more straight forward.

Oh, it's definitely worse than that.  Every window system interaction
is bi-directional.  The X server has to wait on the client before
compositing from it and the client has to wait on X before re-using
that back-buffer.  Of course, we can break that later dependency by
doing a full CPU wait but that's going to mean either more latency or
reserving more back buffers.  There's no good clean way to claim that
any of this is one-directional.

--Jason


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Marek Olšák
Supporting interop with any device is always possible; it's a question of
which drivers we need to interoperate with and of updating them. We've
already found the path forward for amdgpu. We just need to find out how
many other drivers need to be updated and evaluate the cost/benefit.

Marek

On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie  wrote:

> On Tue, 27 Apr 2021 at 22:06, Christian König
>  wrote:
> >
> > Correct, we wouldn't have synchronization between device with and
> without user queues any more.
> >
> > That could only be a problem for A+I Laptops.
>
> Since I think you mentioned you'd only be enabling this on newer
> chipsets, won't it be a problem for A+A where one A is a generation
> behind the other?
>
> I'm not really liking where this is going btw, seems like a ill
> thought out concept, if AMD is really going down the road of designing
> hw that is currently Linux incompatible, you are going to have to
> accept a big part of the burden in bringing this support in to more
> than just amd drivers for upcoming generations of gpu.
>
> Dave.
>


Re: [Mesa-dev] One more thing to cut from the main branch...

2021-04-27 Thread Jason Ekstrand
Maybe, maybe not... I mean, normally I'd be all for it but...

https://rosenzweig.io/blog/asahi-gpu-part-3.html

--Jason

On Tue, Apr 27, 2021 at 12:32 PM Ian Romanick  wrote:
>
> If we're going to cut all the classic drivers and a handful of older
> Gallium drivers... can we also cut Apple GLX?  Apple comes around every
> couple years to fix breakages that have crept in, and we periodically
> have compile breaks that need fixing (see
> https://gitlab.freedesktop.org/mesa/mesa/-/issues/4702).  As far as I
> can tell, having it in the main branch provides zero value to anyone...
> including Apple.


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Dave Airlie
On Tue, 27 Apr 2021 at 22:06, Christian König
 wrote:
>
> Correct, we wouldn't have synchronization between device with and without 
> user queues any more.
>
> That could only be a problem for A+I Laptops.

Since I think you mentioned you'd only be enabling this on newer
chipsets, won't it be a problem for A+A where one A is a generation
behind the other?

I'm not really liking where this is going, btw; it seems like an
ill-thought-out concept. If AMD is really going down the road of designing
hw that is currently Linux-incompatible, you are going to have to accept a
big part of the burden of bringing this support into more than just amd
drivers for upcoming generations of GPUs.

Dave.


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Simon Ser
On Tuesday, April 27th, 2021 at 8:01 PM, Alex Deucher  
wrote:

> It's an upcoming requirement for windows[1], so you are likely to
> start seeing this across all GPU vendors that support windows. I
> think the timing depends on how quickly the legacy hardware support
> sticks around for each vendor.

Hm, okay.

Will using the existing explicit synchronization APIs make it work
properly? (e.g. IN_FENCE_FD + OUT_FENCE_PTR in KMS, EGL_KHR_fence_sync +
EGL_ANDROID_native_fence_sync + EGL_KHR_wait_sync in EGL)
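For reference, this is roughly how that existing KMS explicit-sync path is
wired up with libdrm's atomic API, assuming the IN_FENCE_FD (plane) and
OUT_FENCE_PTR (CRTC) property IDs have already been looked up and with
error handling omitted:

#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

int commit_with_fences(int drm_fd, uint32_t plane_id, uint32_t crtc_id,
                       uint32_t in_fence_fd_prop, uint32_t out_fence_ptr_prop,
                       int render_done_fd, int *flip_done_fd)
{
	drmModeAtomicReq *req = drmModeAtomicAlloc();
	int64_t out_fence = -1;
	int ret;

	/* KMS waits for this fence before scanning out the new plane contents. */
	drmModeAtomicAddProperty(req, plane_id, in_fence_fd_prop, render_done_fd);

	/* KMS hands back a fence fd that signals once the flip has completed. */
	drmModeAtomicAddProperty(req, crtc_id, out_fence_ptr_prop,
	                         (uint64_t)(uintptr_t)&out_fence);

	ret = drmModeAtomicCommit(drm_fd, req, DRM_MODE_ATOMIC_NONBLOCK, NULL);
	drmModeAtomicFree(req);

	*flip_done_fd = (int)out_fence;
	return ret;
}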


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Alex Deucher
On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:
>
> On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach  
> wrote:
>
> > > Ok. So that would only make the following use cases broken for now:
> > >
> > > - amd render -> external gpu
> > > - amd video encode -> network device
> >
> > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > unhappy
>
> I concur. I have quite a few users with a multi-GPU setup involving
> AMD hardware.
>
> Note, if this brokenness can't be avoided, I'd prefer a to get a clear
> error, and not bad results on screen because nothing is synchronized
> anymore.

It's an upcoming requirement for Windows[1], so you are likely to
start seeing this across all GPU vendors that support Windows.  I
think the timing depends on how long legacy hardware support
sticks around for each vendor.

Alex


[1] - 
https://devblogs.microsoft.com/directx/hardware-accelerated-gpu-scheduling/




Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Simon Ser
On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach  
wrote:

> > Ok. So that would only make the following use cases broken for now:
> >
> > - amd render -> external gpu
> > - amd video encode -> network device
>
> FWIW, "only" breaking amd render -> external gpu will make us pretty
> unhappy

I concur. I have quite a few users with a multi-GPU setup involving
AMD hardware.

Note: if this brokenness can't be avoided, I'd prefer to get a clear
error rather than bad results on screen because nothing is synchronized
anymore.


[Mesa-dev] One more thing to cut from the main branch...

2021-04-27 Thread Ian Romanick
If we're going to cut all the classic drivers and a handful of older
Gallium drivers... can we also cut Apple GLX?  Apple comes around every
couple years to fix breakages that have crept in, and we periodically
have compile breaks that need fixing (see
https://gitlab.freedesktop.org/mesa/mesa/-/issues/4702).  As far as I
can tell, having it in the main branch provides zero value to anyone...
including Apple.


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Lucas Stach
Hi,

On Tuesday, 2021-04-27 at 09:26 -0400, Marek Olšák wrote:
> Ok. So that would only make the following use cases broken for now:
> - amd render -> external gpu
> - amd video encode -> network device

FWIW, "only" breaking amd render -> external gpu will make us pretty
unhappy, as we have some cases where we are combining an AMD APU with a
FPGA based graphics card. I can't go into the specifics of this use-
case too much but basically the AMD graphics is rendering content that
gets composited on top of a live video pipeline running through the
FPGA.

> What about the case when we get a buffer from an external device and
> we're supposed to make it "busy" when we are using it, and the
> external device wants to wait until we stop using it? Is it something
> that can happen, thus turning "external -> amd" into "external <->
> amd"?

Zero-copy texture sampling from a video input certainly appreciates
this very much. Trying to pass the render fence through the various
layers of userspace to be able to tell when the video input can reuse a
buffer is a great experience in yak shaving. Allowing the video input
to reuse the buffer as soon as the read dma_fence from the GPU is
signaled is much more straightforward.
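For illustration, a kernel-side sketch (not taken from any real capture
driver) of that straightforward path: before recycling the shared buffer,
the video-input driver just waits on the dma-buf's reservation object
(kernel API as of ~5.12; the 100 ms timeout is an arbitrary choice):

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <linux/errno.h>
#include <linux/jiffies.h>

static int wait_for_gpu_readers(struct dma_buf *dmabuf)
{
	/* Wait for all fences (including shared/read ones), interruptibly,
	 * with a timeout so a hung GPU cannot stall video capture forever. */
	long ret = dma_resv_wait_timeout_rcu(dmabuf->resv,
					     true /* wait_all */,
					     true /* intr */,
					     msecs_to_jiffies(100));

	if (ret == 0)
		return -ETIMEDOUT;
	return ret < 0 ? ret : 0;
}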

Regards,
Lucas

> Marek
> 
> On Tue., Apr. 27, 2021, 08:50 Christian König, < 
> ckoenig.leichtzumer...@gmail.com> wrote:
> >  Only amd -> external.
> >  
> >  We can easily install something in an user queue which waits for a
> > dma_fence in the kernel.
> >  
> >  But we can't easily wait for an user queue as dependency of a
> > dma_fence.
> >  
> >  The good thing is we have this wait before signal case on Vulkan
> > timeline semaphores which have the same problem in the kernel.
> >  
> >  The good news is I think we can relatively easily convert i915 and
> > older amdgpu device to something which is compatible with user
> > fences.
> >  
> >  So yes, getting that fixed case by case should work.
> >  
> >  Christian
> >  
> > Am 27.04.21 um 14:46 schrieb Marek Olšák:
> >  
> > > I'll defer to Christian and Alex to decide whether dropping sync
> > > with non-amd devices (GPUs, cameras etc.) is acceptable.
> > > 
> > > Rewriting those drivers to this new sync model could be done on a
> > > case by case basis.
> > > 
> > > For now, would we only lose the "amd -> external" dependency? Or
> > > the "external -> amd" dependency too?
> > > 
> > > Marek
> > > 
> > > On Tue., Apr. 27, 2021, 08:15 Daniel Vetter, 
> > > wrote:
> > >  
> > > > On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák 
> > > > wrote:
> > > >  > Ok. I'll interpret this as "yes, it will work, let's do it".
> > > >  
> > > >  It works if all you care about is drm/amdgpu. I'm not sure
> > > > that's a
> > > >  reasonable approach for upstream, but it definitely is an
> > > > approach :-)
> > > >  
> > > >  We've already gone somewhat through the pain of drm/amdgpu
> > > > redefining
> > > >  how implicit sync works without sufficiently talking with
> > > > other
> > > >  people, maybe we should avoid a repeat of this ...
> > > >  -Daniel
> > > >  
> > > >  >
> > > >  > Marek
> > > >  >
> > > >  > On Tue., Apr. 27, 2021, 08:06 Christian König,
> > > >  wrote:
> > > >  >>
> > > >  >> Correct, we wouldn't have synchronization between device
> > > > with
> > > > and without user queues any more.
> > > >  >>
> > > >  >> That could only be a problem for A+I Laptops.
> > > >  >>
> > > >  >> Memory management will just work with preemption fences
> > > > which
> > > > pause the user queues of a process before evicting something.
> > > > That will be a dma_fence, but also a well known approach.
> > > >  >>
> > > >  >> Christian.
> > > >  >>
> > > >  >> Am 27.04.21 um 13:49 schrieb Marek Olšák:
> > > >  >>
> > > >  >> If we don't use future fences for DMA fences at all, e.g.
> > > > we
> > > > don't use them for memory management, it can work, right?
> > > > Memory
> > > > management can suspend user queues anytime. It doesn't need to
> > > > use DMA fences. There might be something that I'm missing here.
> > > >  >>
> > > >  >> What would we lose without DMA fences? Just inter-device
> > > > synchronization? I think that might be acceptable.
> > > >  >>
> > > >  >> The only case when the kernel will wait on a future fence
> > > > is
> > > > before a page flip. Everything today already depends on
> > > > userspace
> > > > not hanging the gpu, which makes everything a future fence.
> > > >  >>
> > > >  >> Marek
> > > >  >>
> > > >  >> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,
> > > >  wrote:
> > > >  >>>
> > > >  >>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák
> > > > wrote:
> > > >  >>> > Thanks everybody. The initial proposal is dead. Here are
> > > > some thoughts on
> > > >  >>> > how to do it differently.
> > > >  >>> >
> > > >  >>> > I think we can have direct command submission from
> > > > userspace via
> > > >  >>> > memory-mapped queues ("user queues") without changing
> > > > window systems.
> > > >  >>> >
> > > >  >>> > 

Re: [Mesa-dev] Trying to build a opencl dev env

2021-04-27 Thread Jan Vesely
On Tue, Apr 27, 2021 at 7:50 AM Luke A. Guest  wrote:

>
>
> On 27/04/2021 08:00, Pierre Moreau wrote:
> > Hello Luke,
> >
> > If you set `PKG_CONFIG_PATH=$PATH_TO_LIBCLC_INSTALL/share/pkgconfig` when
> > running meson, it should pick that version instead of the system one.
> >
> > I run it as `PKG_CONFIG_PATH=[…] meson setup […]`; it might also be
> possible to
> > pass it as an argument instead, I do not know.
>
> Thanks for that. It's because I had that var set to
> /lib/pkg-config, libclc installs to /share/pkg-config
> for some unknown reason.
>

That is intentional. As a machine-independent library, libclc shouldn't be
in $PREFIX/lib/.

Jan




Re: [Mesa-dev] Trying to build a opencl dev env

2021-04-27 Thread Luke A. Guest



On 27/04/2021 08:00, Pierre Moreau wrote:

Hello Luke,

If you set `PKG_CONFIG_PATH=$PATH_TO_LIBCLC_INSTALL/share/pkgconfig` when
running meson, it should pick that version instead of the system one.

I run it as `PKG_CONFIG_PATH=[…] meson setup […]`; it might also be possible to
pass it as an argument instead, I do not know.


Thanks for that. It's because I had that var set to 
/lib/pkg-config; libclc installs to /share/pkg-config 
for some unknown reason.



Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Christian König
Uff good question. DMA-buf certainly supports that use case, but I have 
no idea if that is actually used somewhere.


Daniel do you know any case?

Christian.

On 2021-04-27 at 15:26, Marek Olšák wrote:

Ok. So that would only make the following use cases broken for now:
- amd render -> external gpu
- amd video encode -> network device

What about the case when we get a buffer from an external device and 
we're supposed to make it "busy" when we are using it, and the 
external device wants to wait until we stop using it? Is it something 
that can happen, thus turning "external -> amd" into "external <-> amd"?


Marek

On Tue., Apr. 27, 2021, 08:50 Christian König wrote:


Only amd -> external.

We can easily install something in an user queue which waits for a
dma_fence in the kernel.

But we can't easily wait for an user queue as dependency of a
dma_fence.

The good thing is we have this wait before signal case on Vulkan
timeline semaphores which have the same problem in the kernel.

The good news is I think we can relatively easily convert i915 and
older amdgpu device to something which is compatible with user fences.

So yes, getting that fixed case by case should work.

Christian

Am 27.04.21 um 14:46 schrieb Marek Olšák:

I'll defer to Christian and Alex to decide whether dropping sync
with non-amd devices (GPUs, cameras etc.) is acceptable.

Rewriting those drivers to this new sync model could be done on a
case by case basis.

For now, would we only lose the "amd -> external" dependency? Or
the "external -> amd" dependency too?

Marek

On Tue., Apr. 27, 2021, 08:15 Daniel Vetter, <dan...@ffwll.ch> wrote:

On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák <mar...@gmail.com> wrote:
> Ok. I'll interpret this as "yes, it will work, let's do it".

It works if all you care about is drm/amdgpu. I'm not sure
that's a
reasonable approach for upstream, but it definitely is an
approach :-)

We've already gone somewhat through the pain of drm/amdgpu
redefining
how implicit sync works without sufficiently talking with other
people, maybe we should avoid a repeat of this ...
-Daniel

>
> Marek
>
> On Tue., Apr. 27, 2021, 08:06 Christian König,
> <ckoenig.leichtzumer...@gmail.com> wrote:
>>
>> Correct, we wouldn't have synchronization between device
with and without user queues any more.
>>
>> That could only be a problem for A+I Laptops.
>>
>> Memory management will just work with preemption fences
which pause the user queues of a process before evicting
something. That will be a dma_fence, but also a well known
approach.
>>
>> Christian.
>>
>> Am 27.04.21 um 13:49 schrieb Marek Olšák:
>>
>> If we don't use future fences for DMA fences at all, e.g.
we don't use them for memory management, it can work, right?
Memory management can suspend user queues anytime. It doesn't
need to use DMA fences. There might be something that I'm
missing here.
>>
>> What would we lose without DMA fences? Just inter-device
synchronization? I think that might be acceptable.
>>
>> The only case when the kernel will wait on a future fence
is before a page flip. Everything today already depends on
userspace not hanging the gpu, which makes everything a
future fence.
>>
>> Marek
>>
> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,
> <dan...@ffwll.ch> wrote:
>>>
>>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
>>> > Thanks everybody. The initial proposal is dead. Here
are some thoughts on
>>> > how to do it differently.
>>> >
>>> > I think we can have direct command submission from
userspace via
>>> > memory-mapped queues ("user queues") without changing
window systems.
>>> >
>>> > The memory management doesn't have to use GPU page
faults like HMM.
>>> > Instead, it can wait for user queues of a specific
process to go idle and
>>> > then unmap the queues, so that userspace can't submit
anything. Buffer
>>> > evictions, pinning, etc. can be executed when all
queues are unmapped
>>> > (suspended). Thus, no BO fences and page faults are needed.
>>> >
>>> > Inter-process synchronization can use timeline
semaphores. Userspace will
>>> > query the wait and signal value for a shared buffer
from the kernel. The
>>> > kernel will keep a history of those queries to know
which process is
>>> > 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Marek Olšák
Ok. So that would only make the following use cases broken for now:
- amd render -> external gpu
- amd video encode -> network device

What about the case when we get a buffer from an external device and we're
supposed to make it "busy" when we are using it, and the external device
wants to wait until we stop using it? Is it something that can happen, thus
turning "external -> amd" into "external <-> amd"?

Marek

On Tue., Apr. 27, 2021, 08:50 Christian König, <
ckoenig.leichtzumer...@gmail.com> wrote:

> Only amd -> external.
>
> We can easily install something in an user queue which waits for a
> dma_fence in the kernel.
>
> But we can't easily wait for an user queue as dependency of a dma_fence.
>
> The good thing is we have this wait before signal case on Vulkan timeline
> semaphores which have the same problem in the kernel.
>
> The good news is I think we can relatively easily convert i915 and older
> amdgpu device to something which is compatible with user fences.
>
> So yes, getting that fixed case by case should work.
>
> Christian
>
> Am 27.04.21 um 14:46 schrieb Marek Olšák:
>
> I'll defer to Christian and Alex to decide whether dropping sync with
> non-amd devices (GPUs, cameras etc.) is acceptable.
>
> Rewriting those drivers to this new sync model could be done on a case by
> case basis.
>
> For now, would we only lose the "amd -> external" dependency? Or the
> "external -> amd" dependency too?
>
> Marek
>
> On Tue., Apr. 27, 2021, 08:15 Daniel Vetter,  wrote:
>
>> On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák  wrote:
>> > Ok. I'll interpret this as "yes, it will work, let's do it".
>>
>> It works if all you care about is drm/amdgpu. I'm not sure that's a
>> reasonable approach for upstream, but it definitely is an approach :-)
>>
>> We've already gone somewhat through the pain of drm/amdgpu redefining
>> how implicit sync works without sufficiently talking with other
>> people, maybe we should avoid a repeat of this ...
>> -Daniel
>>
>> >
>> > Marek
>> >
>> > On Tue., Apr. 27, 2021, 08:06 Christian König, <
>> ckoenig.leichtzumer...@gmail.com> wrote:
>> >>
>> >> Correct, we wouldn't have synchronization between device with and
>> without user queues any more.
>> >>
>> >> That could only be a problem for A+I Laptops.
>> >>
>> >> Memory management will just work with preemption fences which pause
>> the user queues of a process before evicting something. That will be a
>> dma_fence, but also a well known approach.
>> >>
>> >> Christian.
>> >>
>> >> Am 27.04.21 um 13:49 schrieb Marek Olšák:
>> >>
>> >> If we don't use future fences for DMA fences at all, e.g. we don't use
>> them for memory management, it can work, right? Memory management can
>> suspend user queues anytime. It doesn't need to use DMA fences. There might
>> be something that I'm missing here.
>> >>
>> >> What would we lose without DMA fences? Just inter-device
>> synchronization? I think that might be acceptable.
>> >>
>> >> The only case when the kernel will wait on a future fence is before a
>> page flip. Everything today already depends on userspace not hanging the
>> gpu, which makes everything a future fence.
>> >>
>> >> Marek
>> >>
>> >> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:
>> >>>
>> >>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
>> >>> > Thanks everybody. The initial proposal is dead. Here are some
>> thoughts on
>> >>> > how to do it differently.
>> >>> >
>> >>> > I think we can have direct command submission from userspace via
>> >>> > memory-mapped queues ("user queues") without changing window
>> systems.
>> >>> >
>> >>> > The memory management doesn't have to use GPU page faults like HMM.
>> >>> > Instead, it can wait for user queues of a specific process to go
>> idle and
>> >>> > then unmap the queues, so that userspace can't submit anything.
>> Buffer
>> >>> > evictions, pinning, etc. can be executed when all queues are
>> unmapped
>> >>> > (suspended). Thus, no BO fences and page faults are needed.
>> >>> >
>> >>> > Inter-process synchronization can use timeline semaphores.
>> Userspace will
>> >>> > query the wait and signal value for a shared buffer from the
>> kernel. The
>> >>> > kernel will keep a history of those queries to know which process is
>> >>> > responsible for signalling which buffer. There is only the
>> wait-timeout
>> >>> > issue and how to identify the culprit. One of the solutions is to
>> have the
>> >>> > GPU send all GPU signal commands and all timed out wait commands
>> via an
>> >>> > interrupt to the kernel driver to monitor and validate userspace
>> behavior.
>> >>> > With that, it can be identified whether the culprit is the waiting
>> process
>> >>> > or the signalling process and which one. Invalid signal/wait
>> parameters can
>> >>> > also be detected. The kernel can force-signal only the semaphores
>> that time
>> >>> > out, and punish the processes which caused the timeout or used
>> invalid
>> >>> > signal/wait parameters.
>> >>> >
>> >>> > The 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Christian König

Only amd -> external.

We can easily install something in a user queue which waits for a
dma_fence in the kernel.

But we can't easily wait for a user queue as a dependency of a dma_fence.
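As a rough illustration of the easy direction (field names and layout are
invented for this sketch; this is not the real amdgpu packet format):

/*
 * Illustration only: a WAIT_REG_MEM-style packet a user-mode driver could
 * emit so the user queue blocks until the kernel releases it.
 */
#include <stdint.h>

struct uq_mem_wait_packet {
	uint32_t opcode;      /* hypothetical "wait on memory" opcode        */
	uint32_t compare_op;  /* e.g. unsigned greater-or-equal              */
	uint64_t gpu_addr;    /* location the kernel writes on fence signal  */
	uint64_t reference;   /* value to wait for                           */
};

/*
 * Kernel side, conceptually: dma_fence_add_callback() on the dma_fence,
 * and the callback writes `reference` to `gpu_addr`, unblocking the user
 * queue. The reverse direction -- turning a user-queue wait into a
 * dma_fence dependency -- is the hard part, because that dma_fence could
 * then block indefinitely on untrusted userspace.
 */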

The good thing is we have this wait-before-signal case on Vulkan
timeline semaphores, which have the same problem in the kernel.

The good news is I think we can relatively easily convert i915 and older
amdgpu devices to something which is compatible with user fences.


So yes, getting that fixed case by case should work.

Christian

On 2021-04-27 at 14:46, Marek Olšák wrote:
I'll defer to Christian and Alex to decide whether dropping sync with 
non-amd devices (GPUs, cameras etc.) is acceptable.


Rewriting those drivers to this new sync model could be done on a case 
by case basis.


For now, would we only lose the "amd -> external" dependency? Or the 
"external -> amd" dependency too?


Marek

On Tue., Apr. 27, 2021, 08:15 Daniel Vetter wrote:


On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák <mar...@gmail.com> wrote:
> Ok. I'll interpret this as "yes, it will work, let's do it".

It works if all you care about is drm/amdgpu. I'm not sure that's a
reasonable approach for upstream, but it definitely is an approach :-)

We've already gone somewhat through the pain of drm/amdgpu redefining
how implicit sync works without sufficiently talking with other
people, maybe we should avoid a repeat of this ...
-Daniel

>
> Marek
>
> On Tue., Apr. 27, 2021, 08:06 Christian König,
> <ckoenig.leichtzumer...@gmail.com> wrote:
>>
>> Correct, we wouldn't have synchronization between device with
and without user queues any more.
>>
>> That could only be a problem for A+I Laptops.
>>
>> Memory management will just work with preemption fences which
pause the user queues of a process before evicting something. That
will be a dma_fence, but also a well known approach.
>>
>> Christian.
>>
>> Am 27.04.21 um 13:49 schrieb Marek Olšák:
>>
>> If we don't use future fences for DMA fences at all, e.g. we
don't use them for memory management, it can work, right? Memory
management can suspend user queues anytime. It doesn't need to use
DMA fences. There might be something that I'm missing here.
>>
>> What would we lose without DMA fences? Just inter-device
synchronization? I think that might be acceptable.
>>
>> The only case when the kernel will wait on a future fence is
before a page flip. Everything today already depends on userspace
not hanging the gpu, which makes everything a future fence.
>>
>> Marek
>>
>> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter, <dan...@ffwll.ch> wrote:
>>>
>>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
>>> > Thanks everybody. The initial proposal is dead. Here are
some thoughts on
>>> > how to do it differently.
>>> >
>>> > I think we can have direct command submission from userspace via
>>> > memory-mapped queues ("user queues") without changing window
systems.
>>> >
>>> > The memory management doesn't have to use GPU page faults
like HMM.
>>> > Instead, it can wait for user queues of a specific process
to go idle and
>>> > then unmap the queues, so that userspace can't submit
anything. Buffer
>>> > evictions, pinning, etc. can be executed when all queues are
unmapped
>>> > (suspended). Thus, no BO fences and page faults are needed.
>>> >
>>> > Inter-process synchronization can use timeline semaphores.
Userspace will
>>> > query the wait and signal value for a shared buffer from the
kernel. The
>>> > kernel will keep a history of those queries to know which
process is
>>> > responsible for signalling which buffer. There is only the
wait-timeout
>>> > issue and how to identify the culprit. One of the solutions
is to have the
>>> > GPU send all GPU signal commands and all timed out wait
commands via an
>>> > interrupt to the kernel driver to monitor and validate
userspace behavior.
>>> > With that, it can be identified whether the culprit is the
waiting process
>>> > or the signalling process and which one. Invalid signal/wait
parameters can
>>> > also be detected. The kernel can force-signal only the
semaphores that time
>>> > out, and punish the processes which caused the timeout or
used invalid
>>> > signal/wait parameters.
>>> >
>>> > The question is whether this synchronization solution is
robust enough for
>>> > dma_fence and whatever the kernel and window systems need.
>>>
>>> The proper model here is the preempt-ctx dma_fence that amdkfd
uses
>>> (without page faults). That means dma_fence for
synchronization is doa, at
>>> least as-is, and we're back to figuring out the 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Marek Olšák
I'll defer to Christian and Alex to decide whether dropping sync with
non-amd devices (GPUs, cameras etc.) is acceptable.

Rewriting those drivers to this new sync model could be done on a case by
case basis.

For now, would we only lose the "amd -> external" dependency? Or the
"external -> amd" dependency too?

Marek

On Tue., Apr. 27, 2021, 08:15 Daniel Vetter,  wrote:

> On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák  wrote:
> > Ok. I'll interpret this as "yes, it will work, let's do it".
>
> It works if all you care about is drm/amdgpu. I'm not sure that's a
> reasonable approach for upstream, but it definitely is an approach :-)
>
> We've already gone somewhat through the pain of drm/amdgpu redefining
> how implicit sync works without sufficiently talking with other
> people, maybe we should avoid a repeat of this ...
> -Daniel
>
> >
> > Marek
> >
> > On Tue., Apr. 27, 2021, 08:06 Christian König, <
> ckoenig.leichtzumer...@gmail.com> wrote:
> >>
> >> Correct, we wouldn't have synchronization between device with and
> without user queues any more.
> >>
> >> That could only be a problem for A+I Laptops.
> >>
> >> Memory management will just work with preemption fences which pause the
> user queues of a process before evicting something. That will be a
> dma_fence, but also a well known approach.
> >>
> >> Christian.
> >>
> >> Am 27.04.21 um 13:49 schrieb Marek Olšák:
> >>
> >> If we don't use future fences for DMA fences at all, e.g. we don't use
> them for memory management, it can work, right? Memory management can
> suspend user queues anytime. It doesn't need to use DMA fences. There might
> be something that I'm missing here.
> >>
> >> What would we lose without DMA fences? Just inter-device
> synchronization? I think that might be acceptable.
> >>
> >> The only case when the kernel will wait on a future fence is before a
> page flip. Everything today already depends on userspace not hanging the
> gpu, which makes everything a future fence.
> >>
> >> Marek
> >>
> >> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:
> >>>
> >>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
> >>> > Thanks everybody. The initial proposal is dead. Here are some
> thoughts on
> >>> > how to do it differently.
> >>> >
> >>> > I think we can have direct command submission from userspace via
> >>> > memory-mapped queues ("user queues") without changing window systems.
> >>> >
> >>> > The memory management doesn't have to use GPU page faults like HMM.
> >>> > Instead, it can wait for user queues of a specific process to go
> idle and
> >>> > then unmap the queues, so that userspace can't submit anything.
> Buffer
> >>> > evictions, pinning, etc. can be executed when all queues are unmapped
> >>> > (suspended). Thus, no BO fences and page faults are needed.
> >>> >
> >>> > Inter-process synchronization can use timeline semaphores. Userspace
> will
> >>> > query the wait and signal value for a shared buffer from the kernel.
> The
> >>> > kernel will keep a history of those queries to know which process is
> >>> > responsible for signalling which buffer. There is only the
> wait-timeout
> >>> > issue and how to identify the culprit. One of the solutions is to
> have the
> >>> > GPU send all GPU signal commands and all timed out wait commands via
> an
> >>> > interrupt to the kernel driver to monitor and validate userspace
> behavior.
> >>> > With that, it can be identified whether the culprit is the waiting
> process
> >>> > or the signalling process and which one. Invalid signal/wait
> parameters can
> >>> > also be detected. The kernel can force-signal only the semaphores
> that time
> >>> > out, and punish the processes which caused the timeout or used
> invalid
> >>> > signal/wait parameters.
> >>> >
> >>> > The question is whether this synchronization solution is robust
> enough for
> >>> > dma_fence and whatever the kernel and window systems need.
> >>>
> >>> The proper model here is the preempt-ctx dma_fence that amdkfd uses
> >>> (without page faults). That means dma_fence for synchronization is
> doa, at
> >>> least as-is, and we're back to figuring out the winsys problem.
> >>>
> >>> "We'll solve it with timeouts" is very tempting, but doesn't work. It's
> >>> akin to saying that we're solving deadlock issues in a locking design
> by
> >>> doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
> >>> avoids having to reach the reset button, but that's about it.
> >>>
> >>> And the fundamental problem is that once you throw in userspace command
> >>> submission (and syncing, at least within the userspace driver,
> otherwise
> >>> there's kinda no point if you still need the kernel for cross-engine
> sync)
> >>> means you get deadlocks if you still use dma_fence for sync under
> >>> perfectly legit use-case. We've discussed that one ad nauseam last
> summer:
> >>>
> >>>
> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences
> >>>
> >>> 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Christian König

On 2021-04-27 at 14:15, Daniel Vetter wrote:

On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák  wrote:

Ok. I'll interpret this as "yes, it will work, let's do it".

It works if all you care about is drm/amdgpu. I'm not sure that's a
reasonable approach for upstream, but it definitely is an approach :-)

We've already gone somewhat through the pain of drm/amdgpu redefining
how implicit sync works without sufficiently talking with other
people, maybe we should avoid a repeat of this ...


BTW: This is coming up again for the plan here.

We once more need to think about the "other" fences which don't 
participate in the implicit sync here.


Christian.


-Daniel


Marek

On Tue., Apr. 27, 2021, 08:06 Christian König, 
 wrote:

Correct, we wouldn't have synchronization between device with and without user 
queues any more.

That could only be a problem for A+I Laptops.

Memory management will just work with preemption fences which pause the user 
queues of a process before evicting something. That will be a dma_fence, but 
also a well known approach.

Christian.

Am 27.04.21 um 13:49 schrieb Marek Olšák:

If we don't use future fences for DMA fences at all, e.g. we don't use them for 
memory management, it can work, right? Memory management can suspend user 
queues anytime. It doesn't need to use DMA fences. There might be something 
that I'm missing here.

What would we lose without DMA fences? Just inter-device synchronization? I 
think that might be acceptable.

The only case when the kernel will wait on a future fence is before a page 
flip. Everything today already depends on userspace not hanging the gpu, which 
makes everything a future fence.

Marek

On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:

On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:

Thanks everybody. The initial proposal is dead. Here are some thoughts on
how to do it differently.

I think we can have direct command submission from userspace via
memory-mapped queues ("user queues") without changing window systems.

The memory management doesn't have to use GPU page faults like HMM.
Instead, it can wait for user queues of a specific process to go idle and
then unmap the queues, so that userspace can't submit anything. Buffer
evictions, pinning, etc. can be executed when all queues are unmapped
(suspended). Thus, no BO fences and page faults are needed.

Inter-process synchronization can use timeline semaphores. Userspace will
query the wait and signal value for a shared buffer from the kernel. The
kernel will keep a history of those queries to know which process is
responsible for signalling which buffer. There is only the wait-timeout
issue and how to identify the culprit. One of the solutions is to have the
GPU send all GPU signal commands and all timed out wait commands via an
interrupt to the kernel driver to monitor and validate userspace behavior.
With that, it can be identified whether the culprit is the waiting process
or the signalling process and which one. Invalid signal/wait parameters can
also be detected. The kernel can force-signal only the semaphores that time
out, and punish the processes which caused the timeout or used invalid
signal/wait parameters.

The question is whether this synchronization solution is robust enough for
dma_fence and whatever the kernel and window systems need.

The proper model here is the preempt-ctx dma_fence that amdkfd uses
(without page faults). That means dma_fence for synchronization is doa, at
least as-is, and we're back to figuring out the winsys problem.

"We'll solve it with timeouts" is very tempting, but doesn't work. It's
akin to saying that we're solving deadlock issues in a locking design by
doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
avoids having to reach the reset button, but that's about it.

And the fundamental problem is that once you throw in userspace command
submission (and syncing, at least within the userspace driver, otherwise
there's kinda no point if you still need the kernel for cross-engine sync)
means you get deadlocks if you still use dma_fence for sync under
perfectly legit use-case. We've discussed that one ad nauseam last summer:

https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences

See silly diagramm at the bottom.

Now I think all isn't lost, because imo the first step to getting to this
brave new world is rebuilding the driver on top of userspace fences, and
with the adjusted cmd submit model. You probably don't want to use amdkfd,
but port that as a context flag or similar to render nodes for gl/vk. Of
course that means you can only use this mode in headless, without
glx/wayland winsys support, but it's a start.
-Daniel


Marek

On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone  wrote:


Hi,

On Tue, 20 Apr 2021 at 20:30, Daniel Vetter  wrote:


The thing is, you can't do this in drm/scheduler. At least not without
splitting up the dma_fence in the kernel into separate 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Daniel Vetter
On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák  wrote:
> Ok. I'll interpret this as "yes, it will work, let's do it".

It works if all you care about is drm/amdgpu. I'm not sure that's a
reasonable approach for upstream, but it definitely is an approach :-)

We've already gone somewhat through the pain of drm/amdgpu redefining
how implicit sync works without sufficiently talking with other
people, maybe we should avoid a repeat of this ...
-Daniel

>
> Marek
>
> On Tue., Apr. 27, 2021, 08:06 Christian König, 
>  wrote:
>>
>> Correct, we wouldn't have synchronization between device with and without 
>> user queues any more.
>>
>> That could only be a problem for A+I Laptops.
>>
>> Memory management will just work with preemption fences which pause the user 
>> queues of a process before evicting something. That will be a dma_fence, but 
>> also a well known approach.
>>
>> Christian.
>>
>> Am 27.04.21 um 13:49 schrieb Marek Olšák:
>>
>> If we don't use future fences for DMA fences at all, e.g. we don't use them 
>> for memory management, it can work, right? Memory management can suspend 
>> user queues anytime. It doesn't need to use DMA fences. There might be 
>> something that I'm missing here.
>>
>> What would we lose without DMA fences? Just inter-device synchronization? I 
>> think that might be acceptable.
>>
>> The only case when the kernel will wait on a future fence is before a page 
>> flip. Everything today already depends on userspace not hanging the gpu, 
>> which makes everything a future fence.
>>
>> Marek
>>
>> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:
>>>
>>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
>>> > Thanks everybody. The initial proposal is dead. Here are some thoughts on
>>> > how to do it differently.
>>> >
>>> > I think we can have direct command submission from userspace via
>>> > memory-mapped queues ("user queues") without changing window systems.
>>> >
>>> > The memory management doesn't have to use GPU page faults like HMM.
>>> > Instead, it can wait for user queues of a specific process to go idle and
>>> > then unmap the queues, so that userspace can't submit anything. Buffer
>>> > evictions, pinning, etc. can be executed when all queues are unmapped
>>> > (suspended). Thus, no BO fences and page faults are needed.
>>> >
>>> > Inter-process synchronization can use timeline semaphores. Userspace will
>>> > query the wait and signal value for a shared buffer from the kernel. The
>>> > kernel will keep a history of those queries to know which process is
>>> > responsible for signalling which buffer. There is only the wait-timeout
>>> > issue and how to identify the culprit. One of the solutions is to have the
>>> > GPU send all GPU signal commands and all timed out wait commands via an
>>> > interrupt to the kernel driver to monitor and validate userspace behavior.
>>> > With that, it can be identified whether the culprit is the waiting process
>>> > or the signalling process and which one. Invalid signal/wait parameters 
>>> > can
>>> > also be detected. The kernel can force-signal only the semaphores that 
>>> > time
>>> > out, and punish the processes which caused the timeout or used invalid
>>> > signal/wait parameters.
>>> >
>>> > The question is whether this synchronization solution is robust enough for
>>> > dma_fence and whatever the kernel and window systems need.
>>>
>>> The proper model here is the preempt-ctx dma_fence that amdkfd uses
>>> (without page faults). That means dma_fence for synchronization is doa, at
>>> least as-is, and we're back to figuring out the winsys problem.
>>>
>>> "We'll solve it with timeouts" is very tempting, but doesn't work. It's
>>> akin to saying that we're solving deadlock issues in a locking design by
>>> doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
>>> avoids having to reach the reset button, but that's about it.
>>>
>>> And the fundamental problem is that once you throw in userspace command
>>> submission (and syncing, at least within the userspace driver, otherwise
>>> there's kinda no point if you still need the kernel for cross-engine sync)
>>> means you get deadlocks if you still use dma_fence for sync under
>>> perfectly legit use-case. We've discussed that one ad nauseam last summer:
>>>
>>> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences
>>>
>>> See silly diagramm at the bottom.
>>>
>>> Now I think all isn't lost, because imo the first step to getting to this
>>> brave new world is rebuilding the driver on top of userspace fences, and
>>> with the adjusted cmd submit model. You probably don't want to use amdkfd,
>>> but port that as a context flag or similar to render nodes for gl/vk. Of
>>> course that means you can only use this mode in headless, without
>>> glx/wayland winsys support, but it's a start.
>>> -Daniel
>>>
>>> >
>>> > Marek
>>> >
>>> > On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone  wrote:
>>> >
>>> > > 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Daniel Vetter
On Tue, Apr 27, 2021 at 1:49 PM Marek Olšák  wrote:
>
> If we don't use future fences for DMA fences at all, e.g. we don't use them 
> for memory management, it can work, right? Memory management can suspend user 
> queues anytime. It doesn't need to use DMA fences. There might be something 
> that I'm missing here.

Other drivers use dma_fence for their memory management. So unless
you've converted them all over to the dma_fence/memory-fence split,
dma_fence fences stay memory fences. In theory this is possible, but
maybe not if you want to complete the job this decade :-)

> What would we lose without DMA fences? Just inter-device synchronization? I 
> think that might be acceptable.
>
> The only case when the kernel will wait on a future fence is before a page 
> flip. Everything today already depends on userspace not hanging the gpu, 
> which makes everything a future fence.

That's not quite what we defined as future fences, because TDR (timeout
detection and recovery) guarantees those complete even if userspace hangs.
The "real" future fences kick in when you put userspace fence waits into
the command (cs) buffer you've submitted to the kernel (or directly to hw).
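
A rough, self-contained model of that kind of submission; the opcodes,
packet layout and helpers below are invented purely for illustration:

#include <stdint.h>
#include <stdio.h>

/* Invented opcodes and packet layout, not any real GPU's format. */
enum { OP_WAIT_MEM_GE = 0x10, OP_DISPATCH = 0x20 };

struct cmdbuf {
    uint32_t data[64];
    unsigned int n;
};

static void emit(struct cmdbuf *cs, uint32_t dw)
{
    cs->data[cs->n++] = dw;
}

/* "Stall this queue until *(uint64_t *)va >= value."  Once a packet like
 * this sits in a buffer the kernel no longer parses, completion of the
 * whole submission depends on some other context writing that value. */
static void emit_wait_mem_ge(struct cmdbuf *cs, uint64_t va, uint64_t value)
{
    emit(cs, OP_WAIT_MEM_GE);
    emit(cs, (uint32_t)va);
    emit(cs, (uint32_t)(va >> 32));
    emit(cs, (uint32_t)value);
    emit(cs, (uint32_t)(value >> 32));
}

int main(void)
{
    struct cmdbuf cs = { .n = 0 };

    emit_wait_mem_ge(&cs, 0x100000, 42);   /* the "real" future fence */
    emit(&cs, OP_DISPATCH);                /* work gated behind it    */

    printf("recorded %u dwords\n", cs.n);
    return 0;
}
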
-Daniel

>
> Marek
>
> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:
>>
>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
>> > Thanks everybody. The initial proposal is dead. Here are some thoughts on
>> > how to do it differently.
>> >
>> > I think we can have direct command submission from userspace via
>> > memory-mapped queues ("user queues") without changing window systems.
>> >
>> > The memory management doesn't have to use GPU page faults like HMM.
>> > Instead, it can wait for user queues of a specific process to go idle and
>> > then unmap the queues, so that userspace can't submit anything. Buffer
>> > evictions, pinning, etc. can be executed when all queues are unmapped
>> > (suspended). Thus, no BO fences and page faults are needed.
>> >
>> > Inter-process synchronization can use timeline semaphores. Userspace will
>> > query the wait and signal value for a shared buffer from the kernel. The
>> > kernel will keep a history of those queries to know which process is
>> > responsible for signalling which buffer. There is only the wait-timeout
>> > issue and how to identify the culprit. One of the solutions is to have the
>> > GPU send all GPU signal commands and all timed out wait commands via an
>> > interrupt to the kernel driver to monitor and validate userspace behavior.
>> > With that, it can be identified whether the culprit is the waiting process
>> > or the signalling process and which one. Invalid signal/wait parameters can
>> > also be detected. The kernel can force-signal only the semaphores that time
>> > out, and punish the processes which caused the timeout or used invalid
>> > signal/wait parameters.
>> >
>> > The question is whether this synchronization solution is robust enough for
>> > dma_fence and whatever the kernel and window systems need.
>>
>> The proper model here is the preempt-ctx dma_fence that amdkfd uses
>> (without page faults). That means dma_fence for synchronization is doa, at
>> least as-is, and we're back to figuring out the winsys problem.
>>
>> "We'll solve it with timeouts" is very tempting, but doesn't work. It's
>> akin to saying that we're solving deadlock issues in a locking design by
>> doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
>> avoids having to reach the reset button, but that's about it.
>>
>> And the fundamental problem is that once you throw in userspace command
>> submission (and syncing, at least within the userspace driver, otherwise
>> there's kinda no point if you still need the kernel for cross-engine sync)
>> means you get deadlocks if you still use dma_fence for sync under
>> perfectly legit use-case. We've discussed that one ad nauseam last summer:
>>
>> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences
>>
>> See silly diagramm at the bottom.
>>
>> Now I think all isn't lost, because imo the first step to getting to this
>> brave new world is rebuilding the driver on top of userspace fences, and
>> with the adjusted cmd submit model. You probably don't want to use amdkfd,
>> but port that as a context flag or similar to render nodes for gl/vk. Of
>> course that means you can only use this mode in headless, without
>> glx/wayland winsys support, but it's a start.
>> -Daniel
>>
>> >
>> > Marek
>> >
>> > On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone  wrote:
>> >
>> > > Hi,
>> > >
>> > > On Tue, 20 Apr 2021 at 20:30, Daniel Vetter  wrote:
>> > >
>> > >> The thing is, you can't do this in drm/scheduler. At least not without
>> > >> splitting up the dma_fence in the kernel into separate memory fences
>> > >> and sync fences
>> > >
>> > >
>> > > I'm starting to think this thread needs its own glossary ...
>> > >
>> > > I propose we use 'residency fence' for execution fences which enact
>> > > memory-residency 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Marek Olšák
Ok. I'll interpret this as "yes, it will work, let's do it".

Marek

On Tue., Apr. 27, 2021, 08:06 Christian König, <
ckoenig.leichtzumer...@gmail.com> wrote:

> Correct, we wouldn't have synchronization between device with and without
> user queues any more.
>
> That could only be a problem for A+I Laptops.
>
> Memory management will just work with preemption fences which pause the
> user queues of a process before evicting something. That will be a
> dma_fence, but also a well known approach.
>
> Christian.
>
> Am 27.04.21 um 13:49 schrieb Marek Olšák:
>
> If we don't use future fences for DMA fences at all, e.g. we don't use
> them for memory management, it can work, right? Memory management can
> suspend user queues anytime. It doesn't need to use DMA fences. There might
> be something that I'm missing here.
>
> What would we lose without DMA fences? Just inter-device synchronization?
> I think that might be acceptable.
>
> The only case when the kernel will wait on a future fence is before a page
> flip. Everything today already depends on userspace not hanging the gpu,
> which makes everything a future fence.
>
> Marek
>
> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:
>
>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
>> > Thanks everybody. The initial proposal is dead. Here are some thoughts
>> on
>> > how to do it differently.
>> >
>> > I think we can have direct command submission from userspace via
>> > memory-mapped queues ("user queues") without changing window systems.
>> >
>> > The memory management doesn't have to use GPU page faults like HMM.
>> > Instead, it can wait for user queues of a specific process to go idle
>> and
>> > then unmap the queues, so that userspace can't submit anything. Buffer
>> > evictions, pinning, etc. can be executed when all queues are unmapped
>> > (suspended). Thus, no BO fences and page faults are needed.
>> >
>> > Inter-process synchronization can use timeline semaphores. Userspace
>> will
>> > query the wait and signal value for a shared buffer from the kernel. The
>> > kernel will keep a history of those queries to know which process is
>> > responsible for signalling which buffer. There is only the wait-timeout
>> > issue and how to identify the culprit. One of the solutions is to have
>> the
>> > GPU send all GPU signal commands and all timed out wait commands via an
>> > interrupt to the kernel driver to monitor and validate userspace
>> behavior.
>> > With that, it can be identified whether the culprit is the waiting
>> process
>> > or the signalling process and which one. Invalid signal/wait parameters
>> can
>> > also be detected. The kernel can force-signal only the semaphores that
>> time
>> > out, and punish the processes which caused the timeout or used invalid
>> > signal/wait parameters.
>> >
>> > The question is whether this synchronization solution is robust enough
>> for
>> > dma_fence and whatever the kernel and window systems need.
>>
>> The proper model here is the preempt-ctx dma_fence that amdkfd uses
>> (without page faults). That means dma_fence for synchronization is doa, at
>> least as-is, and we're back to figuring out the winsys problem.
>>
>> "We'll solve it with timeouts" is very tempting, but doesn't work. It's
>> akin to saying that we're solving deadlock issues in a locking design by
>> doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
>> avoids having to reach the reset button, but that's about it.
>>
>> And the fundamental problem is that once you throw in userspace command
>> submission (and syncing, at least within the userspace driver, otherwise
>> there's kinda no point if you still need the kernel for cross-engine sync)
>> means you get deadlocks if you still use dma_fence for sync under
>> perfectly legit use-case. We've discussed that one ad nauseam last summer:
>>
>>
>> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences
>>
>> See silly diagramm at the bottom.
>>
>> Now I think all isn't lost, because imo the first step to getting to this
>> brave new world is rebuilding the driver on top of userspace fences, and
>> with the adjusted cmd submit model. You probably don't want to use amdkfd,
>> but port that as a context flag or similar to render nodes for gl/vk. Of
>> course that means you can only use this mode in headless, without
>> glx/wayland winsys support, but it's a start.
>> -Daniel
>>
>> >
>> > Marek
>> >
>> > On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone 
>> wrote:
>> >
>> > > Hi,
>> > >
>> > > On Tue, 20 Apr 2021 at 20:30, Daniel Vetter  wrote:
>> > >
>> > >> The thing is, you can't do this in drm/scheduler. At least not
>> without
>> > >> splitting up the dma_fence in the kernel into separate memory fences
>> > >> and sync fences
>> > >
>> > >
>> > > I'm starting to think this thread needs its own glossary ...
>> > >
>> > > I propose we use 'residency fence' for execution fences which enact
>> > > memory-residency 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Christian König
Correct, we wouldn't have synchronization between devices with and 
without user queues any more.

That could only be a problem for A+I (AMD + Intel) laptops.

Memory management will just work with preemption fences which pause the 
user queues of a process before evicting something. That will be a 
dma_fence, but it is also a well-known approach.
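
A very rough plain-C model of that preemption fence (not the real
dma_fence API; all names are illustrative):

#include <stdbool.h>

/* One fence per "pause this process's user queues" operation. */
struct preempt_fence {
    bool signalled;
};

/* Ask the HW scheduler to unmap/preempt the process's user queues; this
 * completes in bounded time no matter what userspace has submitted. */
static void preempt_user_queues(struct preempt_fence *f)
{
    /* ...talk to the scheduler... */
    f->signalled = true;
}

static void preempt_fence_wait(const struct preempt_fence *f)
{
    while (!f->signalled)
        ;   /* stand-in for dma_fence_wait() */
}

static void evict_something(void)
{
    struct preempt_fence fence = { .signalled = false };

    preempt_user_queues(&fence);   /* pause the user queues       */
    preempt_fence_wait(&fence);    /* the only dma_fence involved */
    /* ...move/unpin buffers, then map the queues again...        */
}

The point is that the fence completes once the queues are preempted,
without depending on userspace ever finishing its work, which is why it
can safely stay a dma_fence.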


Christian.

Am 27.04.21 um 13:49 schrieb Marek Olšák:
If we don't use future fences for DMA fences at all, e.g. we don't use 
them for memory management, it can work, right? Memory management can 
suspend user queues anytime. It doesn't need to use DMA fences. There 
might be something that I'm missing here.


What would we lose without DMA fences? Just inter-device 
synchronization? I think that might be acceptable.


The only case when the kernel will wait on a future fence is before a 
page flip. Everything today already depends on userspace not hanging 
the gpu, which makes everything a future fence.


Marek

On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:


On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
> Thanks everybody. The initial proposal is dead. Here are some
thoughts on
> how to do it differently.
>
> I think we can have direct command submission from userspace via
> memory-mapped queues ("user queues") without changing window
systems.
>
> The memory management doesn't have to use GPU page faults like HMM.
> Instead, it can wait for user queues of a specific process to go
idle and
> then unmap the queues, so that userspace can't submit anything.
Buffer
> evictions, pinning, etc. can be executed when all queues are
unmapped
> (suspended). Thus, no BO fences and page faults are needed.
>
> Inter-process synchronization can use timeline semaphores.
Userspace will
> query the wait and signal value for a shared buffer from the
kernel. The
> kernel will keep a history of those queries to know which process is
> responsible for signalling which buffer. There is only the
wait-timeout
> issue and how to identify the culprit. One of the solutions is
to have the
> GPU send all GPU signal commands and all timed out wait commands
via an
> interrupt to the kernel driver to monitor and validate userspace
behavior.
> With that, it can be identified whether the culprit is the
waiting process
> or the signalling process and which one. Invalid signal/wait
parameters can
> also be detected. The kernel can force-signal only the
semaphores that time
> out, and punish the processes which caused the timeout or used
invalid
> signal/wait parameters.
>
> The question is whether this synchronization solution is robust
enough for
> dma_fence and whatever the kernel and window systems need.

The proper model here is the preempt-ctx dma_fence that amdkfd uses
(without page faults). That means dma_fence for synchronization is
doa, at
least as-is, and we're back to figuring out the winsys problem.

"We'll solve it with timeouts" is very tempting, but doesn't work.
It's
akin to saying that we're solving deadlock issues in a locking
design by
doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
avoids having to reach the reset button, but that's about it.

And the fundamental problem is that once you throw in userspace
command
submission (and syncing, at least within the userspace driver,
otherwise
there's kinda no point if you still need the kernel for
cross-engine sync)
means you get deadlocks if you still use dma_fence for sync under
perfectly legit use-case. We've discussed that one ad nauseam last
summer:


https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences



See silly diagramm at the bottom.

Now I think all isn't lost, because imo the first step to getting
to this
brave new world is rebuilding the driver on top of userspace
fences, and
with the adjusted cmd submit model. You probably don't want to use
amdkfd,
but port that as a context flag or similar to render nodes for
gl/vk. Of
course that means you can only use this mode in headless, without
glx/wayland winsys support, but it's a start.
-Daniel

>
> Marek
>
> > On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone  wrote:
>
> > Hi,
> >
> > On Tue, 20 Apr 2021 at 20:30, Daniel Vetter  wrote:
> >
> >> The thing is, you can't do this in drm/scheduler. At least
not without
> >> splitting up the dma_fence in the kernel into separate memory
fences
> >> and sync fences
> >
> >
> > I'm starting to think this thread 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Marek Olšák
If we don't use future fences for DMA fences at all, e.g. we don't use them
for memory management, it can work, right? Memory management can suspend
user queues anytime. It doesn't need to use DMA fences. There might be
something that I'm missing here.

What would we lose without DMA fences? Just inter-device synchronization? I
think that might be acceptable.

The only case when the kernel will wait on a future fence is before a page
flip. Everything today already depends on userspace not hanging the gpu,
which makes everything a future fence.

Marek

On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:

> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
> > Thanks everybody. The initial proposal is dead. Here are some thoughts on
> > how to do it differently.
> >
> > I think we can have direct command submission from userspace via
> > memory-mapped queues ("user queues") without changing window systems.
> >
> > The memory management doesn't have to use GPU page faults like HMM.
> > Instead, it can wait for user queues of a specific process to go idle and
> > then unmap the queues, so that userspace can't submit anything. Buffer
> > evictions, pinning, etc. can be executed when all queues are unmapped
> > (suspended). Thus, no BO fences and page faults are needed.
> >
> > Inter-process synchronization can use timeline semaphores. Userspace will
> > query the wait and signal value for a shared buffer from the kernel. The
> > kernel will keep a history of those queries to know which process is
> > responsible for signalling which buffer. There is only the wait-timeout
> > issue and how to identify the culprit. One of the solutions is to have
> the
> > GPU send all GPU signal commands and all timed out wait commands via an
> > interrupt to the kernel driver to monitor and validate userspace
> behavior.
> > With that, it can be identified whether the culprit is the waiting
> process
> > or the signalling process and which one. Invalid signal/wait parameters
> can
> > also be detected. The kernel can force-signal only the semaphores that
> time
> > out, and punish the processes which caused the timeout or used invalid
> > signal/wait parameters.
> >
> > The question is whether this synchronization solution is robust enough
> for
> > dma_fence and whatever the kernel and window systems need.
>
> The proper model here is the preempt-ctx dma_fence that amdkfd uses
> (without page faults). That means dma_fence for synchronization is doa, at
> least as-is, and we're back to figuring out the winsys problem.
>
> "We'll solve it with timeouts" is very tempting, but doesn't work. It's
> akin to saying that we're solving deadlock issues in a locking design by
> doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
> avoids having to reach the reset button, but that's about it.
>
> And the fundamental problem is that once you throw in userspace command
> submission (and syncing, at least within the userspace driver, otherwise
> there's kinda no point if you still need the kernel for cross-engine sync)
> means you get deadlocks if you still use dma_fence for sync under
> perfectly legit use-case. We've discussed that one ad nauseam last summer:
>
>
> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences
>
> See silly diagramm at the bottom.
>
> Now I think all isn't lost, because imo the first step to getting to this
> brave new world is rebuilding the driver on top of userspace fences, and
> with the adjusted cmd submit model. You probably don't want to use amdkfd,
> but port that as a context flag or similar to render nodes for gl/vk. Of
> course that means you can only use this mode in headless, without
> glx/wayland winsys support, but it's a start.
> -Daniel
>
> >
> > Marek
> >
> > On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone 
> wrote:
> >
> > > Hi,
> > >
> > > On Tue, 20 Apr 2021 at 20:30, Daniel Vetter  wrote:
> > >
> > >> The thing is, you can't do this in drm/scheduler. At least not without
> > >> splitting up the dma_fence in the kernel into separate memory fences
> > >> and sync fences
> > >
> > >
> > > I'm starting to think this thread needs its own glossary ...
> > >
> > > I propose we use 'residency fence' for execution fences which enact
> > > memory-residency operations, e.g. faulting in a page ultimately
> depending
> > > on GPU work retiring.
> > >
> > > And 'value fence' for the pure-userspace model suggested by timeline
> > > semaphores, i.e. fences being (*addr == val) rather than being able to
> look
> > > at ctx seqno.
> > >
> > > Cheers,
> > > Daniel
> > > ___
> > > mesa-dev mailing list
> > > mesa-dev@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> > >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org

Re: [Mesa-dev] Trying to build a opencl dev env

2021-04-27 Thread Pierre Moreau
Hello Luke,

If you set `PKG_CONFIG_PATH=$PATH_TO_LIBCLC_INSTALL/share/pkgconfig` when
running meson, it should pick that version instead of the system one.

I run it as `PKG_CONFIG_PATH=[…] meson setup […]`; it might also be possible to
pass it as an argument instead, but I do not know.

Best,
Pierre

On 2021-04-26 — 23:36, Luke A. Guest wrote:
> Hi,
> 
> So, I've built a generic LLVM tree for mesa and other projects and installed
> it alongside SPIRV-LLVM-Translator into $HOME/opt/llvm-dev-reldebug.
> 
> I have installed libclc there too, although it wouldn't build as part of
> llvm and make install wouldn't install to the same llvm dir for some reason.
> 
> Now onto mesa, I cannot get this thing to build. It fails right at the start
> with the following error:
> 
> ninja: Entering directory `../builds/mesa-reldebug/'
> ninja: error: '/usr/share/clc/spirv-mesa3d-.spv', needed by
> 'src/compiler/nir/spirv-mesa3d-.spv.zstd', missing and no known rule to make
> it
> 
> It's looking in the wrong place for the clc files, which are in the above
> llvm dir. How do you specify where they are with this meson thing?
> 
> Thanks,
> Luke.
> 
> ___
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev


___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Daniel Vetter
On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
> Thanks everybody. The initial proposal is dead. Here are some thoughts on
> how to do it differently.
> 
> I think we can have direct command submission from userspace via
> memory-mapped queues ("user queues") without changing window systems.
> 
> The memory management doesn't have to use GPU page faults like HMM.
> Instead, it can wait for user queues of a specific process to go idle and
> then unmap the queues, so that userspace can't submit anything. Buffer
> evictions, pinning, etc. can be executed when all queues are unmapped
> (suspended). Thus, no BO fences and page faults are needed.
> 
> Inter-process synchronization can use timeline semaphores. Userspace will
> query the wait and signal value for a shared buffer from the kernel. The
> kernel will keep a history of those queries to know which process is
> responsible for signalling which buffer. There is only the wait-timeout
> issue and how to identify the culprit. One of the solutions is to have the
> GPU send all GPU signal commands and all timed out wait commands via an
> interrupt to the kernel driver to monitor and validate userspace behavior.
> With that, it can be identified whether the culprit is the waiting process
> or the signalling process and which one. Invalid signal/wait parameters can
> also be detected. The kernel can force-signal only the semaphores that time
> out, and punish the processes which caused the timeout or used invalid
> signal/wait parameters.
> 
> The question is whether this synchronization solution is robust enough for
> dma_fence and whatever the kernel and window systems need.

The proper model here is the preempt-ctx dma_fence that amdkfd uses
(without page faults). That means dma_fence for synchronization is DOA, at
least as-is, and we're back to figuring out the winsys problem.

"We'll solve it with timeouts" is very tempting, but doesn't work. It's
akin to saying that we're solving deadlock issues in a locking design by
doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
avoids having to reach the reset button, but that's about it.

And the fundamental problem is that once you throw in userspace command
submission (and syncing, at least within the userspace driver, otherwise
there's kinda no point if you still need the kernel for cross-engine sync),
you get deadlocks if you still use dma_fence for sync under perfectly
legit use-cases. We've discussed that one ad nauseam last summer:

https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences

See the silly diagram at the bottom.

Now I think all isn't lost, because imo the first step to getting to this
brave new world is rebuilding the driver on top of userspace fences, and
with the adjusted cmd submit model. You probably don't want to use amdkfd,
but port that as a context flag or similar to render nodes for gl/vk. Of
course that means you can only use this mode in headless, without
glx/wayland winsys support, but it's a start.
-Daniel

> 
> Marek
> 
> On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone  wrote:
> 
> > Hi,
> >
> > On Tue, 20 Apr 2021 at 20:30, Daniel Vetter  wrote:
> >
> >> The thing is, you can't do this in drm/scheduler. At least not without
> >> splitting up the dma_fence in the kernel into separate memory fences
> >> and sync fences
> >
> >
> > I'm starting to think this thread needs its own glossary ...
> >
> > I propose we use 'residency fence' for execution fences which enact
> > memory-residency operations, e.g. faulting in a page ultimately depending
> > on GPU work retiring.
> >
> > And 'value fence' for the pure-userspace model suggested by timeline
> > semaphores, i.e. fences being (*addr == val) rather than being able to look
> > at ctx seqno.
> >
> > Cheers,
> > Daniel
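
A minimal plain-C model of such a value fence (the definition above uses
==; this sketch checks >= in the usual monotonic timeline-semaphore style;
everything here is illustrative):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* A value fence is nothing but a memory location and a value; there is no
 * kernel object behind it. */
struct value_fence {
    _Atomic uint64_t *addr;    /* e.g. a mapped timeline semaphore */
    uint64_t value;
};

static bool value_fence_signalled(const struct value_fence *f)
{
    return atomic_load_explicit(f->addr, memory_order_acquire) >= f->value;
}

/* Signalling side: the producer simply stores the value when its work is
 * done (on a GPU this would be a write from the command stream). */
static void value_fence_signal(struct value_fence *f)
{
    atomic_store_explicit(f->addr, f->value, memory_order_release);
}

/* Waiting side: spin here for simplicity; a real consumer would sleep or
 * use a HW wait-on-memory command. */
static void value_fence_wait(const struct value_fence *f)
{
    while (!value_fence_signalled(f))
        ;
}
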
> > ___
> > mesa-dev mailing list
> > mesa-dev@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> >

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev