Re: [Intel-xe] [PATCH v2 4/9] drm/sched: Split free_job into own work item

2023-08-24 Thread Matthew Brost
On Thu, Aug 24, 2023 at 01:44:41PM +0200, Christian König wrote:
> Am 24.08.23 um 01:12 schrieb Matthew Brost:
> > On Wed, Aug 23, 2023 at 01:26:09PM -0400, Rodrigo Vivi wrote:
> > > On Wed, Aug 23, 2023 at 11:41:19AM -0400, Alex Deucher wrote:
> > > > On Wed, Aug 23, 2023 at 11:26 AM Matthew Brost 
> > > >  wrote:
> > > > > On Wed, Aug 23, 2023 at 09:10:51AM +0200, Christian König wrote:
> > > > > > Am 23.08.23 um 05:27 schrieb Matthew Brost:
> > > > > > > [SNIP]
> > > > > > > > That is exactly what I want to avoid, tying the TDR to the job 
> > > > > > > > is what some
> > > > > > > > AMD engineers pushed for because it looked like a simple 
> > > > > > > > solution and made
> > > > > > > > the whole thing similar to what Windows does.
> > > > > > > > 
> > > > > > > > This turned the previous relatively clean scheduler and TDR 
> > > > > > > > design into a
> > > > > > > > complete nightmare. The job contains quite a bunch of things 
> > > > > > > > which are not
> > > > > > > > necessarily available after the application which submitted the 
> > > > > > > > job is torn
> > > > > > > > down.
> > > > > > > > 
> > > > > > > Agree the TDR shouldn't be accessing anything application specific
> > > > > > > rather just internal job state required to tear the job down on 
> > > > > > > the
> > > > > > > hardware.
> > > > > > > > So what happens is that you either have stale pointers in the 
> > > > > > > > TDR which can
> > > > > > > > go boom extremely easily or we somehow find a way to keep the 
> > > > > > > > necessary
> > > > > > > I have not experienced the TDR going boom in Xe.
> > > > > > > 
> > > > > > > > structures (which include struct thread_info and struct file 
> > > > > > > > for this driver
> > > > > > > > connection) alive until all submissions are completed.
> > > > > > > > 
> > > > > > > In Xe we keep everything alive until all submissions are 
> > > > > > > completed. By
> > > > > > > everything I mean the drm job, entity, scheduler, and VM via a 
> > > > > > > reference
> > > > > > > counting scheme. All of these structures are just kernel state 
> > > > > > > which can
> > > > > > > safely be accessed even if the application has been killed.
> > > > > > Yeah, but that might just not be such a good idea from memory 
> > > > > > management
> > > > > > point of view.
> > > > > > 
> > > > > > When you (for example) kill a process, all resources from that 
> > > > > > process should
> > > > > > at least be queued to be freed more or less immediately.
> > > > > > 
> > > > > We do this, the TDR kicks jobs off the hardware as fast as the hw
> > > > > interface allows and signals all pending hw fences immediately after.
> > > > > free_job is then immediately called and the reference count goes to
> > > > > zero. I think max time for all of this to occur is a handful of ms.
> > > > > 
> > > > > > What Linux is doing for other I/O operations is to keep the 
> > > > > > relevant pages
> > > > > > alive until the I/O operation is completed, but for GPUs that 
> > > > > > usually means
> > > > > > keeping most of the memory of the process alive and that in turn is 
> > > > > > really
> > > > > > not something you can do.
> > > > > > 
> > > > > > You can of course do this if your driver has a reliable way of 
> > > > > > killing your
> > > > > > submissions and freeing resources in a reasonable amount of time. 
> > > > > > This
> > > > > > should then be done in the flush callback.
> > > > > > 
> > > > > 'flush callback' - Do you mean drm_sched_entity_flush? I looked at 
> > > > > that
> > > > > and I don't think that function does what you describe. It flushes
> > > > > the spsc queue, but what about jobs already on the hardware - how do
> > > > > those get killed?
> > > > > 
> > > > > As stated, we do that via the TDR, which is a rather clean design and
> > > > > fits with our reference counting scheme.
> > > > > 
> > > > > > > If we need to tear down on demand we just set the TDR to a minimum 
> > > > > > > value and
> > > > > > > it kicks the jobs off the hardware, gracefully cleans everything 
> > > > > > > up and
> > > > > > > drops all references. This is a benefit of the 1 to 1 
> > > > > > > relationship, not
> > > > > > > sure if this works with how AMDGPU uses the scheduler.
> > > > > > > 
> > > > > > > > Delaying application tear down is also not an option because 
> > > > > > > > then you run
> > > > > > > > into massive trouble with the OOM killer (or more generally OOM 
> > > > > > > > handling).
> > > > > > > > See what we do in drm_sched_entity_flush() as well.
> > > > > > > > 
> > > > > > > Not an issue for Xe, we never call drm_sched_entity_flush as our
> > > > > > > reference counting scheme ensures all jobs are finished before we 
> > > > > > > attempt
> > > > > > > to tear down the entity / scheduler.
> > > > > > I don't think you can do that upstream. Calling 
> > > > > > drm_sched_entity_flush() is
> > > > > > a must have from your flush callback for the file descriptor.
> > > > > > 
> > > > > Again 'flus
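
For readers following along, here is a minimal sketch of the teardown flow
Matthew describes above: force the TDR to fire immediately, let the timeout
handler kick the job off the hardware and signal its fence, and let free_job
drop the references afterwards. All structures and helper names below are
hypothetical placeholders rather than the actual Xe code, and drm_sched_fault(),
which schedules the timeout work right away, stands in here for "setting the
TDR to a minimum value".

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>
#include <linux/kref.h>
#include <linux/slab.h>

/* Hypothetical 1:1 queue/entity/scheduler structures (not the Xe code). */
struct sketch_exec_queue {
	struct drm_gpu_scheduler sched;		/* one scheduler per queue */
	struct drm_sched_entity entity;		/* one entity per queue */
	struct kref refcount;			/* keeps queue + VM state alive */
};

struct sketch_job {
	struct drm_sched_job base;
	struct dma_fence *hw_fence;		/* signalled when the hw is done */
	struct sketch_exec_queue *q;		/* reference held by the job */
};

#define to_sketch_job(j) container_of(j, struct sketch_job, base)

void sketch_hw_kill_context(struct sketch_exec_queue *q);	/* placeholder */
void sketch_queue_release(struct kref *ref);			/* placeholder */

/* On-demand teardown: make the TDR fire right away for this queue. */
static void sketch_queue_kill(struct sketch_exec_queue *q)
{
	drm_sched_fault(&q->sched);
}

/* Timeout handler: kick the job off the hardware and signal its fence. */
static enum drm_gpu_sched_stat
sketch_timedout_job(struct drm_sched_job *sched_job)
{
	struct sketch_job *job = to_sketch_job(sched_job);

	sketch_hw_kill_context(job->q);
	dma_fence_signal(job->hw_fence);

	return DRM_GPU_SCHED_STAT_NOMINAL;
}

/* free_job: drop the references the job was holding. */
static void sketch_free_job(struct drm_sched_job *sched_job)
{
	struct sketch_job *job = to_sketch_job(sched_job);

	drm_sched_job_cleanup(sched_job);
	dma_fence_put(job->hw_fence);
	kref_put(&job->q->refcount, sketch_queue_release);
	kfree(job);
}

/* run_job omitted for brevity; these plug into the scheduler ops. */
static const struct drm_sched_backend_ops sketch_sched_ops = {
	.timedout_job	= sketch_timedout_job,
	.free_job	= sketch_free_job,
};

Everything the timeout handler and free_job touch here is plain kernel state
held by reference counts, which is the property the thread relies on for this
to stay safe after the submitting process has been killed.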

Re: [Intel-xe] [PATCH v2 4/9] drm/sched: Split free_job into own work item

2023-08-24 Thread Christian König

Am 24.08.23 um 01:12 schrieb Matthew Brost:

On Wed, Aug 23, 2023 at 01:26:09PM -0400, Rodrigo Vivi wrote:

On Wed, Aug 23, 2023 at 11:41:19AM -0400, Alex Deucher wrote:

On Wed, Aug 23, 2023 at 11:26 AM Matthew Brost  wrote:

On Wed, Aug 23, 2023 at 09:10:51AM +0200, Christian König wrote:

Am 23.08.23 um 05:27 schrieb Matthew Brost:

[SNIP]

That is exactly what I want to avoid, tying the TDR to the job is what some
AMD engineers pushed for because it looked like a simple solution and made
the whole thing similar to what Windows does.

This turned the previous relatively clean scheduler and TDR design into a
complete nightmare. The job contains quite a bunch of things which are not
necessarily available after the application which submitted the job is torn
down.


Agree the TDR shouldn't be accessing anything application specific
rather just internal job state required to tear the job down on the
hardware.

So what happens is that you either have stale pointers in the TDR which can
go boom extremely easily or we somehow find a way to keep the necessary

I have not experienced the TDR going boom in Xe.


structures (which include struct thread_info and struct file for this driver
connection) alive until all submissions are completed.


In Xe we keep everything alive until all submissions are completed. By
everything I mean the drm job, entity, scheduler, and VM via a reference
counting scheme. All of these structures are just kernel state which can
safely be accessed even if the application has been killed.

Yeah, but that might just not be such a good idea from memory management
point of view.

When you (for example) kill a process, all resources from that process should
at least be queued to be freed more or less immediately.


We do this, the TDR kicks jobs off the hardware as fast as the hw
interface allows and signals all pending hw fences immediately after.
free_job is then immediately called and the reference count goes to
zero. I think max time for all of this to occur is a handful of ms.


What Linux is doing for other I/O operations is to keep the relevant pages
alive until the I/O operation is completed, but for GPUs that usually means
keeping most of the memory of the process alive and that in turn is really
not something you can do.

You can of course do this if your driver has a reliable way of killing your
submissions and freeing resources in a reasonable amount of time. This
should then be done in the flush callback.


'flush callback' - Do you mean drm_sched_entity_flush? I looked at that
and I don't think that function does what you describe. It flushes
the spsc queue, but what about jobs already on the hardware - how do
those get killed?

As stated, we do that via the TDR, which is a rather clean design and fits
with our reference counting scheme.


If we need to tear down on demand we just set the TDR to a minimum value and
it kicks the jobs off the hardware, gracefully cleans everything up and
drops all references. This is a benefit of the 1 to 1 relationship, not
sure if this works with how AMDGPU uses the scheduler.


Delaying application tear down is also not an option because then you run
into massive trouble with the OOM killer (or more generally OOM handling).
See what we do in drm_sched_entity_flush() as well.


Not an issue for Xe, we never call drm_sched_entity_flush as our
reference counting scheme ensures all jobs are finished before we attempt
to tear down the entity / scheduler.

I don't think you can do that upstream. Calling drm_sched_entity_flush() is
a must have from your flush callback for the file descriptor.


Again, 'flush callback'? What are you referring to?

And why does drm_sched_entity_flush need to be called? It doesn't seem to
do anything useful.


Unless you have some other method for killing your submissions this would
open a denial of service attack vector when the Xe driver is in
use.


Yes, once the TDR fires it disallows all new submissions at the exec
IOCTL plus flushes any pending submissions as fast as possible.


Since adding the TDR support we have thoroughly exercised this over the
last two or three years or so. And to sum it up I would really like to get
away from this mess again.

Compared to that what i915 does is actually rather clean I think.


Not even close, resets were a nightmare in the i915 (I spent years
trying to get this right and it probably still doesn't completely work) and
in Xe we basically got it right on the first attempt.


Also, in Xe some of the
things done in free_job cannot be done from an IRQ context, hence calling
this from the scheduler worker is rather helpful.

Well, putting things for cleanup into a work item doesn't sound like
something hard.


That is exactly what we are doing in the scheduler with the free_job
work item.

Yeah, but I think that doing it in the scheduler and not in the driver is
problematic.

Christian, I do see your point on simply getting rid of the free_job callbacks
here and then using the fence with an own-driver workqueue for the house
cleaning. But I wonder if
starti
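
A rough sketch of the direction Rodrigo and Christian are discussing above,
keeping sleeping cleanup out of the fence signalling path by hooking the job's
finished fence and deferring the house cleaning to a driver-owned workqueue.
Everything prefixed my_ is a hypothetical placeholder, not an existing driver
API.

#include <linux/dma-fence.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct my_job_cleanup {
	struct dma_fence_cb cb;		/* fence callback, may run in IRQ context */
	struct work_struct work;	/* where the sleeping cleanup happens */
	void *payload;			/* whatever needs to be torn down */
};

/* Created once at driver init, e.g. with alloc_workqueue(). */
static struct workqueue_struct *my_cleanup_wq;

static void my_cleanup_work(struct work_struct *work)
{
	struct my_job_cleanup *c = container_of(work, struct my_job_cleanup, work);

	/* Safe to sleep here: unpin memory, drop references, free the job... */
	kfree(c);
}

/* Runs from the fence signalling path, possibly in IRQ context. */
static void my_fence_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
{
	struct my_job_cleanup *c = container_of(cb, struct my_job_cleanup, cb);

	queue_work(my_cleanup_wq, &c->work);
}

/* Arm the cleanup when the job is pushed to the hardware. */
static int my_arm_cleanup(struct dma_fence *finished, void *payload)
{
	struct my_job_cleanup *c = kzalloc(sizeof(*c), GFP_KERNEL);

	if (!c)
		return -ENOMEM;

	c->payload = payload;
	INIT_WORK(&c->work, my_cleanup_work);

	/* If the fence has already signalled, queue the cleanup directly. */
	if (dma_fence_add_callback(finished, &c->cb, my_fence_cb))
		queue_work(my_cleanup_wq, &c->work);

	return 0;
}

Whether this lives in each driver or stays in the scheduler as a free_job work
item is exactly the disagreement in this thread; the mechanics are the same
either way.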

Re: [Intel-xe] [PATCH v2 4/9] drm/sched: Split free_job into own work item

2023-08-23 Thread Matthew Brost
On Wed, Aug 23, 2023 at 01:26:09PM -0400, Rodrigo Vivi wrote:
> On Wed, Aug 23, 2023 at 11:41:19AM -0400, Alex Deucher wrote:
> > On Wed, Aug 23, 2023 at 11:26 AM Matthew Brost  
> > wrote:
> > >
> > > On Wed, Aug 23, 2023 at 09:10:51AM +0200, Christian König wrote:
> > > > Am 23.08.23 um 05:27 schrieb Matthew Brost:
> > > > > [SNIP]
> > > > > > That is exactly what I want to avoid, tying the TDR to the job is 
> > > > > > what some
> > > > > > AMD engineers pushed for because it looked like a simple solution 
> > > > > > and made
> > > > > > the whole thing similar to what Windows does.
> > > > > >
> > > > > > This turned the previous relatively clean scheduler and TDR design 
> > > > > > into a
> > > > > > complete nightmare. The job contains quite a bunch of things which 
> > > > > > are not
> > > > > > necessarily available after the application which submitted the job 
> > > > > > is torn
> > > > > > down.
> > > > > >
> > > > > Agree the TDR shouldn't be accessing anything application specific
> > > > > rather just internal job state required to tear the job down on the
> > > > > hardware.
> > > > > > So what happens is that you either have stale pointers in the TDR 
> > > > > > which can
> > > > > > go boom extremely easily or we somehow find a way to keep the 
> > > > > > necessary
> > > > > I have not experienced the TDR going boom in Xe.
> > > > >
> > > > > > structures (which include struct thread_info and struct file for 
> > > > > > this driver
> > > > > > connection) alive until all submissions are completed.
> > > > > >
> > > > > In Xe we keep everything alive until all submissions are completed. By
> > > > > everything I mean the drm job, entity, scheduler, and VM via a 
> > > > > reference
> > > > > counting scheme. All of these structures are just kernel state which 
> > > > > can
> > > > > safely be accessed even if the application has been killed.
> > > >
> > > > Yeah, but that might just not be such a good idea from memory management
> > > > point of view.
> > > >
> > > > When you (for example) kill a process, all resources from that process 
> > > > should
> > > > at least be queued to be freed more or less immediately.
> > > >
> > >
> > > We do this, the TDR kicks jobs off the hardware as fast as the hw
> > > interface allows and signals all pending hw fences immediately after.
> > > free_job is then immediately called and the reference count goes to
> > > zero. I think max time for all of this to occur is a handful of ms.
> > >
> > > > What Linux is doing for other I/O operations is to keep the relevant 
> > > > pages
> > > > alive until the I/O operation is completed, but for GPUs that usually 
> > > > means
> > > > keeping most of the memory of the process alive and that in turn is 
> > > > really
> > > > not something you can do.
> > > >
> > > > You can of course do this if your driver has a reliable way of killing 
> > > > your
> > > > submissions and freeing resources in a reasonable amount of time. This
> > > > should then be done in the flush callback.
> > > >
> > >
> > > 'flush callback' - Do you mean drm_sched_entity_flush? I looked at that
> > > and I don't think that function does what you describe. It flushes
> > > the spsc queue, but what about jobs already on the hardware - how do
> > > those get killed?
> > >
> > > As stated, we do that via the TDR, which is a rather clean design and
> > > fits with our reference counting scheme.
> > >
> > > > > If we need to tear down on demand we just set the TDR to a minimum 
> > > > > value and
> > > > > it kicks the jobs off the hardware, gracefully cleans everything up 
> > > > > and
> > > > > drops all references. This is a benefit of the 1 to 1 relationship, 
> > > > > not
> > > > > sure if this works with how AMDGPU uses the scheduler.
> > > > >
> > > > > > Delaying application tear down is also not an option because then 
> > > > > > you run
> > > > > > into massive trouble with the OOM killer (or more generally OOM 
> > > > > > handling).
> > > > > > See what we do in drm_sched_entity_flush() as well.
> > > > > >
> > > > > Not an issue for Xe, we never call drm_sched_entity_flush as our
> > > > > reference counting scheme ensures all jobs are finished before we
> > > > > attempt to tear down the entity / scheduler.
> > > >
> > > > I don't think you can do that upstream. Calling 
> > > > drm_sched_entity_flush() is
> > > > a must have from your flush callback for the file descriptor.
> > > >
> > >
> > > Again, 'flush callback'? What are you referring to?
> > >
> > > And why does drm_sched_entity_flush need to be called? It doesn't seem to
> > > do anything useful.
> > >
> > > > Unless you have some other method for killing your submissions this 
> > > > would
> > > > open a denial of service attack vector when the Xe driver is in
> > > > use.
> > > >
> > >
> > > Yes, once the TDR fires it disallows all new submissions at the exec
> > > IOCTL plus flushes any pending submissions as fast as possible.
> > >
> > > > > > Since adding the TDR support we
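
The "flush callback for the file descriptor" Christian refers to is the
file_operations.flush hook of the DRM file. amdgpu, for example, uses it to
call drm_sched_entity_flush() for the closing client's entities so that
pending submissions are flushed before process teardown continues. A minimal
sketch, with hypothetical my_ names and a single entity per client for
brevity:

#include <drm/drm_file.h>
#include <drm/gpu_scheduler.h>
#include <linux/fs.h>
#include <linux/jiffies.h>

struct my_file_priv {
	struct drm_sched_entity entity;	/* real drivers track many entities */
};

static int my_drm_flush(struct file *f, fl_owner_t id)
{
	struct drm_file *file_priv = f->private_data;
	struct my_file_priv *fpriv = file_priv->driver_priv;
	long timeout = msecs_to_jiffies(100);	/* arbitrary upper bound */

	/* Waits for the entity to go idle, bounded when the task is exiting. */
	timeout = drm_sched_entity_flush(&fpriv->entity, timeout);

	/* Only propagate a real error; 0 or remaining jiffies mean done here. */
	return timeout >= 0 ? 0 : timeout;
}

In amdgpu this hook sits in the driver's file_operations next to drm_release(),
which is what makes the flush run as part of normal file descriptor teardown.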

Re: [Intel-xe] [PATCH v2 4/9] drm/sched: Split free_job into own work item

2023-08-23 Thread Rodrigo Vivi
On Wed, Aug 23, 2023 at 11:41:19AM -0400, Alex Deucher wrote:
> On Wed, Aug 23, 2023 at 11:26 AM Matthew Brost  
> wrote:
> >
> > On Wed, Aug 23, 2023 at 09:10:51AM +0200, Christian König wrote:
> > > Am 23.08.23 um 05:27 schrieb Matthew Brost:
> > > > [SNIP]
> > > > > That is exactly what I want to avoid, tying the TDR to the job is 
> > > > > what some
> > > > > AMD engineers pushed for because it looked like a simple solution and 
> > > > > made
> > > > > the whole thing similar to what Windows does.
> > > > >
> > > > > This turned the previous relatively clean scheduler and TDR design 
> > > > > into a
> > > > > complete nightmare. The job contains quite a bunch of things which 
> > > > > are not
> > > > > necessarily available after the application which submitted the job 
> > > > > is torn
> > > > > down.
> > > > >
> > > > Agree the TDR shouldn't be accessing anything application specific
> > > > rather just internal job state required to tear the job down on the
> > > > hardware.
> > > > > So what happens is that you either have stale pointers in the TDR 
> > > > > which can
> > > > > go boom extremely easily or we somehow find a way to keep the 
> > > > > necessary
> > > > I have not experienced the TDR going boom in Xe.
> > > >
> > > > > structures (which include struct thread_info and struct file for this 
> > > > > driver
> > > > > connection) alive until all submissions are completed.
> > > > >
> > > > In Xe we keep everything alive until all submissions are completed. By
> > > > everything I mean the drm job, entity, scheduler, and VM via a reference
> > > > counting scheme. All of these structures are just kernel state which can
> > > > safely be accessed even if the application has been killed.
> > >
> > > Yeah, but that might just not be such a good idea from memory management
> > > point of view.
> > >
> > > When you (for example) kill a process, all resources from that process 
> > > should
> > > at least be queued to be freed more or less immediately.
> > >
> >
> > We do this, the TDR kicks jobs off the hardware as fast as the hw
> > interface allows and signals all pending hw fences immediately after.
> > free_job is then immediately called and the reference count goes to
> > zero. I think max time for all of this to occur is a handful of ms.
> >
> > > What Linux is doing for other I/O operations is to keep the relevant pages
> > > alive until the I/O operation is completed, but for GPUs that usually 
> > > means
> > > keeping most of the memory of the process alive and that in turn is really
> > > not something you can do.
> > >
> > > You can of course do this if your driver has a reliable way of killing 
> > > your
> > > submissions and freeing resources in a reasonable amount of time. This
> > > should then be done in the flush callback.
> > >
> >
> > 'flush callback' - Do you mean drm_sched_entity_flush? I looked at that
> > and I don't think that function does what you describe. It flushes
> > the spsc queue, but what about jobs already on the hardware - how do
> > those get killed?
> >
> > As stated, we do that via the TDR, which is a rather clean design and
> > fits with our reference counting scheme.
> >
> > > > If we need to tear down on demand we just set the TDR to a minimum value 
> > > > and
> > > > it kicks the jobs off the hardware, gracefully cleans everything up and
> > > > drops all references. This is a benefit of the 1 to 1 relationship, not
> > > > sure if this works with how AMDGPU uses the scheduler.
> > > >
> > > > > Delaying application tear down is also not an option because then you 
> > > > > run
> > > > > into massive trouble with the OOM killer (or more generally OOM 
> > > > > handling).
> > > > > See what we do in drm_sched_entity_flush() as well.
> > > > >
> > > > Not an issue for Xe, we never call drm_sched_entity_flush as our
> > > > reference counting scheme ensures all jobs are finished before we attempt
> > > > to tear down the entity / scheduler.
> > >
> > > I don't think you can do that upstream. Calling drm_sched_entity_flush() 
> > > is
> > > a must have from your flush callback for the file descriptor.
> > >
> >
> > Again, 'flush callback'? What are you referring to?
> >
> > And why does drm_sched_entity_flush need to be called? It doesn't seem to
> > do anything useful.
> >
> > > Unless you have some other method for killing your submissions this would
> > > open a denial of service attack vector when the Xe driver is in
> > > use.
> > >
> >
> > Yes, once the TDR fires it disallows all new submissions at the exec
> > IOCTL plus flushes any pending submissions as fast as possible.
> >
> > > > > Since adding the TDR support we have thoroughly exercised this over 
> > > > > the
> > > > > last two or three years or so. And to sum it up I would really like 
> > > > > to get
> > > > > away from this mess again.
> > > > >
> > > > > Compared to that what i915 does is actually rather clean I think.
> > > > >
> > > > Not even close, resets were a nightmare in the i915 (I s
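
The "once the TDR fires it disallows all new submissions at the exec IOCTL"
behaviour mentioned earlier in the thread can be pictured as nothing more than
a ban flag, set by the timeout handler and checked at the top of the exec
IOCTL. A hypothetical sketch, not the actual Xe code:

#include <linux/atomic.h>
#include <linux/errno.h>

struct my_exec_queue {
	atomic_t banned;	/* set once the TDR has fired for this queue */
};

/* Called from the timeout handler after the context has been killed. */
static void my_queue_ban(struct my_exec_queue *q)
{
	atomic_set(&q->banned, 1);
}

/* First check in the exec IOCTL, before building a new job. */
static int my_exec_check_banned(struct my_exec_queue *q)
{
	return atomic_read(&q->banned) ? -ECANCELED : 0;
}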