Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-16 Thread Andrey Grodzovsky


On 2022-05-16 11:08, Christian König wrote:

Am 16.05.22 um 16:12 schrieb Andrey Grodzovsky:


Ping



Ah, yes sorry.


Andrey

On 2022-05-13 11:41, Andrey Grodzovsky wrote:

Yes, exactly that's the idea.

Basically the reset domain knows which amdgpu devices it needs to
reset together.


Whether you then represent that so that you always have a hive even
when you only have one device in it, or you put an array of devices
which need to be reset together into the reset domain, doesn't matter.


Maybe go for the latter approach, that is probably a bit cleaner and
less code to change.


Christian.



Unfortunately this approach also raises a few difficulties -
First - if holding an array of devices in reset_domain, then when you
come to the GPU reset function you don't really know which adev is the
one that triggered the reset, and this is actually essential to some
procedures like emergency restart.


What is "emergency restart"? That's not some requirement I know about.



Emergency restart is something you can see at the beginning of the
amdgpu_gpu_recover function - it's a historical workaround for some
types of ASICs that weren't able to do a full reset, I think. We should
eventually remove it, but for now I think it's still in use.
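
For context, a minimal sketch of that kind of early-exit at the top of
the recovery path. Only emergency_restart() is a real kernel call; the
predicate and the function names here are placeholders, not the actual
driver code - the point is simply that the check needs the adev that
triggered the reset:

#include <linux/reboot.h>       /* emergency_restart() */
#include "amdgpu.h"             /* struct amdgpu_device */

/* Placeholder check: stands in for whatever fatal-error condition the
 * driver evaluates on the adev that triggered the reset. */
static bool sketch_fatal_error_pending(struct amdgpu_device *adev)
{
        return false;
}

static void sketch_gpu_recover_entry(struct amdgpu_device *adev)
{
        if (sketch_fatal_error_pending(adev)) {
                dev_warn(adev->dev, "fatal error, restarting system\n");
                emergency_restart();    /* does not return */
        }

        /* ... normal reset path continues here ... */
}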







Second - in the XGMI case we must take into account that one of the hive
members might go away at runtime (I could do echo 1 >
/sysfs/pci_id/remove on it, for example, at any moment) - so now we
need to maintain this array and probably mark such an entry with NULL
on XGMI node removal, and then there might be hot insertion and all
this adds more complications.


I now tend to prefer your initial solution for its simplicity, and
the result will be what we need -


"E.g. in the reset code (either before or after the reset, that's 
debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
"


Works for me. I already expected that switching the reset over to be
based on the reset context wouldn't be that easy.


Regards,
Christian.



OK - I will resend a patch.

Andrey






Let me know what you think.

Andrey 
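
To make the agreed approach a bit more concrete, a hedged sketch of such
a blunt cancellation pass. The per-ring part mirrors how the DRM
scheduler keeps its TDR work (field names from memory, so they may not
match the exact kernel version); the adev->*_work fields are
placeholders carried over from the pseudocode quoted above, not the real
amdgpu_device layout:

#include <linux/workqueue.h>
#include "amdgpu.h"

/* Sketch: instead of tracking which reset source fired, bluntly cancel
 * every possible source around the HW reset. */
static void sketch_cancel_all_reset_sources(struct amdgpu_device *adev)
{
        int i;

        /* Per-ring TDR work lives in each ring's DRM scheduler. */
        for (i = 0; i < adev->num_rings; ++i) {
                struct amdgpu_ring *ring = adev->rings[i];

                if (ring && ring->sched.ready)
                        cancel_delayed_work(&ring->sched.work_tdr);
        }

        /* Driver-global sources; these fields are placeholders taken from
         * the mail above and do not exist under these names in adev. */
        cancel_work(&adev->ras_work);
        cancel_work(&adev->iofault_work);
        cancel_work(&adev->debugfs_work);
}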


Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-16 Thread Christian König

Am 16.05.22 um 16:12 schrieb Andrey Grodzovsky:


Ping



Ah, yes sorry.


Andrey

On 2022-05-13 11:41, Andrey Grodzovsky wrote:

Yes, exactly that's the idea.

Basically the reset domain knows which amdgpu devices it needs to
reset together.


Whether you then represent that so that you always have a hive even
when you only have one device in it, or you put an array of devices
which need to be reset together into the reset domain, doesn't matter.


Maybe go for the latter approach, that is probably a bit cleaner and
less code to change.


Christian.



Unfortunately this approach also raises a few difficulties -
First - if holding an array of devices in reset_domain, then when you
come to the GPU reset function you don't really know which adev is the
one that triggered the reset, and this is actually essential to some
procedures like emergency restart.


What is "emergency restart"? That's not some requirement I know about.



Second - in the XGMI case we must take into account that one of the hive
members might go away at runtime (I could do echo 1 >
/sysfs/pci_id/remove on it, for example, at any moment) - so now we
need to maintain this array and probably mark such an entry with NULL
on XGMI node removal, and then there might be hot insertion and all
this adds more complications.


I now tend to prefer your initial solution for its simplicity, and
the result will be what we need -


"E.g. in the reset code (either before or after the reset, that's 
debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
"


Works for me. I already expected that switching the reset over to be
based on the reset context wouldn't be that easy.


Regards,
Christian.



Let me know what you think.

Andrey 


Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-16 Thread Andrey Grodzovsky

Ping

Andrey

On 2022-05-13 11:41, Andrey Grodzovsky wrote:

Yes, exactly that's the idea.

Basically the reset domain knows which amdgpu devices it needs to
reset together.


Whether you then represent that so that you always have a hive even
when you only have one device in it, or you put an array of devices
which need to be reset together into the reset domain, doesn't matter.


Maybe go for the latter approach, that is probably a bit cleaner and
less code to change.


Christian.



Unfortunately this approach also raises a few difficulties -
First - if holding an array of devices in reset_domain, then when you
come to the GPU reset function you don't really know which adev is the
one that triggered the reset, and this is actually essential to some
procedures like emergency restart.


Second - in the XGMI case we must take into account that one of the hive
members might go away at runtime (I could do echo 1 >
/sysfs/pci_id/remove on it, for example, at any moment) - so now we
need to maintain this array and probably mark such an entry with NULL
on XGMI node removal, and then there might be hot insertion and all
this adds more complications.


I now tend to prefer your initial solution for its simplicity, and
the result will be what we need -


"E.g. in the reset code (either before or after the reset, that's 
debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
"

Let me know what you think.

Andrey 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-13 Thread Andrey Grodzovsky



On 2022-05-12 09:15, Christian König wrote:

Am 12.05.22 um 15:07 schrieb Andrey Grodzovsky:


On 2022-05-12 02:06, Christian König wrote:

Am 11.05.22 um 22:27 schrieb Andrey Grodzovsky:


On 2022-05-11 11:39, Christian König wrote:

Am 11.05.22 um 17:35 schrieb Andrey Grodzovsky:

On 2022-05-11 11:20, Lazar, Lijo wrote:



On 5/11/2022 7:28 PM, Christian König wrote:

Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky:

On 2022-05-11 03:38, Christian König wrote:

Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky:

[SNIP]
E.g. in the reset code (either before or after the reset, 
that's debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
...

You don't really need to track which reset source has fired 
and which hasn't, because that would be racy again. Instead 
just bluntly reset all possible sources.


Christian.



I don't say we care whether it fired once or twice (I need to add
a fix to only insert reset work into the pending reset list if
it's not already there). The point of using a list (or array)
to which you add and from which you remove is that this logic
is encapsulated within the reset domain. In your way we need
to be aware of who exactly schedules reset work and
explicitly cancel it, and it also means that for any new
source added in the future you will need to remember to add it


I don't think that this is a valid argument. Additionally to 
the schedulers we probably just need less than a handful of 
reset sources, most likely even just one or two is enough.


The only justification I can see of having additional 
separate reset sources would be if somebody wants to know if 
a specific source has been handled or not (e.g. call 
flush_work() or work_pending()). Like in the case of a reset 
triggered through debugfs.



This is indeed one reason; another is, as we said before, that if
you share a 'reset source' (meaning a delayed work) with
another client (i.e. RAS and KFD), you make the assumption
that the other client always proceeds with the
reset exactly the same way as you expect. Today we have
this only between scheduler and non-scheduler resets - non
scheduler reset clients assume the reset is always fully
executed in HW, while scheduler based reset takes shortcuts and
does not always do a HW reset, hence they cannot share a 'reset
source' (delayed work). Yes, we can always add this in the
future if and when such a problem arises, but no one will
remember it then, and a new bug will be introduced and will
take time to find and resolve.


Mhm, so your main concern is that we forget to correctly handle 
the new reset sources?


How about we do it like this then:

struct amdgpu_reset_domain {
 
 union {
     struct {
         struct work_item debugfs;
         struct work_item ras;
         
     };
     struct work_item array[]
 } reset_sources;
}



If it's only about static array,

enum amdgpu_reset_source {

AMDGPU_RESET_SRC_RAS,
AMDGPU_RESET_SRC_ABC,
.
AMDGPU_RESET_SRC_XYZ,
AMDGPU_RESET_SRC_MAX

};

struct work_struct reset_work[AMDGPU_RESET_SRC_MAX]; => An index 
for each work item



Thanks,
Lijo
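
Spelled out a little further, a hedged sketch of how such an
enum-indexed array could be wired up inside the reset domain; the source
names, the handler and the init/queue helpers below are purely
illustrative, not existing driver code:

#include <linux/workqueue.h>

enum amdgpu_reset_source {
        AMDGPU_RESET_SRC_RAS,
        AMDGPU_RESET_SRC_IOFAULT,
        AMDGPU_RESET_SRC_DEBUGFS,
        AMDGPU_RESET_SRC_MAX
};

/* Illustrative reset domain: one work item per source, all queued on
 * the domain's single-threaded workqueue. */
struct sketch_reset_domain {
        struct workqueue_struct *wq;
        struct work_struct reset_work[AMDGPU_RESET_SRC_MAX];
};

static void sketch_reset_handler(struct work_struct *work)
{
        /* A real handler would recover its context (e.g. via a back
         * pointer stored next to each work item) and run the shared
         * reset path. */
}

static void sketch_reset_domain_init(struct sketch_reset_domain *domain)
{
        int i;

        for (i = 0; i < AMDGPU_RESET_SRC_MAX; ++i)
                INIT_WORK(&domain->reset_work[i], sketch_reset_handler);
}

/* Clients only name their source; cancellation can later iterate the
 * same array without knowing who scheduled what. */
static bool sketch_queue_reset(struct sketch_reset_domain *domain,
                               enum amdgpu_reset_source src)
{
        return queue_work(domain->wq, &domain->reset_work[src]);
}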



It's possible, though it makes it harder to generalize reset_domain
later for other drivers.
But still one caveat - look at amdgpu_recover_work_struct and its
usage in amdgpu_device_gpu_recover and in gpu_recover_get.
At least for debugfs I need to return the result of the GPU
reset, and so I cannot store the actual work items in the array
mentioned above
but rather pointers to work_item, because I need a way to get back
the return value from gpu_recover like I do now in
amdgpu_device_gpu_recover.


You should try to avoid that as well.

See when the debugfs reset is canceled because of a scheduler 
reset you won't get a useful return code either.


What we should do instead is to cache the status of the last reset 
in the reset domain.


Regards,
Christian.



Another problem with this approach - to execute the actual GPU
reset I need access to the concrete amdgpu_device pointer from the work
struct (see xgpu_ai_mailbox_flr_work as an example). If I store all
work items in
an array in amdgpu_reset_domain, the most I can retrieve is the
reset_domain struct itself, which won't help since it's dynamically
allocated, not embedded in the hive or adev, and can be one per
device or per entire hive in the XGMI case, so there is no way
for me to reach back to amdgpu_device. A back pointer to adev* from
amdgpu_reset_domain will only work for a single device but not for an
XGMI hive where there are multiple devices in a hive.


Which is exactly the reason why I think we should always allocate 
the hive structure, even if we only have one device. And a GPU reset 
should then always work with the hive data structure and not adev.



I am not sure why HIVE is the object we should work with, hive 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-12 Thread Andrey Grodzovsky
Sure, I will investigate that. What about the ticket which Lijo raised,
which was basically about doing 8 resets instead of one? Lijo - can this
ticket wait until I come up with this new design for the amdgpu reset
function, or do you need a quick solution now, in which case we can use
the already existing patch temporarily.


Andrey

On 2022-05-12 09:15, Christian König wrote:
I am not sure why HIVE is the object we should work with - a hive is one
use case, a single device is another, and then Lijo described something
called a partition, which is what? A particular pipe within the GPU? What
they all share in common,
IMHO, is that all of them use a reset domain when they want a recovery
operation, so maybe GPU reset should be oriented to work with the reset
domain?


Yes, exactly that's the idea.

Basically the reset domain knows which amdgpu devices it needs to
reset together.


Whether you then represent that so that you always have a hive even
when you only have one device in it, or you put an array of devices
which need to be reset together into the reset domain, doesn't matter.


Maybe go for the latter approach, that is probably a bit cleaner and
less code to change.


Christian. 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-12 Thread Christian König

Am 12.05.22 um 15:07 schrieb Andrey Grodzovsky:


On 2022-05-12 02:06, Christian König wrote:

Am 11.05.22 um 22:27 schrieb Andrey Grodzovsky:


On 2022-05-11 11:39, Christian König wrote:

Am 11.05.22 um 17:35 schrieb Andrey Grodzovsky:

On 2022-05-11 11:20, Lazar, Lijo wrote:



On 5/11/2022 7:28 PM, Christian König wrote:

Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky:

On 2022-05-11 03:38, Christian König wrote:

Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky:

[SNIP]
E.g. in the reset code (either before or after the reset, 
that's debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
...

You don't really need to track which reset source has fired 
and which hasn't, because that would be racy again. Instead 
just bluntly reset all possible sources.


Christian.



I don't say we care whether it fired once or twice (I need to add
a fix to only insert reset work into the pending reset list if it's
not already there). The point of using a list (or array) to
which you add and from which you remove is that this logic is
encapsulated within the reset domain. In your way we need
to be aware of who exactly schedules reset work and explicitly
cancel it, and it also means that for any new source added in
the future you will need to remember to add it


I don't think that this is a valid argument. Additionally to 
the schedulers we probably just need less than a handful of 
reset sources, most likely even just one or two is enough.


The only justification I can see of having additional separate 
reset sources would be if somebody wants to know if a specific 
source has been handled or not (e.g. call flush_work() or 
work_pending()). Like in the case of a reset triggered through 
debugfs.



This is indeed one reason; another is, as we said before, that if
you share a 'reset source' (meaning a delayed work) with another
client (i.e. RAS and KFD), you make the assumption that the
other client always proceeds with the
reset exactly the same way as you expect. Today we have this
only between scheduler and non-scheduler resets - non
scheduler reset clients assume the reset is always fully
executed in HW, while scheduler based reset takes shortcuts and
does not always do a HW reset, hence they cannot share a 'reset source'
(delayed work). Yes, we can always add this in the future if
and when such a problem arises, but no one will remember it
then, and a new bug will be introduced and will take time to
find and resolve.


Mhm, so your main concern is that we forget to correctly handle 
the new reset sources?


How about we do it like this then:

struct amdgpu_reset_domain {
 
 union {
     struct {
         struct work_item debugfs;
         struct work_item ras;
         
     };
     struct work_item array[]
 } reset_sources;
}



If it's only about static array,

enum amdgpu_reset_source {

AMDGPU_RESET_SRC_RAS,
AMDGPU_RESET_SRC_ABC,
.
AMDGPU_RESET_SRC_XYZ,
AMDGPU_RESET_SRC_MAX

};

struct work_struct reset_work[AMDGPU_RESET_SRC_MAX]; => An index 
for each work item



Thanks,
Lijo



It's possible, though it makes it harder to generalize reset_domain
later for other drivers.
But still one caveat - look at amdgpu_recover_work_struct and its
usage in amdgpu_device_gpu_recover and in gpu_recover_get.
At least for debugfs I need to return the result of the GPU reset,
and so I cannot store the actual work items in the array mentioned above
but rather pointers to work_item, because I need a way to get back
the return value from gpu_recover like I do now in
amdgpu_device_gpu_recover.


You should try to avoid that as well.

See when the debugfs reset is canceled because of a scheduler reset 
you won't get a useful return code either.


What we should do instead is to cache the status of the last reset 
in the reset domain.


Regards,
Christian.



Another problem with this approach - to execute the actual GPU
reset I need access to the concrete amdgpu_device pointer from the work
struct (see xgpu_ai_mailbox_flr_work as an example). If I store all
work items in
an array in amdgpu_reset_domain, the most I can retrieve is the
reset_domain struct itself, which won't help since it's dynamically
allocated, not embedded in the hive or adev, and can be one per
device or per entire hive in the XGMI case, so there is no way for
me to reach back to amdgpu_device. A back pointer to adev* from
amdgpu_reset_domain will only work for a single device but not for an
XGMI hive where there are multiple devices in a hive.


Which is exactly the reason why I think we should always allocate the 
hive structure, even if we only have one device. And a GPU reset 
should then always work with the hive data structure and not adev.



I am not sure why HIVE is the object we should work with, hive is one 
use case, single device is another, then 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-12 Thread Andrey Grodzovsky



On 2022-05-12 02:06, Christian König wrote:

Am 11.05.22 um 22:27 schrieb Andrey Grodzovsky:


On 2022-05-11 11:39, Christian König wrote:

Am 11.05.22 um 17:35 schrieb Andrey Grodzovsky:

On 2022-05-11 11:20, Lazar, Lijo wrote:



On 5/11/2022 7:28 PM, Christian König wrote:

Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky:

On 2022-05-11 03:38, Christian König wrote:

Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky:

[SNIP]
E.g. in the reset code (either before or after the reset, 
that's debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
...

You don't really need to track which reset source has fired 
and which hasn't, because that would be racy again. Instead 
just bluntly reset all possible sources.


Christian.



I don't say we care whether it fired once or twice (I need to add a
fix to only insert reset work into the pending reset list if it's
not already there). The point of using a list (or array) to
which you add and from which you remove is that this logic is
encapsulated within the reset domain. In your way we need
to be aware of who exactly schedules reset work and explicitly
cancel it, and it also means that for any new source added in
the future you will need to remember to add it


I don't think that this is a valid argument. Additionally to 
the schedulers we probably just need less than a handful of 
reset sources, most likely even just one or two is enough.


The only justification I can see of having additional separate 
reset sources would be if somebody wants to know if a specific 
source has been handled or not (e.g. call flush_work() or 
work_pending()). Like in the case of a reset triggered through 
debugfs.



This is indeed one reason; another is, as we said before, that if
you share a 'reset source' (meaning a delayed work) with another
client (i.e. RAS and KFD), you make the assumption that the
other client always proceeds with the
reset exactly the same way as you expect. Today we have this
only between scheduler and non-scheduler resets - non
scheduler reset clients assume the reset is always fully
executed in HW, while scheduler based reset takes shortcuts and
does not always do a HW reset, hence they cannot share a 'reset source'
(delayed work). Yes, we can always add this in the future if and
when such a problem arises, but no one will remember it then
and a new bug will be introduced and will take time to find and
resolve.


Mhm, so your main concern is that we forget to correctly handle 
the new reset sources?


How about we do it like this then:

struct amdgpu_reset_domain {
 
 union {
     struct {
         struct work_item debugfs;
         struct work_item ras;
         
     };
     struct work_item array[]
 } reset_sources;
}



If it's only about static array,

enum amdgpu_reset_source {

AMDGPU_RESET_SRC_RAS,
AMDGPU_RESET_SRC_ABC,
.
AMDGPU_RESET_SRC_XYZ,
AMDGPU_RESET_SRC_MAX

};

struct work_struct reset_work[AMDGPU_RESET_SRC_MAX]; => An index 
for each work item



Thanks,
Lijo



It's possible, though it makes it harder to generalize reset_domain
later for other drivers.
But still one caveat - look at amdgpu_recover_work_struct and its
usage in amdgpu_device_gpu_recover and in gpu_recover_get.
At least for debugfs I need to return the result of the GPU reset
and so I cannot store the actual work items in the array mentioned above
but rather pointers to work_item, because I need a way to get back
the return value from gpu_recover like I do now in
amdgpu_device_gpu_recover.


You should try to avoid that as well.

See when the debugfs reset is canceled because of a scheduler reset 
you won't get a useful return code either.


What we should do instead is to cache the status of the last reset 
in the reset domain.


Regards,
Christian.



Another problem with this approach - to execute the actual GPU
reset I need access to the concrete amdgpu_device pointer from the work
struct (see xgpu_ai_mailbox_flr_work as an example). If I store all work
items in
an array in amdgpu_reset_domain, the most I can retrieve is the
reset_domain struct itself, which won't help since it's dynamically
allocated, not embedded in the hive or adev, and can be one per device
or per entire hive in the XGMI case, so there is no way for me to
reach back to amdgpu_device. A back pointer to adev* from
amdgpu_reset_domain will only work for a single device but not for an
XGMI hive where there are multiple devices in a hive.


Which is exactly the reason why I think we should always allocate the 
hive structure, even if we only have one device. And a GPU reset 
should then always work with the hive data structure and not adev.



I am not sure why HIVE is the object we should work with, hive is one 
use case, single device is another, then Lijo described something called 
partition which 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-12 Thread Andrey Grodzovsky



On 2022-05-12 02:03, Christian König wrote:

Am 11.05.22 um 17:57 schrieb Andrey Grodzovsky:

[SNIP]

How about we do it like this then:

struct amdgpu_reset_domain {
 
 union {
     struct {
         struct work_item debugfs;
         struct work_item ras;
         
     };
     struct work_item array[]
 } reset_sources;
}



If it's only about static array,

enum amdgpu_reset_source {

AMDGPU_RESET_SRC_RAS,
AMDGPU_RESET_SRC_ABC,
.
AMDGPU_RESET_SRC_XYZ,
AMDGPU_RESET_SRC_MAX

};

struct work_struct reset_work[AMDGPU_RESET_SRC_MAX]; => An index 
for each work item



Thanks,
Lijo



It's possible, though it makes it harder to generalize reset_domain
later for other drivers.
But still one caveat - look at amdgpu_recover_work_struct and its
usage in amdgpu_device_gpu_recover and in gpu_recover_get.
At least for debugfs I need to return the result of the GPU reset
and so I cannot store the actual work items in the array mentioned above
but rather pointers to work_item, because I need a way to get back
the return value from gpu_recover like I do now in
amdgpu_device_gpu_recover.


You should try to avoid that as well.



Why ?


Because pointers need a lifetime. When the work items are allocated as
part of the structure you can be sure that they are not freed nor
reused for something else.






What we should do instead is to cache the status of the last reset 
in the reset domain.



Probably an atomic ret in reset_domain then.
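
As a rough illustration of that, assuming an illustrative field name and
helpers (not the actual driver API):

#include <linux/atomic.h>

struct sketch_reset_domain {
        atomic_t last_reset_ret;        /* result of the most recent reset */
        /* ... workqueue, work items, locks ... */
};

/* The single-threaded reset handler stores its result ... */
static void sketch_set_last_reset_ret(struct sketch_reset_domain *d, int ret)
{
        atomic_set(&d->last_reset_ret, ret);
}

/* ... and callers such as debugfs read it back afterwards, instead of
 * expecting a return value from their own (possibly cancelled) work item. */
static int sketch_get_last_reset_ret(struct sketch_reset_domain *d)
{
        return atomic_read(&d->last_reset_ret);
}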

Another 2 points I forgot to ask -

1) What race condition did you have in mind about insertion into and
extraction from the list if it's all done under a lock?


A work item is essentially a linked list and a function to call when 
the worker thread has time to process the item. This means you now 
have two linked lists representing essentially the same.


To make it even worse those two lists are protected by two different 
locks. The work item list is protected by the worker lock and the 
reset item by our own.


Keeping all that synced up is rather racy.



I really don't see a problem, as the locking order is always preserved:
the reset list lock's scope is around the internal worker thread list
lock.
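
A small sketch of that nesting, with hypothetical names (the list node
is kept with INIT_LIST_HEAD()/list_del_init() so list_empty() works as a
membership test):

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>

struct sketch_reset_work {
        struct work_struct base;
        struct list_head node;          /* entry in domain->pending */
};

struct sketch_reset_domain {
        struct workqueue_struct *wq;
        struct mutex pending_lock;      /* outer lock, protects @pending */
        struct list_head pending;
};

static void sketch_queue_pending_reset(struct sketch_reset_domain *domain,
                                       struct sketch_reset_work *rwork)
{
        mutex_lock(&domain->pending_lock);      /* outer: reset list lock */
        if (list_empty(&rwork->node)) {
                list_add_tail(&rwork->node, &domain->pending);
                /* inner: the workqueue takes its own pool lock here */
                queue_work(domain->wq, &rwork->base);
        }
        mutex_unlock(&domain->pending_lock);
}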







2) This I asked already - why are you so opposed to allocation on
the stack? I understand the problem with dynamic memory allocations,
but why the stack? We do multiple allocations on the stack for any function
we call during GPU reset, so what is so special in this case where we
allocate the work struct on the stack? It's not like the work struct is
especially big compared to other stuff we allocate on the stack.


The problem is once more the lifetime. When the reset work items are 
allocated on the stack we absolutely need to make sure that their 
pointer is not flying around anywhere when the function returns. Keep 
in mind that a stack is just another form of memory management.


It's certainly possible to get that right, but it's just the more 
error prone approach.



I believe I got that right, as at any place where the reset_work item
life cycle ends I was always deleting it from the pending works list so
as not to leave dangling pointers.


I guess it's all pros and cons. Yes, it's maybe more risky this way since
it creates a risk of dangling pointers if one forgets to remove them, but
on the other hand this approach in my opinion gives a cleaner and more
generic design than
the other alternative, as you can see from the other problem I
encountered with not being able to retrieve the amdgpu_device pointer,
which we discuss in another thread, and from having to explicitly name
each reset source in the reset domain.


Andrey




Regards,
Christian.



Andrey




Regards,
Christian.



Andrey




Not 100% sure if that works, but something like that should do 
the trick.


My main concern is that I don't want to allocate the work items 
on the stack and dynamic allocation (e.g. kmalloc) is usually not 
possible either.


Additional to that putting/removing work items from a list, array 
or other container is a very common source for race conditions.


Regards,
Christian.



to the cancellation list which you showed above. In the current way
all this is done automatically within the reset_domain code, and it's
agnostic to the specific driver and its specific list of
reset sources. Also, in case we would want to generalize
reset_domain to other GPU drivers (which was
a plan as far as I remember), this explicit mention of each
reset work for cancellation is again less suitable in my
opinion.


Well we could put the work item for the scheduler independent 
reset source into the reset domain as well. But I'm not sure 
those additional reset sources should be part of any common 
handling, that is largely amdgpu specific.



So it's for sure more than one source, for the reasons described
above. Also note that for the scheduler we already cancel the delayed
work in drm_sched_stop, so cancelling it again in amdgpu code is kind
of superfluous.


Andrey




Christian.



Andrey






Andrey





Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-12 Thread Lazar, Lijo




On 5/12/2022 11:36 AM, Christian König wrote:

Am 11.05.22 um 22:27 schrieb Andrey Grodzovsky:


On 2022-05-11 11:39, Christian König wrote:

Am 11.05.22 um 17:35 schrieb Andrey Grodzovsky:

On 2022-05-11 11:20, Lazar, Lijo wrote:



On 5/11/2022 7:28 PM, Christian König wrote:

Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky:

On 2022-05-11 03:38, Christian König wrote:

Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky:

[SNIP]
E.g. in the reset code (either before or after the reset, 
that's debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
    cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
...

You don't really need to track which reset source has fired 
and which hasn't, because that would be racy again. Instead 
just bluntly reset all possible sources.


Christian.



I don't say we care if it fired once or twice (I need to add a 
fix to only insert reset work to pending reset list if it's not 
already there), the point of using list (or array) to which you 
add and from which you remove is that the logic of this is 
encapsulated within reset domain. In your way we need to be 
aware who exactly schedules reset work and explicitly cancel 
them, this also means that for any new source added in the 
future you will need to remember to add him


I don't think that this is a valid argument. Additionally to the 
schedulers we probably just need less than a handful of reset 
sources, most likely even just one or two is enough.


The only justification I can see of having additional separate 
reset sources would be if somebody wants to know if a specific 
source has been handled or not (e.g. call flush_work() or 
work_pending()). Like in the case of a reset triggered through 
debugfs.



This is indeed one reason, another is as we said before that if 
you share 'reset source' (meaning a delayed work) with another 
client (i.e. RAS and KFD) it means you make assumption that the 
other client always proceeds with the
reset exactly the same way as you expect. So today we have this 
only in scheduler vs non scheduler reset happening - non 
scheduler reset clients assume the reset is always fully executed 
in HW while scheduler based reset makes shortcuts and not always 
does HW reset hence they cannot share 'reset source' (delayed 
work). Yes, we can always add this in the future if and when such 
problem will arise but no one will remember this then and a new 
bug will be introduced and will take time to find and resolve.


Mhm, so your main concern is that we forget to correctly handle 
the new reset sources?


How about we do it like this then:

struct amdgpu_reset_domain {
 
 union {
     struct {
         struct work_item debugfs;
         struct work_item ras;
         
     };
     struct work_item array[]
 } reset_sources;
}



If it's only about static array,

enum amdgpu_reset_source {

AMDGPU_RESET_SRC_RAS,
AMDGPU_RESET_SRC_ABC,
.
AMDGPU_RESET_SRC_XYZ,
AMDGPU_RESET_SRC_MAX

};

struct work_struct reset_work[AMDGPU_RESET_SRC_MAX]; => An index 
for each work item



Thanks,
Lijo



It's possible though it makes harder to generalize reset_domain 
later for other drivers.
But still one caveat, look at amdgpu_recover_work_struct and it's 
usage in amdgpu_device_gpu_recover and in gpu_recover_get,
At least for debugfs i need to return back the result of GPU reset 
and so I cannot store actual work items in the array mentioned above
but rather pointers to work_item because i need a way to get back 
the return value from gpu_recover like I do now in 
amdgpu_device_gpu_recover.


You should try to avoid that as well.

See when the debugfs reset is canceled because of a scheduler reset 
you won't get a useful return code either.


What we should do instead is to cache the status of the last reset in 
the reset domain.


Regards,
Christian.



Another problem with this approach - to execute the actual GPU reset
I need access to the concrete amdgpu_device pointer from the work struct
(see xgpu_ai_mailbox_flr_work as an example). If I store all work items in
an array in amdgpu_reset_domain, the most I can retrieve is the
reset_domain struct itself, which won't help since it's dynamically
allocated, not embedded in the hive or adev, and can be one per device
or per entire hive in the XGMI case, so there is no way for me to
reach back to amdgpu_device. A back pointer to adev* from
amdgpu_reset_domain will only work for a single device but not for an
XGMI hive where there are multiple devices in a hive.


You could embed the work within another struct which provides the 
context of the work and have that as the array.


struct amdgpu_reset_job {
void *reset_context; // A struct containing the reset context, 
you may check amdgpu_reset_context.

struct work_struct reset_work;
unsigned long flags;
};

And keep an array of 
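
For what it's worth, a sketch of how such an embedded work item could
find its way back to its context in the handler via container_of(),
which is the standard kernel pattern; the struct follows Lijo's fragment
above and is illustrative rather than actual driver code:

#include <linux/kernel.h>       /* container_of() */
#include <linux/workqueue.h>

struct sketch_reset_job {
        void *reset_context;            /* e.g. points at the device/hive */
        struct work_struct reset_work;
        unsigned long flags;
};

static void sketch_reset_job_handler(struct work_struct *work)
{
        struct sketch_reset_job *job =
                container_of(work, struct sketch_reset_job, reset_work);

        /* job->reset_context now identifies what has to be reset, which
         * addresses the "no way to reach back to amdgpu_device" concern. */
}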

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-12 Thread Christian König

Am 11.05.22 um 22:27 schrieb Andrey Grodzovsky:


On 2022-05-11 11:39, Christian König wrote:

Am 11.05.22 um 17:35 schrieb Andrey Grodzovsky:

On 2022-05-11 11:20, Lazar, Lijo wrote:



On 5/11/2022 7:28 PM, Christian König wrote:

Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky:

On 2022-05-11 03:38, Christian König wrote:

Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky:

[SNIP]
E.g. in the reset code (either before or after the reset, 
that's debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
    cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
...

You don't really need to track which reset source has fired 
and which hasn't, because that would be racy again. Instead 
just bluntly reset all possible sources.


Christian.



I don't say we care if it fired once or twice (I need to add a 
fix to only insert reset work to pending reset list if it's not 
already there), the point of using list (or array) to which you 
add and from which you remove is that the logic of this is 
encapsulated within reset domain. In your way we need to be 
aware who exactly schedules reset work and explicitly cancel 
them, this also means that for any new source added in the 
future you will need to remember to add him


I don't think that this is a valid argument. Additionally to the 
schedulers we probably just need less than a handful of reset 
sources, most likely even just one or two is enough.


The only justification I can see of having additional separate 
reset sources would be if somebody wants to know if a specific 
source has been handled or not (e.g. call flush_work() or 
work_pending()). Like in the case of a reset triggered through 
debugfs.



This is indeed one reason, another is as we said before that if 
you share 'reset source' (meaning a delayed work) with another 
client (i.e. RAS and KFD) it means you make assumption that the 
other client always proceeds with the
reset exactly the same way as you expect. So today we have this 
only in scheduler vs non scheduler reset happening - non 
scheduler reset clients assume the reset is always fully executed 
in HW while scheduler based reset makes shortcuts and not always 
does HW reset hence they cannot share 'reset source' (delayed 
work). Yes, we can always add this in the future if and when such 
problem will arise but no one will remember this then and a new 
bug will be introduced and will take time to find and resolve.


Mhm, so your main concern is that we forget to correctly handle 
the new reset sources?


How about we do it like this then:

struct amdgpu_reset_domain {
 
 union {
     struct {
         struct work_item debugfs;
         struct work_item ras;
         
     };
     struct work_item array[]
 } reset_sources;
}



If it's only about static array,

enum amdgpu_reset_source {

AMDGPU_RESET_SRC_RAS,
AMDGPU_RESET_SRC_ABC,
.
AMDGPU_RESET_SRC_XYZ,
AMDGPU_RESET_SRC_MAX

};

struct work_struct reset_work[AMDGPU_RESET_SRC_MAX]; => An index 
for each work item



Thanks,
Lijo



It's possible though it makes harder to generalize reset_domain 
later for other drivers.
But still one caveat, look at amdgpu_recover_work_struct and it's 
usage in amdgpu_device_gpu_recover and in gpu_recover_get,
At least for debugfs i need to return back the result of GPU reset 
and so I cannot store actual work items in the array mentioned above
but rather pointers to work_item because i need a way to get back 
the return value from gpu_recover like I do now in 
amdgpu_device_gpu_recover.


You should try to avoid that as well.

See when the debugfs reset is canceled because of a scheduler reset 
you won't get a useful return code either.


What we should do instead is to cache the status of the last reset in 
the reset domain.


Regards,
Christian.



Another problem with this approach - to execute the actual GPU reset
I need access to the concrete amdgpu_device pointer from the work struct
(see xgpu_ai_mailbox_flr_work as an example). If I store all work items in
an array in amdgpu_reset_domain, the most I can retrieve is the
reset_domain struct itself, which won't help since it's dynamically
allocated, not embedded in the hive or adev, and can be one per device
or per entire hive in the XGMI case, so there is no way for me to
reach back to amdgpu_device. A back pointer to adev* from
amdgpu_reset_domain will only work for a single device but not for an
XGMI hive where there are multiple devices in a hive.


Which is exactly the reason why I think we should always allocate the 
hive structure, even if we only have one device. And a GPU reset should 
then always work with the hive data structure and not adev.


Adding a pointer from your reset work item back to the hive is then trivial.

Regards,
Christian.
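
A hedged sketch of that back pointer; the hive member list below mirrors
how the XGMI hive tracks its devices, but the struct and field names are
approximations and the reset work struct itself is invented:

#include <linux/list.h>
#include <linux/workqueue.h>
#include "amdgpu.h"
#include "amdgpu_xgmi.h"        /* struct amdgpu_hive_info */

struct sketch_hive_reset_work {
        struct work_struct base;
        struct amdgpu_hive_info *hive;  /* back pointer set at init time */
};

static void sketch_hive_reset_handler(struct work_struct *work)
{
        struct sketch_hive_reset_work *rwork =
                container_of(work, struct sketch_hive_reset_work, base);
        struct amdgpu_device *adev;

        /* Walk every device that must be reset together. */
        list_for_each_entry(adev, &rwork->hive->device_list, gmc.xgmi.head) {
                /* ... reset/recover this hive member ... */
        }
}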



Andrey






Andrey




Not 100% sure if that works, but something like that should do the 
trick.


My 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-12 Thread Christian König

Am 11.05.22 um 17:57 schrieb Andrey Grodzovsky:

[SNIP]

How about we do it like this then:

struct amdgpu_reset_domain {
 
 union {
     struct {
         struct work_item debugfs;
         struct work_item ras;
         
     };
     struct work_item array[]
 } reset_sources;
}



If it's only about static array,

enum amdgpu_reset_source {

AMDGPU_RESET_SRC_RAS,
AMDGPU_RESET_SRC_ABC,
.
AMDGPU_RESET_SRC_XYZ,
AMDGPU_RESET_SRC_MAX

};

struct work_struct reset_work[AMDGPU_RESET_SRC_MAX]; => An index 
for each work item



Thanks,
Lijo



It's possible though it makes harder to generalize reset_domain 
later for other drivers.
But still one caveat, look at amdgpu_recover_work_struct and it's 
usage in amdgpu_device_gpu_recover and in gpu_recover_get,
At least for debugfs i need to return back the result of GPU reset 
and so I cannot store actual work items in the array mentioned above
but rather pointers to work_item because i need a way to get back 
the return value from gpu_recover like I do now in 
amdgpu_device_gpu_recover.


You should try to avoid that as well.



Why ?


Because pointers need a lifetime. When the work items are allocated as
part of the structure you can be sure that they are not freed nor reused
for something else.
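
To make the lifetime point concrete, a small illustrative contrast (all
names invented):

#include <linux/workqueue.h>

static void sketch_handler(struct work_struct *work)
{
        /* placeholder handler */
}

/* (a) Work item embedded in a long-lived object: the pointer handed to
 *     the workqueue stays valid for the whole lifetime of the object. */
struct sketch_reset_domain {
        struct work_struct ras_work;    /* lives as long as the domain */
};

/* (b) Work item on the stack: legal, but the function must guarantee
 *     the item has finished before returning, otherwise the worker is
 *     left with a pointer into a dead stack frame. */
static void sketch_stack_work(void)
{
        struct work_struct w;

        INIT_WORK_ONSTACK(&w, sketch_handler);
        schedule_work(&w);
        flush_work(&w);                 /* mandatory before returning */
        destroy_work_on_stack(&w);
}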






What we should do instead is to cache the status of the last reset in 
the reset domain.



Probably an atomic ret in reset_domain then.

Another 2 points I forgot to ask -

1) What race condition did you have in mind about insertion into and
extraction from the list if it's all done under a lock?


A work item is essentially a linked list and a function to call when the 
worker thread has time to process the item. This means you now have two 
linked lists representing essentially the same.


To make it even worse those two lists are protected by two different 
locks. The work item list is protected by the worker lock and the reset 
item by our own.


Keeping all that synced up is rather racy.



2) This I asked already - why are you so opposed to allocation on the
stack? I understand the problem with dynamic memory allocations, but
why the stack? We do multiple allocations on the stack for any function
we call during GPU reset, so what is so special in this case where we
allocate the work struct on the stack? It's not like the work struct is
especially big compared to other stuff we allocate on the stack.


The problem is once more the lifetime. When the reset work items are 
allocated on the stack we absolutely need to make sure that their 
pointer is not flying around anywhere when the function returns. Keep in 
mind that a stack is just another form of memory management.


It's certainly possible to get that right, but it's just the more error 
prone approach.


Regards,
Christian.



Andrey




Regards,
Christian.



Andrey




Not 100% sure if that works, but something like that should do the 
trick.


My main concern is that I don't want to allocate the work items on 
the stack and dynamic allocation (e.g. kmalloc) is usually not 
possible either.


Additional to that putting/removing work items from a list, array 
or other container is a very common source for race conditions.


Regards,
Christian.



to the cancellation list which you showed above. In current way 
all this done automatically within reset_domain code and it's 
agnostic to specific driver and it's specific list of reset 
sources. Also in case we would want to generalize reset_domain 
to other GPU drivers (which was
a plan as far as i remember) this explicit mention of each 
reset works for cancellation is again less suitable in my opinion.


Well we could put the work item for the scheduler independent 
reset source into the reset domain as well. But I'm not sure 
those additional reset sources should be part of any common 
handling, that is largely amdgpu specific.



So it's for sure more then one source for the reasons described 
above, also note that for scheduler we already cancel delayed 
work in drm_sched_stop so calling them again in amdgpu code kind 
of superfluous.


Andrey




Christian.



Andrey






Andrey






The only difference is I chose to do the canceling right
BEFORE the HW reset and not AFTER. I did this because I see
a possible race where a new reset request is
generated exactly after we finished the HW reset but before
we canceled out all pending resets - in such a case you would
not want to cancel this 'borderline new' reset request.


Why not? Any new reset request directly after a hardware 
reset is most likely just falsely generated by the reset 
itself.


Ideally I would cancel all sources after the reset, but 
before starting any new work.


Regards,
Christian.
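
Sketched as an ordering, with placeholder helpers standing in for the
real reset, cancel and restart steps (nothing here is existing driver
code):

#include "amdgpu.h"

/* Placeholders for illustration only. */
static int sketch_do_hw_reset(struct amdgpu_device *adev) { return 0; }
static void sketch_cancel_all_reset_sources(struct amdgpu_device *adev) { }
static void sketch_restart_schedulers(struct amdgpu_device *adev) { }

static int sketch_recover(struct amdgpu_device *adev)
{
        int r;

        r = sketch_do_hw_reset(adev);

        /* Drop any reset requests raised while the HW reset ran; they
         * were most likely generated by the reset itself. */
        sketch_cancel_all_reset_sources(adev);

        /* Only now resume the schedulers, so a genuinely new reset
         * request generated from this point on is not cancelled by
         * mistake. */
        sketch_restart_schedulers(adev);

        return r;
}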




Andrey




Regards,
Christian.

You can see that if many different reset sources share the
same work struct, what can happen is that the first to
obtain the lock you describe below might opt out of a
full HW reset because its bad job did signal, for example,
or 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Andrey Grodzovsky



On 2022-05-11 11:39, Christian König wrote:

Am 11.05.22 um 17:35 schrieb Andrey Grodzovsky:

On 2022-05-11 11:20, Lazar, Lijo wrote:



On 5/11/2022 7:28 PM, Christian König wrote:

Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky:

On 2022-05-11 03:38, Christian König wrote:

Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky:

[SNIP]
E.g. in the reset code (either before or after the reset, 
that's debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
    cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
...

You don't really need to track which reset source has fired and 
which hasn't, because that would be racy again. Instead just 
bluntly reset all possible sources.


Christian.



I don't say we care if it fired once or twice (I need to add a 
fix to only insert reset work to pending reset list if it's not 
already there), the point of using list (or array) to which you 
add and from which you remove is that the logic of this is 
encapsulated within reset domain. In your way we need to be 
aware who exactly schedules reset work and explicitly cancel 
them, this also means that for any new source added in the 
future you will need to remember to add him


I don't think that this is a valid argument. Additionally to the 
schedulers we probably just need less than a handful of reset 
sources, most likely even just one or two is enough.


The only justification I can see of having additional separate 
reset sources would be if somebody wants to know if a specific 
source has been handled or not (e.g. call flush_work() or 
work_pending()). Like in the case of a reset triggered through 
debugfs.



This is indeed one reason, another is as we said before that if 
you share 'reset source' (meaning a delayed work) with another 
client (i.e. RAS and KFD) it means you make assumption that the 
other client always proceeds with the
reset exactly the same way as you expect. So today we have this 
only in scheduler vs non scheduler reset happening - non scheduler 
reset clients assume the reset is always fully executed in HW 
while scheduler based reset makes shortcuts and not always does HW 
reset hence they cannot share 'reset source' (delayed work). Yes, 
we can always add this in the future if and when such problem will 
arise but no one will remember this then and a new bug will be 
introduced and will take time to find and resolve.


Mhm, so your main concern is that we forget to correctly handle the 
new reset sources?


How about we do it like this then:

struct amdgpu_reset_domain {
 
 union {
     struct {
         struct work_item debugfs;
         struct work_item ras;
         
     };
     struct work_item array[]
 } reset_sources;
}



If it's only about static array,

enum amdgpu_reset_source {

AMDGPU_RESET_SRC_RAS,
AMDGPU_RESET_SRC_ABC,
.
AMDGPU_RESET_SRC_XYZ,
AMDGPU_RESET_SRC_MAX

};

struct work_struct reset_work[AMDGPU_RESET_SRC_MAX]; => An index for 
each work item



Thanks,
Lijo



It's possible though it makes harder to generalize reset_domain later 
for other drivers.
But still one caveat, look at amdgpu_recover_work_struct and it's 
usage in amdgpu_device_gpu_recover and in gpu_recover_get,
At least for debugfs i need to return back the result of GPU reset 
and so I cannot store actual work items in the array mentioned above
but rather pointers to work_item because i need a way to get back the 
return value from gpu_recover like I do now in 
amdgpu_device_gpu_recover.


You should try to avoid that as well.

See when the debugfs reset is canceled because of a scheduler reset 
you won't get a useful return code either.


What we should do instead is to cache the status of the last reset in 
the reset domain.


Regards,
Christian.



Another problem with this approach - to execute the actual GPU reset I
need access to the concrete amdgpu_device pointer from the work struct (see
xgpu_ai_mailbox_flr_work as an example). If I store all work items in
an array in amdgpu_reset_domain, the most I can retrieve is the
reset_domain struct itself, which won't help since it's dynamically
allocated, not embedded in the hive or adev, and can be one per device or
per entire hive in the XGMI case, so there is no way for me to reach
back to amdgpu_device. A back pointer to adev* from amdgpu_reset_domain
will only work for a single device but not for an XGMI hive where there are
multiple devices in a hive.


Andrey






Andrey




Not 100% sure if that works, but something like that should do the 
trick.


My main concern is that I don't want to allocate the work items on 
the stack and dynamic allocation (e.g. kmalloc) is usually not 
possible either.


Additional to that putting/removing work items from a list, array 
or other container is a very common source for race conditions.


Regards,
Christian.



to the cancellation list which you showed above. In 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Andrey Grodzovsky



On 2022-05-11 11:39, Christian König wrote:

Am 11.05.22 um 17:35 schrieb Andrey Grodzovsky:

On 2022-05-11 11:20, Lazar, Lijo wrote:



On 5/11/2022 7:28 PM, Christian König wrote:

Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky:

On 2022-05-11 03:38, Christian König wrote:

Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky:

[SNIP]
E.g. in the reset code (either before or after the reset, 
that's debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
    cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
...

You don't really need to track which reset source has fired and 
which hasn't, because that would be racy again. Instead just 
bluntly reset all possible sources.


Christian.



I don't say we care if it fired once or twice (I need to add a 
fix to only insert reset work to pending reset list if it's not 
already there), the point of using list (or array) to which you 
add and from which you remove is that the logic of this is 
encapsulated within reset domain. In your way we need to be 
aware who exactly schedules reset work and explicitly cancel 
them, this also means that for any new source added in the 
future you will need to remember to add him


I don't think that this is a valid argument. Additionally to the 
schedulers we probably just need less than a handful of reset 
sources, most likely even just one or two is enough.


The only justification I can see of having additional separate 
reset sources would be if somebody wants to know if a specific 
source has been handled or not (e.g. call flush_work() or 
work_pending()). Like in the case of a reset triggered through 
debugfs.



This is indeed one reason, another is as we said before that if 
you share 'reset source' (meaning a delayed work) with another 
client (i.e. RAS and KFD) it means you make assumption that the 
other client always proceeds with the
reset exactly the same way as you expect. So today we have this 
only in scheduler vs non scheduler reset happening - non scheduler 
reset clients assume the reset is always fully executed in HW 
while scheduler based reset makes shortcuts and not always does HW 
reset hence they cannot share 'reset source' (delayed work). Yes, 
we can always add this in the future if and when such problem will 
arise but no one will remember this then and a new bug will be 
introduced and will take time to find and resolve.


Mhm, so your main concern is that we forget to correctly handle the 
new reset sources?


How about we do it like this then:

struct amdgpu_reset_domain {
 
 union {
     struct {
         struct work_item debugfs;
         struct work_item ras;
         
     };
     struct work_item array[]
 } reset_sources;
}



If it's only about static array,

enum amdgpu_reset_source {

AMDGPU_RESET_SRC_RAS,
AMDGPU_RESET_SRC_ABC,
.
AMDGPU_RESET_SRC_XYZ,
AMDGPU_RESET_SRC_MAX

};

struct work_struct reset_work[AMDGPU_RESET_SRC_MAX]; => An index for 
each work item



Thanks,
Lijo



It's possible though it makes harder to generalize reset_domain later 
for other drivers.
But still one caveat, look at amdgpu_recover_work_struct and it's 
usage in amdgpu_device_gpu_recover and in gpu_recover_get,
At least for debugfs i need to return back the result of GPU reset 
and so I cannot store actual work items in the array mentioned above
but rather pointers to work_item because i need a way to get back the 
return value from gpu_recover like I do now in 
amdgpu_device_gpu_recover.


You should try to avoid that as well.



Why ?




See when the debugfs reset is canceled because of a scheduler reset 
you won't get a useful return code either.



That's true.



What we should do instead is to cache the status of the last reset in 
the reset domain.



Probably an atomic ret in reset_domain then.

Another 2 points I forgot to ask -

1) What race condition did you have in mind about insertion into and
extraction from the list if it's all done under a lock?


2) This I asked already - why are you so opposed to allocation on the
stack? I understand the problem with dynamic memory allocations, but why
the stack? We do multiple allocations on the stack for any function
we call during GPU reset, so what is so special in this case where we
allocate the work struct on the stack? It's not like the work struct is
especially big compared to other stuff we allocate on the stack.


Andrey




Regards,
Christian.



Andrey




Not 100% sure if that works, but something like that should do the 
trick.


My main concern is that I don't want to allocate the work items on 
the stack and dynamic allocation (e.g. kmalloc) is usually not 
possible either.


Additional to that putting/removing work items from a list, array 
or other container is a very common source for race conditions.


Regards,
Christian.



to the cancellation list which you showed above. In current way 
all this done 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Andrey Grodzovsky



On 2022-05-11 11:46, Lazar, Lijo wrote:



On 5/11/2022 9:13 PM, Andrey Grodzovsky wrote:


On 2022-05-11 11:37, Lazar, Lijo wrote:



On 5/11/2022 9:05 PM, Andrey Grodzovsky wrote:


On 2022-05-11 11:20, Lazar, Lijo wrote:



On 5/11/2022 7:28 PM, Christian König wrote:

Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky:

On 2022-05-11 03:38, Christian König wrote:

Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky:

[SNIP]
E.g. in the reset code (either before or after the reset, 
that's debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
...

You don't really need to track which reset source has fired 
and which hasn't, because that would be racy again. Instead 
just bluntly reset all possible sources.


Christian.



I don't say we care if it fired once or twice (I need to add a 
fix to only insert reset work to pending reset list if it's 
not already there), the point of using list (or array) to 
which you add and from which you remove is that the logic of 
this is encapsulated within reset domain. In your way we need 
to be aware who exactly schedules reset work and explicitly 
cancel them, this also means that for any new source added in 
the future you will need to remember to add him


I don't think that this is a valid argument. Additionally to 
the schedulers we probably just need less than a handful of 
reset sources, most likely even just one or two is enough.


The only justification I can see of having additional separate 
reset sources would be if somebody wants to know if a specific 
source has been handled or not (e.g. call flush_work() or 
work_pending()). Like in the case of a reset triggered through 
debugfs.



This is indeed one reason, another is as we said before that if 
you share 'reset source' (meaning a delayed work) with another 
client (i.e. RAS and KFD) it means you make assumption that the 
other client always proceeds with the
reset exactly the same way as you expect. So today we have this 
only in scheduler vs non scheduler reset happening - non 
scheduler reset clients assume the reset is always fully 
executed in HW while scheduler based reset makes shortcuts and 
not always does HW reset hence they cannot share 'reset source' 
(delayed work). Yes, we can always add this in the future if and 
when such problem will arise but no one will remember this then 
and a new bug will be introduced and will take time to find and 
resolve.


Mhm, so your main concern is that we forget to correctly handle 
the new reset sources?


How about we do it like this then:

struct amdgpu_reset_domain {
 
 union {
     struct {
         struct work_item debugfs;
         struct work_item ras;
         
     };
     struct work_item array[]
 } reset_sources;
}



If it's only about static array,

enum amdgpu_reset_source {

AMDGPU_RESET_SRC_RAS,
AMDGPU_RESET_SRC_ABC,
.
AMDGPU_RESET_SRC_XYZ,
AMDGPU_RESET_SRC_MAX

};

struct work_struct reset_work[AMDGPU_RESET_SRC_MAX]; => An index 
for each work item



Thanks,
Lijo



It's possible though it makes harder to generalize reset_domain 
later for other drivers.


The current reset domain queue design is not good for a
hierarchical reset within amdgpu itself :)


Thanks,
Lijo



Not sure what do you mean ?



It's tied to the TDR queue in the scheduler.

Hierarchical model - start from a reset of the lowest level nodes and on
failure try a higher level reset. This model doesn't suit that.


Thanks,
Lijo



The TDR queue just provides a single-threaded context to execute all
resets. It has no restrictions on what exactly you reset within each
work item on the queue,
so I still don't see a problem. I also don't understand what lower
level vs higher level nodes are in our case. Is it a single node vs the hive?


Andrey
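
If it helps the discussion, a sketch of what such an escalating model
could look like when run inside one work item on the single-threaded
reset queue; the levels and the helper are invented and nothing here
reflects existing amdgpu code:

#include <linux/errno.h>

enum sketch_reset_level {
        SKETCH_RESET_QUEUE,     /* narrowest: a single queue/pipe */
        SKETCH_RESET_DEVICE,    /* one adev */
        SKETCH_RESET_HIVE,      /* the whole XGMI hive */
        SKETCH_RESET_MAX
};

struct sketch_reset_domain;     /* opaque here */

/* Placeholder: perform a reset at the given scope, 0 on success. */
static int sketch_reset_at_level(struct sketch_reset_domain *domain,
                                 enum sketch_reset_level lvl)
{
        return 0;
}

/* Try the narrowest reset first and escalate only on failure. */
static int sketch_escalating_reset(struct sketch_reset_domain *domain)
{
        enum sketch_reset_level lvl;
        int r = -EINVAL;

        for (lvl = SKETCH_RESET_QUEUE; lvl < SKETCH_RESET_MAX; ++lvl) {
                r = sketch_reset_at_level(domain, lvl);
                if (!r)
                        break;
        }
        return r;
}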





Andrey




But still one caveat: look at amdgpu_recover_work_struct and its 
usage in amdgpu_device_gpu_recover and in gpu_recover_get. 
At least for debugfs I need to return the result of the GPU reset, 
and so I cannot store actual work items in the array mentioned above 
but rather pointers to work_item, because I need a way to get back 
the return value from gpu_recover, like I do now in 
amdgpu_device_gpu_recover.


Andrey
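
For context, the pattern being referred to is roughly the one below: wrap the work item together with its parameters and return value so the synchronous caller can read the result after flush_work(). The names approximate the code under discussion and are not a verbatim copy:

/* Approximation of the existing wrapper, not the actual code. */
struct amdgpu_recover_work_struct {
	struct work_struct	base;
	struct amdgpu_device	*adev;
	struct amdgpu_job	*job;
	int			ret;
};

static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
{
	struct amdgpu_recover_work_struct *rw =
		container_of(work, struct amdgpu_recover_work_struct, base);

	/* call the underlying recover implementation and keep the result */
	rw->ret = amdgpu_device_gpu_recover_imp(rw->adev, rw->job);
}

int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
			      struct amdgpu_job *job)
{
	struct amdgpu_recover_work_struct work = { .adev = adev, .job = job };

	INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);

	/* queue on the reset domain and wait so the result can be returned */
	if (!queue_work(adev->reset_domain->wq, &work.base))
		return -EAGAIN;

	flush_work(&work.base);
	return work.ret;
}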




Not 100% sure if that works, but something like that should do 
the trick.

My main concern is that I don't want to allocate the work items 
on the stack, and dynamic allocation (e.g. kmalloc) is usually not 
possible either.

In addition to that, adding/removing work items to/from a list, 
array or other container is a very common source of race conditions.


Regards,
Christian.



to the cancellation list which you showed above. In the current 
way all of this is done automatically within the reset_domain code, 
and it's agnostic to the specific driver and its specific list of 
reset sources. Also in case we would want to 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Lazar, Lijo




On 5/11/2022 9:13 PM, Andrey Grodzovsky wrote:


On 2022-05-11 11:37, Lazar, Lijo wrote:



On 5/11/2022 9:05 PM, Andrey Grodzovsky wrote:


On 2022-05-11 11:20, Lazar, Lijo wrote:



On 5/11/2022 7:28 PM, Christian König wrote:

Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky:

On 2022-05-11 03:38, Christian König wrote:

Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky:

[SNIP]
E.g. in the reset code (either before or after the reset, 
that's debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
    cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
...

You don't really need to track which reset source has fired and 
which hasn't, because that would be racy again. Instead just 
bluntly reset all possible sources.


Christian.



I don't say we care if it fired once or twice (I need to add a 
fix to only insert reset work to pending reset list if it's not 
already there), the point of using list (or array) to which you 
add and from which you remove is that the logic of this is 
encapsulated within reset domain. In your way we need to be 
aware who exactly schedules reset work and explicitly cancel 
them, this also means that for any new source added in the 
future you will need to remember to add him


I don't think that this is a valid argument. Additionally to the 
schedulers we probably just need less than a handful of reset 
sources, most likely even just one or two is enough.


The only justification I can see of having additional separate 
reset sources would be if somebody wants to know if a specific 
source has been handled or not (e.g. call flush_work() or 
work_pending()). Like in the case of a reset triggered through 
debugfs.



This is indeed one reason, another is as we said before that if 
you share 'reset source' (meaning a delayed work) with another 
client (i.e. RAS and KFD) it means you make assumption that the 
other client always proceeds with the
reset exactly the same way as you expect. So today we have this 
only in scheduler vs non scheduler reset happening - non scheduler 
reset clients assume the reset is always fully executed in HW 
while scheduler based reset makes shortcuts and not always does HW 
reset hence they cannot share 'reset source' (delayed work). Yes, 
we can always add this in the future if and when such problem will 
arise but no one will remember this then and a new bug will be 
introduced and will take time to find and resolve.


Mhm, so your main concern is that we forget to correctly handle the 
new reset sources?


How about we do it like this then:

struct amdgpu_reset_domain {
 
 union {
     struct {
         struct work_item debugfs;
         struct work_item ras;
         
     };
     struct work_item array[]
 } reset_sources;
}



If it's only about static array,

enum amdgpu_reset_source {

AMDGPU_RESET_SRC_RAS,
AMDGPU_RESET_SRC_ABC,
.
AMDGPU_RESET_SRC_XYZ,
AMDGPU_RESET_SRC_MAX

};

struct work_struct reset_work[AMDGPU_RESET_SRC_MAX]; => An index for 
each work item



Thanks,
Lijo



It's possible though it makes harder to generalize reset_domain later 
for other drivers.


The current reset domain queue design is not good for a hierarchical 
reset within amdgpu itself :)


Thanks,
Lijo



Not sure what do you mean ?



It's tied to the TDR queue in scheduler.

Hierarchical model - start with a reset of the lowest-level nodes and, 
on failure, try a higher-level reset. This model doesn't suit that.


Thanks,
Lijo


Andrey




But still one caveat, look at amdgpu_recover_work_struct and it's 
usage in amdgpu_device_gpu_recover and in gpu_recover_get,
At least for debugfs i need to return back the result of GPU reset 
and so I cannot store actual work items in the array mentioned above
but rather pointers to work_item because i need a way to get back the 
return value from gpu_recover like I do now in 
amdgpu_device_gpu_recover.


Andrey




Not 100% sure if that works, but something like that should do the 
trick.


My main concern is that I don't want to allocate the work items on 
the stack and dynamic allocation (e.g. kmalloc) is usually not 
possible either.


Additional to that putting/removing work items from a list, array 
or other container is a very common source for race conditions.


Regards,
Christian.



to the cancellation list which you showed above. In current way 
all this done automatically within reset_domain code and it's 
agnostic to specific driver and it's specific list of reset 
sources. Also in case we would want to generalize reset_domain 
to other GPU drivers (which was
a plan as far as i remember) this explicit mention of each reset 
works for cancellation is again less suitable in my opinion.


Well we could put the work item for the scheduler independent 
reset source into the reset domain as well. But I'm not sure 
those additional reset sources should be part of 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Andrey Grodzovsky



On 2022-05-11 11:37, Lazar, Lijo wrote:



On 5/11/2022 9:05 PM, Andrey Grodzovsky wrote:


On 2022-05-11 11:20, Lazar, Lijo wrote:



On 5/11/2022 7:28 PM, Christian König wrote:

Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky:

On 2022-05-11 03:38, Christian König wrote:

Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky:

[SNIP]
E.g. in the reset code (either before or after the reset, 
that's debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
    cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
...

You don't really need to track which reset source has fired and 
which hasn't, because that would be racy again. Instead just 
bluntly reset all possible sources.


Christian.



I don't say we care if it fired once or twice (I need to add a 
fix to only insert reset work to pending reset list if it's not 
already there), the point of using list (or array) to which you 
add and from which you remove is that the logic of this is 
encapsulated within reset domain. In your way we need to be 
aware who exactly schedules reset work and explicitly cancel 
them, this also means that for any new source added in the 
future you will need to remember to add him


I don't think that this is a valid argument. Additionally to the 
schedulers we probably just need less than a handful of reset 
sources, most likely even just one or two is enough.


The only justification I can see of having additional separate 
reset sources would be if somebody wants to know if a specific 
source has been handled or not (e.g. call flush_work() or 
work_pending()). Like in the case of a reset triggered through 
debugfs.



This is indeed one reason, another is as we said before that if 
you share 'reset source' (meaning a delayed work) with another 
client (i.e. RAS and KFD) it means you make assumption that the 
other client always proceeds with the
reset exactly the same way as you expect. So today we have this 
only in scheduler vs non scheduler reset happening - non scheduler 
reset clients assume the reset is always fully executed in HW 
while scheduler based reset makes shortcuts and not always does HW 
reset hence they cannot share 'reset source' (delayed work). Yes, 
we can always add this in the future if and when such problem will 
arise but no one will remember this then and a new bug will be 
introduced and will take time to find and resolve.


Mhm, so your main concern is that we forget to correctly handle the 
new reset sources?


How about we do it like this then:

struct amdgpu_reset_domain {
 
 union {
     struct {
         struct work_item debugfs;
         struct work_item ras;
         
     };
     struct work_item array[]
 } reset_sources;
}



If it's only about static array,

enum amdgpu_reset_source {

AMDGPU_RESET_SRC_RAS,
AMDGPU_RESET_SRC_ABC,
.
AMDGPU_RESET_SRC_XYZ,
AMDGPU_RESET_SRC_MAX

};

struct work_struct reset_work[AMDGPU_RESET_SRC_MAX]; => An index for 
each work item



Thanks,
Lijo



It's possible though it makes harder to generalize reset_domain later 
for other drivers.


The current reset domain queue design is not good for a hierarchical 
reset within amdgpu itself :)


Thanks,
Lijo



Not sure what do you mean ?

Andrey




But still one caveat, look at amdgpu_recover_work_struct and it's 
usage in amdgpu_device_gpu_recover and in gpu_recover_get,
At least for debugfs i need to return back the result of GPU reset 
and so I cannot store actual work items in the array mentioned above
but rather pointers to work_item because i need a way to get back the 
return value from gpu_recover like I do now in 
amdgpu_device_gpu_recover.


Andrey




Not 100% sure if that works, but something like that should do the 
trick.


My main concern is that I don't want to allocate the work items on 
the stack and dynamic allocation (e.g. kmalloc) is usually not 
possible either.


Additional to that putting/removing work items from a list, array 
or other container is a very common source for race conditions.


Regards,
Christian.



to the cancellation list which you showed above. In current way 
all this done automatically within reset_domain code and it's 
agnostic to specific driver and it's specific list of reset 
sources. Also in case we would want to generalize reset_domain 
to other GPU drivers (which was
a plan as far as i remember) this explicit mention of each reset 
works for cancellation is again less suitable in my opinion.


Well we could put the work item for the scheduler independent 
reset source into the reset domain as well. But I'm not sure 
those additional reset sources should be part of any common 
handling, that is largely amdgpu specific.



So it's for sure more than one source, for the reasons described 
above; also note that for the scheduler we already cancel the delayed 
work in drm_sched_stop, so calling them again in amdgpu code kind 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Christian König

Am 11.05.22 um 17:35 schrieb Andrey Grodzovsky:

On 2022-05-11 11:20, Lazar, Lijo wrote:



On 5/11/2022 7:28 PM, Christian König wrote:

Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky:

On 2022-05-11 03:38, Christian König wrote:

Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky:

[SNIP]
E.g. in the reset code (either before or after the reset, that's 
debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
    cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
...

You don't really need to track which reset source has fired and 
which hasn't, because that would be racy again. Instead just 
bluntly reset all possible sources.


Christian.



I don't say we care if it fired once or twice (I need to add a 
fix to only insert reset work to pending reset list if it's not 
already there), the point of using list (or array) to which you 
add and from which you remove is that the logic of this is 
encapsulated within reset domain. In your way we need to be aware 
who exactly schedules reset work and explicitly cancel them, this 
also means that for any new source added in the future you will 
need to remember to add him


I don't think that this is a valid argument. Additionally to the 
schedulers we probably just need less than a handful of reset 
sources, most likely even just one or two is enough.


The only justification I can see of having additional separate 
reset sources would be if somebody wants to know if a specific 
source has been handled or not (e.g. call flush_work() or 
work_pending()). Like in the case of a reset triggered through 
debugfs.



This is indeed one reason, another is as we said before that if you 
share 'reset source' (meaning a delayed work) with another client 
(i.e. RAS and KFD) it means you make assumption that the other 
client always proceeds with the
reset exactly the same way as you expect. So today we have this 
only in scheduler vs non scheduler reset happening - non scheduler 
reset clients assume the reset is always fully executed in HW while 
scheduler based reset makes shortcuts and not always does HW reset 
hence they cannot share 'reset source' (delayed work). Yes, we can 
always add this in the future if and when such problem will arise 
but no one will remember this then and a new bug will be introduced 
and will take time to find and resolve.


Mhm, so your main concern is that we forget to correctly handle the 
new reset sources?


How about we do it like this then:

struct amdgpu_reset_domain {
 
 union {
     struct {
         struct work_item debugfs;
         struct work_item ras;
         
     };
     struct work_item array[]
 } reset_sources;
}



If it's only about static array,

enum amdgpu_reset_source {

AMDGPU_RESET_SRC_RAS,
AMDGPU_RESET_SRC_ABC,
.
AMDGPU_RESET_SRC_XYZ,
AMDGPU_RESET_SRC_MAX

};

struct work_struct reset_work[AMDGPU_RESET_SRC_MAX]; => An index for 
each work item



Thanks,
Lijo



It's possible though it makes harder to generalize reset_domain later 
for other drivers.
But still one caveat, look at amdgpu_recover_work_struct and it's 
usage in amdgpu_device_gpu_recover and in gpu_recover_get,
At least for debugfs i need to return back the result of GPU reset and 
so I cannot store actual work items in the array mentioned above
but rather pointers to work_item because i need a way to get back the 
return value from gpu_recover like I do now in amdgpu_device_gpu_recover.


You should try to avoid that as well.

See, when the debugfs reset is canceled because of a scheduler reset, 
you won't get a useful return code either.


What we should do instead is to cache the status of the last reset in 
the reset domain.


Regards,
Christian.
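
A sketch of what caching the status in the reset domain could look like; the reset_res field and the helper names are assumptions made for the example:

/* Sketch: cache the outcome of the last reset in the domain. */
struct amdgpu_reset_domain {
	struct workqueue_struct	*wq;
	atomic_t		reset_res;	/* assumed field, not verbatim */
};

/* Called at the end of the actual reset work handler. */
static void amdgpu_reset_domain_store_result(struct amdgpu_reset_domain *domain,
					     int r)
{
	atomic_set(&domain->reset_res, r);
}

/* debugfs (or any other caller whose own request got cancelled) can
 * still report how the last reset on this domain went. */
static int amdgpu_reset_domain_last_result(struct amdgpu_reset_domain *domain)
{
	return atomic_read(&domain->reset_res);
}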



Andrey




Not 100% sure if that works, but something like that should do the 
trick.


My main concern is that I don't want to allocate the work items on 
the stack and dynamic allocation (e.g. kmalloc) is usually not 
possible either.


Additional to that putting/removing work items from a list, array or 
other container is a very common source for race conditions.


Regards,
Christian.



to the cancellation list which you showed above. In current way 
all this done automatically within reset_domain code and it's 
agnostic to specific driver and it's specific list of reset 
sources. Also in case we would want to generalize reset_domain to 
other GPU drivers (which was
a plan as far as i remember) this explicit mention of each reset 
works for cancellation is again less suitable in my opinion.


Well we could put the work item for the scheduler independent 
reset source into the reset domain as well. But I'm not sure those 
additional reset sources should be part of any common handling, 
that is largely amdgpu specific.



So it's for sure more then one source for the reasons described 
above, also note that for scheduler we already cancel delayed work 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Lazar, Lijo




On 5/11/2022 9:05 PM, Andrey Grodzovsky wrote:


On 2022-05-11 11:20, Lazar, Lijo wrote:



On 5/11/2022 7:28 PM, Christian König wrote:

Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky:

On 2022-05-11 03:38, Christian König wrote:

Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky:

[SNIP]
E.g. in the reset code (either before or after the reset, that's 
debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
    cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
...

You don't really need to track which reset source has fired and 
which hasn't, because that would be racy again. Instead just 
bluntly reset all possible sources.


Christian.



I don't say we care if it fired once or twice (I need to add a fix 
to only insert reset work to pending reset list if it's not 
already there), the point of using list (or array) to which you 
add and from which you remove is that the logic of this is 
encapsulated within reset domain. In your way we need to be aware 
who exactly schedules reset work and explicitly cancel them, this 
also means that for any new source added in the future you will 
need to remember to add him


I don't think that this is a valid argument. Additionally to the 
schedulers we probably just need less than a handful of reset 
sources, most likely even just one or two is enough.


The only justification I can see of having additional separate 
reset sources would be if somebody wants to know if a specific 
source has been handled or not (e.g. call flush_work() or 
work_pending()). Like in the case of a reset triggered through 
debugfs.



This is indeed one reason, another is as we said before that if you 
share 'reset source' (meaning a delayed work) with another client 
(i.e. RAS and KFD) it means you make assumption that the other 
client always proceeds with the
reset exactly the same way as you expect. So today we have this only 
in scheduler vs non scheduler reset happening - non scheduler reset 
clients assume the reset is always fully executed in HW while 
scheduler based reset makes shortcuts and not always does HW reset 
hence they cannot share 'reset source' (delayed work). Yes, we can 
always add this in the future if and when such problem will arise 
but no one will remember this then and a new bug will be introduced 
and will take time to find and resolve.


Mhm, so your main concern is that we forget to correctly handle the 
new reset sources?


How about we do it like this then:

struct amdgpu_reset_domain {
 
 union {
     struct {
         struct work_item debugfs;
         struct work_item ras;
         
     };
     struct work_item array[]
 } reset_sources;
}



If it's only about static array,

enum amdgpu_reset_source {

AMDGPU_RESET_SRC_RAS,
AMDGPU_RESET_SRC_ABC,
.
AMDGPU_RESET_SRC_XYZ,
AMDGPU_RESET_SRC_MAX

};

struct work_struct reset_work[AMDGPU_RESET_SRC_MAX]; => An index for 
each work item



Thanks,
Lijo



It's possible though it makes harder to generalize reset_domain later 
for other drivers.


The current reset domain queue design is not good for a hierarchical 
reset within amdgpu itself :)


Thanks,
Lijo

But still one caveat, look at amdgpu_recover_work_struct and it's usage 
in amdgpu_device_gpu_recover and in gpu_recover_get,
At least for debugfs i need to return back the result of GPU reset and 
so I cannot store actual work items in the array mentioned above
but rather pointers to work_item because i need a way to get back the 
return value from gpu_recover like I do now in amdgpu_device_gpu_recover.


Andrey




Not 100% sure if that works, but something like that should do the 
trick.


My main concern is that I don't want to allocate the work items on 
the stack and dynamic allocation (e.g. kmalloc) is usually not 
possible either.


Additional to that putting/removing work items from a list, array or 
other container is a very common source for race conditions.


Regards,
Christian.



to the cancellation list which you showed above. In current way 
all this done automatically within reset_domain code and it's 
agnostic to specific driver and it's specific list of reset 
sources. Also in case we would want to generalize reset_domain to 
other GPU drivers (which was
a plan as far as i remember) this explicit mention of each reset 
works for cancellation is again less suitable in my opinion.


Well we could put the work item for the scheduler independent reset 
source into the reset domain as well. But I'm not sure those 
additional reset sources should be part of any common handling, 
that is largely amdgpu specific.



So it's for sure more than one source, for the reasons described 
above; also note that for the scheduler we already cancel the delayed 
work in drm_sched_stop, so calling it again in amdgpu code is kind of 
superfluous.


Andrey




Christian.



Andrey






Andrey






The only 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Andrey Grodzovsky



On 2022-05-11 11:20, Lazar, Lijo wrote:



On 5/11/2022 7:28 PM, Christian König wrote:

Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky:

On 2022-05-11 03:38, Christian König wrote:

Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky:

[SNIP]
E.g. in the reset code (either before or after the reset, that's 
debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
    cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
...

You don't really need to track which reset source has fired and 
which hasn't, because that would be racy again. Instead just 
bluntly reset all possible sources.


Christian.



I don't say we care if it fired once or twice (I need to add a fix 
to only insert reset work to pending reset list if it's not 
already there), the point of using list (or array) to which you 
add and from which you remove is that the logic of this is 
encapsulated within reset domain. In your way we need to be aware 
who exactly schedules reset work and explicitly cancel them, this 
also means that for any new source added in the future you will 
need to remember to add him


I don't think that this is a valid argument. Additionally to the 
schedulers we probably just need less than a handful of reset 
sources, most likely even just one or two is enough.


The only justification I can see of having additional separate 
reset sources would be if somebody wants to know if a specific 
source has been handled or not (e.g. call flush_work() or 
work_pending()). Like in the case of a reset triggered through 
debugfs.



This is indeed one reason, another is as we said before that if you 
share 'reset source' (meaning a delayed work) with another client 
(i.e. RAS and KFD) it means you make assumption that the other 
client always proceeds with the
reset exactly the same way as you expect. So today we have this only 
in scheduler vs non scheduler reset happening - non scheduler reset 
clients assume the reset is always fully executed in HW while 
scheduler based reset makes shortcuts and not always does HW reset 
hence they cannot share 'reset source' (delayed work). Yes, we can 
always add this in the future if and when such problem will arise 
but no one will remember this then and a new bug will be introduced 
and will take time to find and resolve.


Mhm, so your main concern is that we forget to correctly handle the 
new reset sources?


How about we do it like this then:

struct amdgpu_reset_domain {
 
 union {
     struct {
         struct work_item debugfs;
         struct work_item ras;
         
     };
     struct work_item array[]
 } reset_sources;
}



If it's only about static array,

enum amdgpu_reset_source {

AMDGPU_RESET_SRC_RAS,
AMDGPU_RESET_SRC_ABC,
.
AMDGPU_RESET_SRC_XYZ,
AMDGPU_RESET_SRC_MAX

};

struct work_struct reset_work[AMDGPU_RESET_SRC_MAX]; => An index for 
each work item



Thanks,
Lijo



It's possible though it makes harder to generalize reset_domain later 
for other drivers.
But still one caveat, look at amdgpu_recover_work_struct and it's usage 
in amdgpu_device_gpu_recover and in gpu_recover_get,
At least for debugfs i need to return back the result of GPU reset and 
so I cannot store actual work items in the array mentioned above
but rather pointers to work_item because i need a way to get back the 
return value from gpu_recover like I do now in amdgpu_device_gpu_recover.


Andrey




Not 100% sure if that works, but something like that should do the 
trick.


My main concern is that I don't want to allocate the work items on 
the stack and dynamic allocation (e.g. kmalloc) is usually not 
possible either.


Additional to that putting/removing work items from a list, array or 
other container is a very common source for race conditions.


Regards,
Christian.



to the cancellation list which you showed above. In current way 
all this done automatically within reset_domain code and it's 
agnostic to specific driver and it's specific list of reset 
sources. Also in case we would want to generalize reset_domain to 
other GPU drivers (which was
a plan as far as i remember) this explicit mention of each reset 
works for cancellation is again less suitable in my opinion.


Well we could put the work item for the scheduler independent reset 
source into the reset domain as well. But I'm not sure those 
additional reset sources should be part of any common handling, 
that is largely amdgpu specific.



So it's for sure more then one source for the reasons described 
above, also note that for scheduler we already cancel delayed work 
in drm_sched_stop so calling them again in amdgpu code kind of 
superfluous.


Andrey




Christian.



Andrey






Andrey






The only difference is I chose to do the canceling right 
BEFORE the HW reset and not AFTER. I did this because I see a 
possible race where a new reset request is being generated 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Lazar, Lijo




On 5/11/2022 7:28 PM, Christian König wrote:

Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky:

On 2022-05-11 03:38, Christian König wrote:

Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky:

[SNIP]
E.g. in the reset code (either before or after the reset, that's 
debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
    cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
...

You don't really need to track which reset source has fired and 
which hasn't, because that would be racy again. Instead just 
bluntly reset all possible sources.


Christian.



I don't say we care if it fired once or twice (I need to add a fix 
to only insert reset work to pending reset list if it's not already 
there), the point of using list (or array) to which you add and from 
which you remove is that the logic of this is encapsulated within 
reset domain. In your way we need to be aware who exactly schedules 
reset work and explicitly cancel them, this also means that for any 
new source added in the future you will need to remember to add him


I don't think that this is a valid argument. Additionally to the 
schedulers we probably just need less than a handful of reset 
sources, most likely even just one or two is enough.


The only justification I can see of having additional separate reset 
sources would be if somebody wants to know if a specific source has 
been handled or not (e.g. call flush_work() or work_pending()). Like 
in the case of a reset triggered through debugfs.



This is indeed one reason, another is as we said before that if you 
share 'reset source' (meaning a delayed work) with another client 
(i.e. RAS and KFD) it means you make assumption that the other client 
always proceeds with the
reset exactly the same way as you expect. So today we have this only 
in scheduler vs non scheduler reset happening - non scheduler reset 
clients assume the reset is always fully executed in HW while 
scheduler based reset makes shortcuts and not always does HW reset 
hence they cannot share 'reset source' (delayed work). Yes, we can 
always add this in the future if and when such problem will arise but 
no one will remember this then and a new bug will be introduced and 
will take time to find and resolve.


Mhm, so your main concern is that we forget to correctly handle the new 
reset sources?


How about we do it like this then:

struct amdgpu_reset_domain {
     
     union {
         struct {
             struct work_item debugfs;
             struct work_item ras;
             
         };
         struct work_item array[]
     } reset_sources;
}



If it's only about static array,

enum amdgpu_reset_source {

AMDGPU_RESET_SRC_RAS,
AMDGPU_RESET_SRC_ABC,
.
AMDGPU_RESET_SRC_XYZ,
AMDGPU_RESET_SRC_MAX

};

struct work_struct reset_work[AMDGPU_RESET_SRC_MAX]; => An index for 
each work item



Thanks,
Lijo


Not 100% sure if that works, but something like that should do the trick.

My main concern is that I don't want to allocate the work items on the 
stack and dynamic allocation (e.g. kmalloc) is usually not possible either.


Additional to that putting/removing work items from a list, array or 
other container is a very common source for race conditions.


Regards,
Christian.



to the cancellation list which you showed above. In current way all 
this done automatically within reset_domain code and it's agnostic 
to specific driver and it's specific list of reset sources. Also in 
case we would want to generalize reset_domain to other GPU drivers 
(which was
a plan as far as i remember) this explicit mention of each reset 
works for cancellation is again less suitable in my opinion.


Well we could put the work item for the scheduler independent reset 
source into the reset domain as well. But I'm not sure those 
additional reset sources should be part of any common handling, that 
is largely amdgpu specific.



So it's for sure more then one source for the reasons described above, 
also note that for scheduler we already cancel delayed work in 
drm_sched_stop so calling them again in amdgpu code kind of superfluous.


Andrey




Christian.



Andrey






Andrey






The only difference is I chose to do the canceling right BEFORE 
the HW reset and not AFTER. I did this because I see a possible 
race where a new reset request is generated exactly after 
we finished the HW reset but before we canceled all pending 
resets; in such a case you would not want to cancel this 
'borderline new' reset request.


Why not? Any new reset request directly after a hardware reset is 
most likely just falsely generated by the reset itself.


Ideally I would cancel all sources after the reset, but before 
starting any new work.


Regards,
Christian.
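
Spelled out as code, that ordering would be roughly the following; every helper here is a placeholder for the corresponding step of the real reset path:

/* Ordering sketch only: reset first, then drop the now-stale
 * requests, and only then allow new work (and new timeouts) again. */
static void amdgpu_reset_domain_do_reset(struct amdgpu_device *adev)
{
	amdgpu_device_pre_reset(adev);		/* stop/park schedulers     */
	amdgpu_device_hw_reset(adev);		/* the actual ASIC reset    */
	amdgpu_cancel_all_reset_sources(adev);	/* see the earlier sketch   */
	amdgpu_device_post_reset(adev);		/* restart schedulers, etc. */
}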




Andrey




Regards,
Christian.

You can see that if many different reset sources share same 
work struct what can happen is that the first to obtain the 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Christian König

Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky:

On 2022-05-11 03:38, Christian König wrote:

Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky:

[SNIP]
E.g. in the reset code (either before or after the reset, that's 
debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
    cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
...

You don't really need to track which reset source has fired and 
which hasn't, because that would be racy again. Instead just 
bluntly reset all possible sources.


Christian.



I don't say we care if it fired once or twice (I need to add a fix 
to only insert reset work to pending reset list if it's not already 
there), the point of using list (or array) to which you add and from 
which you remove is that the logic of this is encapsulated within 
reset domain. In your way we need to be aware who exactly schedules 
reset work and explicitly cancel them, this also means that for any 
new source added in the future you will need to remember to add him


I don't think that this is a valid argument. Additionally to the 
schedulers we probably just need less than a handful of reset 
sources, most likely even just one or two is enough.


The only justification I can see of having additional separate reset 
sources would be if somebody wants to know if a specific source has 
been handled or not (e.g. call flush_work() or work_pending()). Like 
in the case of a reset triggered through debugfs.



This is indeed one reason, another is as we said before that if you 
share 'reset source' (meaning a delayed work) with another client 
(i.e. RAS and KFD) it means you make assumption that the other client 
always proceeds with the
reset exactly the same way as you expect. So today we have this only 
in scheduler vs non scheduler reset happening - non scheduler reset 
clients assume the reset is always fully executed in HW while 
scheduler based reset makes shortcuts and not always does HW reset 
hence they cannot share 'reset source' (delayed work). Yes, we can 
always add this in the future if and when such problem will arise but 
no one will remember this then and a new bug will be introduced and 
will take time to find and resolve.


Mhm, so your main concern is that we forget to correctly handle the new 
reset sources?


How about we do it like this then:

struct amdgpu_reset_domain {
    
    union {
        struct {
            struct work_item debugfs;
            struct work_item ras;
            
        };
        struct work_item array[]
    } reset_sources;
}

Not 100% sure if that works, but something like that should do the trick.

My main concern is that I don't want to allocate the work items on the 
stack and dynamic allocation (e.g. kmalloc) is usually not possible either.


Additional to that putting/removing work items from a list, array or 
other container is a very common source for race conditions.


Regards,
Christian.



to the cancellation list which you showed above. In current way all 
this done automatically within reset_domain code and it's agnostic 
to specific driver and it's specific list of reset sources. Also in 
case we would want to generalize reset_domain to other GPU drivers 
(which was
a plan as far as i remember) this explicit mention of each reset 
works for cancellation is again less suitable in my opinion.


Well we could put the work item for the scheduler independent reset 
source into the reset domain as well. But I'm not sure those 
additional reset sources should be part of any common handling, that 
is largely amdgpu specific.



So it's for sure more then one source for the reasons described above, 
also note that for scheduler we already cancel delayed work in 
drm_sched_stop so calling them again in amdgpu code kind of superfluous.


Andrey




Christian.



Andrey






Andrey






The only difference is I chose to do the canceling right BEFORE 
the HW reset and not AFTER. I did this because I see a possible 
race where a new reset request is being generated exactly after 
we finished the HW reset but before we canceled out all pending 
resets - in such case you wold not want to cancel this 'border 
line new' reset request.


Why not? Any new reset request directly after a hardware reset is 
most likely just falsely generated by the reset itself.


Ideally I would cancel all sources after the reset, but before 
starting any new work.


Regards,
Christian.




Andrey




Regards,
Christian.

You can see that if many different reset sources share same 
work struct what can happen is that the first to obtain the 
lock you describe bellow might opt out from full HW reset 
because his bad job did signal for example or because his 
hunged IP block was able to recover through SW reset but in 
the meantime another reset source who needed an actual HW 
reset just silently returned and we end up with unhandled 
reset request. True that 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Andrey Grodzovsky



On 2022-05-11 03:38, Christian König wrote:

Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky:


On 2022-05-10 13:19, Christian König wrote:

Am 10.05.22 um 19:01 schrieb Andrey Grodzovsky:


On 2022-05-10 12:17, Christian König wrote:

Am 10.05.22 um 18:00 schrieb Andrey Grodzovsky:

[SNIP]
That's one of the reasons why we should have multiple work items 
for job-based reset and other reset sources.


See, the whole idea is the following:
1. We have one single work queue for each reset domain, which 
makes sure that all reset requests execute in order.
2. We have one delayed work item for each scheduler, which fires 
when a timeout on a scheduler occurs and eventually calls the 
reset procedure with the last running job.

3. We have one work item for each necessary hard reset.

The delayed work item from the scheduler first tries a soft 
recovery and checks if a hard reset is really necessary. If it's 
not necessary and we can cancel the offending job, we skip the 
hard reset.


The hard reset work item doesn't do any of those checks and just 
does a reset no matter what.


When we really do a reset, independent of whether it's triggered 
by a job or another source, we cancel all sources at the end of 
the reset procedure.


This makes sure that a) we only do one reset even when multiple 
sources fire at the same time and b) when a source bails out 
and only does a soft recovery, we still do a full reset when 
necessary.


That design was outlined multiple times now on the mailing list 
and looks totally clear to me. We should probably document that 
somewhere.
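
Condensed into code, the three pieces of that outline could look roughly like the sketch below. alloc_ordered_workqueue() provides the in-order guarantee; the structure layout and the handler name are assumptions made for the example:

#include <linux/slab.h>
#include <linux/workqueue.h>

/* 1) one ordered, single-threaded work queue per reset domain,
 * 3) plus one explicit work item for a hard reset. */
struct amdgpu_reset_domain {
	struct workqueue_struct	*wq;
	struct work_struct	hard_reset_work;
};

static void amdgpu_hard_reset_work_func(struct work_struct *work)
{
	/* no checks, just reset - placeholder for the real handler */
}

static struct amdgpu_reset_domain *amdgpu_reset_domain_create(void)
{
	struct amdgpu_reset_domain *domain;

	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
	if (!domain)
		return NULL;

	/* ordered queue => all reset requests execute strictly in order */
	domain->wq = alloc_ordered_workqueue("amdgpu-reset-domain", 0);
	if (!domain->wq) {
		kfree(domain);
		return NULL;
	}

	INIT_WORK(&domain->hard_reset_work, amdgpu_hard_reset_work_func);
	return domain;
}

/* 2) each scheduler's own delayed work item (its TDR) is queued on
 * this same queue, tries a soft recovery first and only escalates
 * to a HW reset when necessary. */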



If you look at the patch, what you described above is exactly what 
is happening: since the scheduler's delayed work is different from 
any non-scheduler delayed work, the SW reset which might take 
place from the scheduler's reset will not have any impact on 
non-scheduler delayed work and will not cancel it. In case the 
scheduler actually reaches the point of a HW reset, it will cancel 
all pending reset work from any other sources in the same reset 
domain. A non-scheduler reset will always proceed to do a full 
HW reset and will cancel any other pending resets.


Ok, but why do you then need that linked list? The number of reset 
sources should be static and not in any way dynamic.



So the array reset_src[i] holds a pointer to the pending delayed 
work from source i, or NULL if there is no pending work? 
What if the same source triggers multiple reset requests, such as 
multiple RAS errors at once; don't set the delayed work pointer in 
arr[RAS_index] if it's not already NULL?




See using the linked list sounds like you only wanted to cancel 
the reset sources raised so far which would not be correct as far 
as I can see.



Not clear about this one? We do want to cancel those reset sources 
that were raised so far, because we just did a HW reset which should 
fix them anyway. Those that have not raised a reset request so far 
will have a NULL ptr at their respective array index.


And exactly that's what I want to prevent. See you don't care if a 
reset source has fired once, twice, ten times or never. You just 
cancel all of them!


That's why I want to come to a static list of reset sources.

E.g. in the reset code (either before or after the reset, that's 
debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
    cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
...

You don't really need to track which reset source has fired and 
which hasn't, because that would be racy again. Instead just bluntly 
reset all possible sources.


Christian.



I don't say we care if it fired once or twice (I need to add a fix to 
only insert reset work to pending reset list if it's not already 
there), the point of using list (or array) to which you add and from 
which you remove is that the logic of this is encapsulated within 
reset domain. In your way we need to be aware who exactly schedules 
reset work and explicitly cancel them, this also means that for any 
new source added in the future you will need to remember to add him


I don't think that this is a valid argument. Additionally to the 
schedulers we probably just need less than a handful of reset sources, 
most likely even just one or two is enough.


The only justification I can see of having additional separate reset 
sources would be if somebody wants to know if a specific source has 
been handled or not (e.g. call flush_work() or work_pending()). Like 
in the case of a reset triggered through debugfs.



This is indeed one reason, another is as we said before that if you 
share 'reset source' (meaning a delayed work) with another client (i.e. 
RAS and KFD) it means you make assumption that the other client always 
proceeds with the
reset exactly the same way as you expect. So today we have this only in 
scheduler vs non scheduler reset happening - non scheduler reset clients 
assume the reset is always fully executed in 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Christian König

Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky:


On 2022-05-10 13:19, Christian König wrote:

Am 10.05.22 um 19:01 schrieb Andrey Grodzovsky:


On 2022-05-10 12:17, Christian König wrote:

Am 10.05.22 um 18:00 schrieb Andrey Grodzovsky:

[SNIP]
That's one of the reasons why we should have multiple work items 
for job based reset and other reset sources.


See the whole idea is the following:
1. We have one single queued work queue for each reset domain 
which makes sure that all reset requests execute in order.
2. We have one delayed work item for each scheduler which fires 
when a timeout on a scheduler occurs and eventually calls the 
reset procedure with the last running job.

3. We have one work item for each necessary hard reset.

The delayed work item from the scheduler first tries a soft 
recovery and checks if a hard reset is really necessary. If it's 
not necessary and we can cancel the offending job we skip the 
hard reset.


The hard reset work item doesn't do any of those checks and just 
does a reset no matter what.


When we really do a reset, independent if its triggered by a job 
or other source we cancel all sources at the end of the reset 
procedure.


This makes sure that a) We only do one reset even when multiple 
sources fire at the same time and b) when any source bails out 
and only does a soft recovery we do a full reset anyway when 
necessary.


That design was outlined multiple times now on the mailing list 
and looks totally clear to me. We should probably document that 
somewhere.



If you look at the patch what you described above is exactly what 
is happening - since scheduler's delayed work is different from 
any non scheduler delayed work the SW reset which might take place 
from scheduler's reset
will not have any impact on any non scheduler delayed work and 
will not cancel them. In case the scheduler actually reaches the 
point of HW reset it will cancel out all pending reset works from 
any other sources on the same
reset domain. Non scheduler reset will always proceed to do full 
HW reset and will cancel any other pending resets.


Ok, but why you then need that linked list? The number of reset 
sources should be static and not in any way dynamic.



So array reset_src[i] holds a pointer to pending delayed work from 
source i or NULL if no pedning work ?
What if same source triggers multiple reset requests such as 
multiple RAS errors at once , don't set the delayed work pointer in 
the arr[RAS_index] if it's already not NULL ?




See using the linked list sounds like you only wanted to cancel the 
reset sources raised so far which would not be correct as far as I 
can see.



Not clear about this one ? We do want to cancel those reset sources 
that were raised so far because we just did a HW reset which should 
fix them anyway ? Those who not raised reset request so far their 
respective array index will have a NULL ptr.


And exactly that's what I want to prevent. See you don't care if a 
reset source has fired once, twice, ten times or never. You just 
cancel all of them!


That's why I want to come to a static list of reset sources.

E.g. in the reset code (either before or after the reset, that's 
debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
    cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
...

You don't really need to track which reset source has fired and which 
hasn't, because that would be racy again. Instead just bluntly reset 
all possible sources.


Christian.



I don't say we care if it fired once or twice (I need to add a fix to 
only insert reset work to pending reset list if it's not already 
there), the point of using list (or array) to which you add and from 
which you remove is that the logic of this is encapsulated within 
reset domain. In your way we need to be aware who exactly schedules 
reset work and explicitly cancel them, this also means that for any 
new source added in the future you will need to remember to add him


I don't think that this is a valid argument. Additionally to the 
schedulers we probably just need less than a handful of reset sources, 
most likely even just one or two is enough.


The only justification I can see of having additional separate reset 
sources would be if somebody wants to know if a specific source has been 
handled or not (e.g. call flush_work() or work_pending()). Like in the 
case of a reset triggered through debugfs.


to the cancellation list which you showed above. In current way all 
this done automatically within reset_domain code and it's agnostic to 
specific driver and it's specific list of reset sources. Also in case 
we would want to generalize reset_domain to other GPU drivers (which was
a plan as far as i remember) this explicit mention of each reset works 
for cancellation is again less suitable in my opinion.


Well we could put the work item for the scheduler 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-10 Thread Andrey Grodzovsky



On 2022-05-10 13:19, Christian König wrote:

Am 10.05.22 um 19:01 schrieb Andrey Grodzovsky:


On 2022-05-10 12:17, Christian König wrote:

Am 10.05.22 um 18:00 schrieb Andrey Grodzovsky:

[SNIP]
That's one of the reasons why we should have multiple work items 
for job based reset and other reset sources.


See the whole idea is the following:
1. We have one single queued work queue for each reset domain 
which makes sure that all reset requests execute in order.
2. We have one delayed work item for each scheduler which fires 
when a timeout on a scheduler occurs and eventually calls the 
reset procedure with the last running job.

3. We have one work item for each necessary hard reset.

The delayed work item from the scheduler first tries a soft 
recovery and checks if a hard reset is really necessary. If it's 
not necessary and we can cancel the offending job we skip the hard 
reset.


The hard reset work item doesn't do any of those checks and just 
does a reset no matter what.


When we really do a reset, independent if its triggered by a job 
or other source we cancel all sources at the end of the reset 
procedure.


This makes sure that a) We only do one reset even when multiple 
sources fire at the same time and b) when any source bails out and 
only does a soft recovery we do a full reset anyway when necessary.


That design was outlined multiple times now on the mailing list 
and looks totally clear to me. We should probably document that 
somewhere.



If you look at the patch what you described above is exactly what 
is happening - since scheduler's delayed work is different from any 
non scheduler delayed work the SW reset which might take place from 
scheduler's reset
will not have any impact on any non scheduler delayed work and will 
not cancel them. In case the scheduler actually reaches the point 
of HW reset it will cancel out all pending reset works from any 
other sources on the same
reset domain. Non scheduler reset will always proceed to do full HW 
reset and will cancel any other pending resets.


Ok, but why you then need that linked list? The number of reset 
sources should be static and not in any way dynamic.



So array reset_src[i] holds a pointer to pending delayed work from 
source i or NULL if no pedning work ?
What if same source triggers multiple reset requests such as multiple 
RAS errors at once , don't set the delayed work pointer in the 
arr[RAS_index] if it's already not NULL ?




See using the linked list sounds like you only wanted to cancel the 
reset sources raised so far which would not be correct as far as I 
can see.



Not clear about this one ? We do want to cancel those reset sources 
that were raised so far because we just did a HW reset which should 
fix them anyway ? Those who not raised reset request so far their 
respective array index will have a NULL ptr.


And exactly that's what I want to prevent. See you don't care if a 
reset source has fired once, twice, ten times or never. You just 
cancel all of them!


That's why I want to come to a static list of reset sources.

E.g. in the reset code (either before or after the reset, that's 
debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
    cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
...

You don't really need to track which reset source has fired and which 
hasn't, because that would be racy again. Instead just bluntly reset 
all possible sources.


Christian.



I don't say we care if it fired once or twice (I need to add a fix to 
only insert a reset work item into the pending reset list if it's not 
already there). The point of using a list (or array) to which you add 
and from which you remove items is that this logic is encapsulated 
within the reset domain. In your way we need to be aware of who 
exactly schedules the reset work and explicitly cancel it, which also 
means that for any new source added in the future you will need to 
remember to add it to the cancellation list which you showed above. 
In the current way all of this is done automatically within the 
reset_domain code, and it's agnostic to the specific driver and its 
specific list of reset sources. Also in case we would want to 
generalize reset_domain to other GPU drivers (which was a plan as far 
as I remember), this explicit mention of each reset work for 
cancellation is again less suitable in my opinion.


Andrey






Andrey






The only difference is I chose to do the canceling right BEFORE the 
HW reset and not AFTER. I did this because I see a possible race 
where a new reset request is being generated exactly after we 
finished the HW reset but before we canceled out all pending resets 
- in such case you wold not want to cancel this 'border line new' 
reset request.


Why not? Any new reset request directly after a hardware reset is 
most likely just falsely generated by the reset itself.


Ideally I would cancel all sources after the reset, 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-10 Thread Christian König

Am 10.05.22 um 19:01 schrieb Andrey Grodzovsky:


On 2022-05-10 12:17, Christian König wrote:

Am 10.05.22 um 18:00 schrieb Andrey Grodzovsky:

[SNIP]
That's one of the reasons why we should have multiple work items 
for job based reset and other reset sources.


See the whole idea is the following:
1. We have one single queued work queue for each reset domain which 
makes sure that all reset requests execute in order.
2. We have one delayed work item for each scheduler which fires 
when a timeout on a scheduler occurs and eventually calls the reset 
procedure with the last running job.

3. We have one work item for each necessary hard reset.

The delayed work item from the scheduler first tries a soft 
recovery and checks if a hard reset is really necessary. If it's 
not necessary and we can cancel the offending job we skip the hard 
reset.


The hard reset work item doesn't do any of those checks and just 
does a reset no matter what.


When we really do a reset, independent if its triggered by a job or 
other source we cancel all sources at the end of the reset procedure.


This makes sure that a) We only do one reset even when multiple 
sources fire at the same time and b) when any source bails out and 
only does a soft recovery we do a full reset anyway when necessary.


That design was outlined multiple times now on the mailing list and 
looks totally clear to me. We should probably document that somewhere.



If you look at the patch what you described above is exactly what is 
happening - since scheduler's delayed work is different from any non 
scheduler delayed work the SW reset which might take place from 
scheduler's reset
will not have any impact on any non scheduler delayed work and will 
not cancel them. In case the scheduler actually reaches the point of 
HW reset it will cancel out all pending reset works from any other 
sources on the same
reset domain. Non scheduler reset will always proceed to do full HW 
reset and will cancel any other pending resets.


Ok, but why you then need that linked list? The number of reset 
sources should be static and not in any way dynamic.



So array reset_src[i] holds a pointer to pending delayed work from 
source i or NULL if no pedning work ?
What if same source triggers multiple reset requests such as multiple 
RAS errors at once , don't set the delayed work pointer in the 
arr[RAS_index] if it's already not NULL ?




See using the linked list sounds like you only wanted to cancel the 
reset sources raised so far which would not be correct as far as I 
can see.



Not clear about this one ? We do want to cancel those reset sources 
that were raised so far because we just did a HW reset which should 
fix them anyway ? Those who not raised reset request so far their 
respective array index will have a NULL ptr.


And exactly that's what I want to prevent. See you don't care if a reset 
source has fired once, twice, ten times or never. You just cancel all of 
them!


That's why I want to come to a static list of reset sources.

E.g. in the reset code (either before or after the reset, that's 
debatable) you do something like this:


for (i = 0; i < num_ring; ++i)
    cancel_delayed_work(ring[i]->scheduler)
cancel_work(adev->ras_work);
cancel_work(adev->iofault_work);
cancel_work(adev->debugfs_work);
...

You don't really need to track which reset source has fired and which 
hasn't, because that would be racy again. Instead just bluntly reset all 
possible sources.


Christian.



Andrey






The only difference is I chose to do the canceling right BEFORE the 
HW reset and not AFTER. I did this because I see a possible race 
where a new reset request is being generated exactly after we 
finished the HW reset but before we canceled out all pending resets 
- in such case you wold not want to cancel this 'border line new' 
reset request.


Why not? Any new reset request directly after a hardware reset is 
most likely just falsely generated by the reset itself.


Ideally I would cancel all sources after the reset, but before 
starting any new work.


Regards,
Christian.




Andrey




Regards,
Christian.

You can see that if many different reset sources share the same work 
struct, what can happen is that the first to obtain the lock you 
describe below might opt out of a full HW reset, for example because 
its bad job did signal, or because its hung IP block was able to 
recover through a SW reset, but in the meantime another reset source 
that needed an actual HW reset just silently returned, and we end up 
with an unhandled reset request. True, today this happens only for 
job-timeout reset sources that are handled from within the scheduler 
and won't use this single work struct, but nothing prevents this case 
from happening in the future, and also, if we actually want to unify 
the scheduler timeout handlers within the reset domain (which seems 
to me the right design approach), we won't be able to use just one 
work struct for this reason anyway.
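
The race being described follows directly from the queue_work() semantics, as the hedged illustration below shows; shared_reset_work and the per-source array (reusing the enum from the earlier sketch) are hypothetical fields:

/* With a shared item, queue_work() returns false if it is already
 * pending, so source B's request silently collapses into source A's.
 * If A then decides a soft recovery is enough, B's HW reset is lost. */
static bool request_reset_shared(struct amdgpu_device *adev)
{
	return queue_work(adev->reset_domain->wq, &adev->shared_reset_work);
}

/* With one item per source the requests cannot swallow each other;
 * the ordered workqueue still serializes their execution. */
static bool request_reset_per_source(struct amdgpu_device *adev,
				     enum amdgpu_reset_source src)
{
	return queue_work(adev->reset_domain->wq,
			  &adev->reset_domain->reset_work[src]);
}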




Just to add to this 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-10 Thread Andrey Grodzovsky



On 2022-05-10 12:17, Christian König wrote:

Am 10.05.22 um 18:00 schrieb Andrey Grodzovsky:

[SNIP]
That's one of the reasons why we should have multiple work items for 
job based reset and other reset sources.


See the whole idea is the following:
1. We have one single queued work queue for each reset domain which 
makes sure that all reset requests execute in order.
2. We have one delayed work item for each scheduler which fires when 
a timeout on a scheduler occurs and eventually calls the reset 
procedure with the last running job.

3. We have one work item for each necessary hard reset.

The delayed work item from the scheduler first tries a soft recovery 
and checks if a hard reset is really necessary. If it's not 
necessary and we can cancel the offending job we skip the hard reset.


The hard reset work item doesn't do any of those checks and just 
does a reset no matter what.


When we really do a reset, independent of whether it's triggered by a job 
or another source, we cancel all sources at the end of the reset procedure.


This makes sure that a) We only do one reset even when multiple 
sources fire at the same time and b) when any source bails out and 
only does a soft recovery we do a full reset anyway when necessary.


That design was outlined multiple times now on the mailing list and 
looks totally clear to me. We should probably document that somewhere.
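A compact sketch of that layout, under the assumption of one ordered 
workqueue per reset domain; every name below (reset_domain_sketch, 
sched_sketch, try_soft_recovery, ...) is an illustrative placeholder, 
not an existing amdgpu structure or helper:

#include <linux/slab.h>
#include <linux/workqueue.h>

/* Illustrative reset domain: one ordered queue serializes all resets. */
struct reset_domain_sketch {
    struct workqueue_struct *wq;
};

static struct reset_domain_sketch *reset_domain_create(void)
{
    struct reset_domain_sketch *d = kzalloc(sizeof(*d), GFP_KERNEL);

    if (!d)
        return NULL;

    /* Ordered workqueue: at most one reset work item runs at a time. */
    d->wq = alloc_ordered_workqueue("gpu-reset-domain", 0);
    if (!d->wq) {
        kfree(d);
        return NULL;
    }
    return d;
}

/* Illustrative per-scheduler state: delayed TDR work plus a hard-reset item. */
struct sched_sketch {
    struct delayed_work tdr_work;       /* fires on job timeout */
    struct work_struct hard_reset_work; /* queued only when really needed */
    struct reset_domain_sketch *domain;
};

/* Assumed helper: poke the ring / cancel the bad job, report success. */
static bool try_soft_recovery(struct sched_sketch *s)
{
    return false;
}

static void sched_tdr_handler(struct work_struct *work)
{
    struct sched_sketch *s = container_of(to_delayed_work(work),
                                          struct sched_sketch, tdr_work);

    /* Cheap path first; skip the hard reset if it succeeds. */
    if (try_soft_recovery(s))
        return;

    /* Escalate: the hard-reset item does a full reset, no questions asked. */
    queue_work(s->domain->wq, &s->hard_reset_work);
}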



If you look at the patch, what you described above is exactly what is 
happening - since the scheduler's delayed work is different from any 
non-scheduler delayed work, the SW reset which might take place from 
the scheduler's reset will not have any impact on any non-scheduler 
delayed work and will not cancel it. In case the scheduler actually 
reaches the point of a HW reset, it will cancel out all pending reset 
works from any other sources on the same reset domain. A non-scheduler 
reset will always proceed to do a full HW reset and will cancel any 
other pending resets.


Ok, but why do you then need that linked list? The number of reset 
sources should be static and not in any way dynamic.



So the array reset_src[i] holds a pointer to the pending delayed work from 
source i, or NULL if there is no pending work?
What if the same source triggers multiple reset requests, such as multiple 
RAS errors at once - do we simply not set the delayed work pointer in 
arr[RAS_index] if it's already non-NULL?




See, using the linked list it sounds like you only wanted to cancel the 
reset sources raised so far, which would not be correct as far as I can 
see.



Not clear about this one? We do want to cancel those reset sources that 
were raised so far, because we just did a HW reset which should fix them 
anyway. Those that have not raised a reset request so far will have a 
NULL ptr in their respective array index.


Andrey






The only difference is I chose to do the canceling right BEFORE the 
HW reset and not AFTER. I did this because I see a possible race 
where a new reset request is generated exactly after we finished the 
HW reset but before we canceled out all pending resets; in such a case 
you would not want to cancel this 'borderline new' reset request.


Why not? Any new reset request directly after a hardware reset is most 
likely just falsely generated by the reset itself.


Ideally I would cancel all sources after the reset, but before 
starting any new work.


Regards,
Christian.




Andrey




Regards,
Christian.

You can see that if many different reset sources share the same work 
struct, what can happen is that the first to obtain the lock you 
describe below might opt out from a full HW reset, because its bad 
job did signal for example, or because its hung IP block was able 
to recover through a SW reset. But in the meantime another reset 
source who needed an actual HW reset just silently returned, and we 
end up with an unhandled reset request. True that today this happens 
only to job timeout reset sources that are handled from within the 
scheduler and won't use this single work struct, but no one 
prevents a future case for this to happen. Also, if we actually 
want to unify scheduler timeout handlers within the reset domain 
(which seems to me the right design approach), we won't be able to 
use just one work struct for this reason anyway.




Just to add to this point - a reset domain is a co-operative domain. 
In addition to sharing a set of clients from various reset sources 
for one device, it will also have a set of devices, like in an XGMI 
hive. The job timeout on one device may not eventually result in a 
reset, but a RAS error happening on another device at the same 
time would need a reset. The second device's RAS error cannot 
return on seeing that a reset work has already started, or ignore the 
reset work given that another device has filled in the reset data.


When there is a reset domain, it should take care of the scheduled 
work, and keeping it at the device or any other level doesn't sound 
good.


Thanks,
Lijo


Andrey




I'd put the reset work struct into the reset_domain struct. 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-10 Thread Christian König

Am 10.05.22 um 18:00 schrieb Andrey Grodzovsky:

[SNIP]
That's one of the reasons why we should have multiple work items for 
job based reset and other reset sources.


See the whole idea is the following:
1. We have one single queued work queue for each reset domain which 
makes sure that all reset requests execute in order.
2. We have one delayed work item for each scheduler which fires when 
a timeout on a scheduler occurs and eventually calls the reset 
procedure with the last running job.

3. We have one work item for each necessary hard reset.

The delayed work item from the scheduler first tries a soft recovery 
and checks if a hard reset is really necessary. If it's not necessary 
and we can cancel the offending job we skip the hard reset.


The hard reset work item doesn't do any of those checks and just does 
a reset no matter what.


When we really do a reset, independent of whether it's triggered by a job 
or another source, we cancel all sources at the end of the reset procedure.


This makes sure that a) We only do one reset even when multiple 
sources fire at the same time and b) when any source bails out and 
only does a soft recovery we do a full reset anyway when necessary.


That design was outlined multiple times now on the mailing list and 
looks totally clear to me. We should probably document that somewhere.



If you look at the patch, what you described above is exactly what is 
happening - since the scheduler's delayed work is different from any 
non-scheduler delayed work, the SW reset which might take place from 
the scheduler's reset will not have any impact on any non-scheduler 
delayed work and will not cancel it. In case the scheduler actually 
reaches the point of a HW reset, it will cancel out all pending reset 
works from any other sources on the same reset domain. A non-scheduler 
reset will always proceed to do a full HW reset and will cancel any 
other pending resets.


Ok, but why do you then need that linked list? The number of reset sources 
should be static and not in any way dynamic.


See, using the linked list it sounds like you only wanted to cancel the 
reset sources raised so far, which would not be correct as far as I can see.




The only difference is I chose to do the canceling right BEFORE the HW 
reset and not AFTER. I did this because I see a possible race where a 
new reset request is generated exactly after we finished the HW reset 
but before we canceled out all pending resets; in such a case you 
would not want to cancel this 'borderline new' reset request.


Why not? Any new reset request directly after a hardware reset is most 
likely just falsely generated by the reset itself.


Ideally I would cancel all sources after the reset, but before starting 
any new work.


Regards,
Christian.




Andrey




Regards,
Christian.

You can see that if many different reset sources share the same work 
struct, what can happen is that the first to obtain the lock you 
describe below might opt out from a full HW reset, because its bad 
job did signal for example, or because its hung IP block was able 
to recover through a SW reset. But in the meantime another reset 
source who needed an actual HW reset just silently returned, and we 
end up with an unhandled reset request. True that today this happens 
only to job timeout reset sources that are handled from within the 
scheduler and won't use this single work struct, but no one prevents 
a future case for this to happen. Also, if we actually want to 
unify scheduler timeout handlers within the reset domain (which seems 
to me the right design approach), we won't be able to use just one 
work struct for this reason anyway.




Just to add to this point - a reset domain is a co-operative domain. 
In addition to sharing a set of clients from various reset sources 
for one device, it will also have a set of devices, like in an XGMI 
hive. The job timeout on one device may not eventually result in a 
reset, but a RAS error happening on another device at the same time 
would need a reset. The second device's RAS error cannot return 
on seeing that a reset work has already started, or ignore the reset 
work given that another device has filled in the reset data.


When there is a reset domain, it should take care of the scheduled 
work, and keeping it at the device or any other level doesn't sound 
good.


Thanks,
Lijo


Andrey




I'd put the reset work struct into the reset_domain struct. That 
way you'd have exactly one worker for the reset domain. You could 
implement a lock-less scheme to decide whether you need to 
schedule a reset, e.g. using an atomic counter in the shared work 
struct that gets incremented when a client wants to trigger a 
reset (atomic_add_return). If that counter is exactly 1 after 
incrementing, you need to fill in the rest of the work struct and 
schedule the work. Otherwise, it's already scheduled (or another 
client is in the process of scheduling it) and you just return. 
When the worker finishes (after confirming a successful reset), it 
resets 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-10 Thread Andrey Grodzovsky



On 2022-05-06 04:56, Christian König wrote:

Am 06.05.22 um 08:02 schrieb Lazar, Lijo:

On 5/6/2022 3:17 AM, Andrey Grodzovsky wrote:


On 2022-05-05 15:49, Felix Kuehling wrote:


Am 2022-05-05 um 14:57 schrieb Andrey Grodzovsky:


On 2022-05-05 11:06, Christian König wrote:

Am 05.05.22 um 15:54 schrieb Andrey Grodzovsky:


On 2022-05-05 09:23, Christian König wrote:

Am 05.05.22 um 15:15 schrieb Andrey Grodzovsky:

On 2022-05-05 06:09, Christian König wrote:


Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky:

Problem:
During hive reset caused by a command timing out on a ring,
extra resets are triggered by KFD, which is
unable to access registers on the resetting ASIC.

Fix: Rework GPU reset to use a list of pending reset jobs
such that the first reset job that actually resets the entire
reset domain will cancel all those pending redundant resets.

This is in line with what we already do for redundant TDRs
in scheduler code.


Mhm, why exactly do you need the extra linked list then?

Let's talk about that on our call today.



Going to miss it as you know, and also this is the place to 
discuss technical questions anyway so -


Good point.

It's needed because those other resets are not time out 
handlers that are governed by the scheduler
but rather external resets that are triggered by such clients 
as KFD, RAS and sysfs. Scheduler has no
knowledge of them (and should not have) but they are 
serialized into same wq as the TO handlers
from the scheduler. It just happens that TO triggered reset 
causes in turn another reset (from KFD in
this case) and we want to prevent this second reset from 
taking place just as we want to avoid multiple

TO resets to take place in scheduler code.


Yeah, but why do you need multiple workers?

You have a single worker for the GPU reset not triggered by the 
scheduler in your adev and cancel that at the end of the reset 
procedure.


If anybody thinks it needs to trigger another reset while in 
reset (which is actually a small design bug separately) the 
reset will just be canceled in the same way we cancel the 
scheduler resets.


Christian.



Had this in mind at first but then I realized that each client 
(RAS, KFD and sysfs) will want to fill his own data for the work 
(see amdgpu_device_gpu_recover) - for XGMI hive each will want 
to set his own adev (which is fine if you set a work per adev as 
you suggest) but also each client might want (they all put NULL 
there but in theory in the future) also set his own bad job 
value and here you might have a collision.


Yeah, but that is intentional. See when we have a job that needs 
to be consumed by the reset handler and not overwritten or 
something.



I am not sure why this is a requirement, multiple clients can 
decide concurrently to trigger a reset for some reason (possibly 
independent reasons) hence they cannot share same work struct to 
pass to it their data.


If those concurrent clients could detect that a reset was already 
in progress, you wouldn't need the complexity of multiple work 
structs being scheduled. You could simply return without triggering 
another reset.



In my view the main problem here with a single work struct, either at 
the reset domain level or even the adev level, is that in some cases 
we optimize resets and don't really perform an ASIC HW reset (see 
amdgpu_job_timedout with soft recovery, and skip_hw_reset in 
amdgpu_device_gpu_recover_imp for the case where the bad job gets 
signaled just before we start the HW reset and we just skip it).


That's one of the reasons why we should have multiple work items for 
job based reset and other reset sources.


See the whole idea is the following:
1. We have one single queued work queue for each reset domain which 
makes sure that all reset requests execute in order.
2. We have one delayed work item for each scheduler which fires when a 
timeout on a scheduler occurs and eventually calls the reset procedure 
with the last running job.

3. We have one work item for each necessary hard reset.

The delayed work item from the scheduler first tries a soft recovery 
and checks if a hard reset is really necessary. If it's not necessary 
and we can cancel the offending job we skip the hard reset.


The hard reset work item doesn't do any of those checks and just does 
a reset no matter what.


When we really do a reset, independent of whether it's triggered by a job 
or another source, we cancel all sources at the end of the reset procedure.


This makes sure that a) We only do one reset even when multiple 
sources fire at the same time and b) when any source bails out and 
only does a soft recovery we do a full reset anyway when necessary.


That design was outlined multiple times now on the mailing list and 
looks totally clear to me. We should probably document that somewhere.



If you look at the patch, what you described above is exactly what is 
happening - since the scheduler's delayed work is different from any 
non-scheduler delayed work, the SW reset which might 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-06 Thread Christian König

Am 06.05.22 um 08:02 schrieb Lazar, Lijo:

On 5/6/2022 3:17 AM, Andrey Grodzovsky wrote:


On 2022-05-05 15:49, Felix Kuehling wrote:


Am 2022-05-05 um 14:57 schrieb Andrey Grodzovsky:


On 2022-05-05 11:06, Christian König wrote:

Am 05.05.22 um 15:54 schrieb Andrey Grodzovsky:


On 2022-05-05 09:23, Christian König wrote:

Am 05.05.22 um 15:15 schrieb Andrey Grodzovsky:

On 2022-05-05 06:09, Christian König wrote:


Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky:

Problem:
During hive reset caused by a command timing out on a ring,
extra resets are triggered by KFD, which is
unable to access registers on the resetting ASIC.

Fix: Rework GPU reset to use a list of pending reset jobs
such that the first reset job that actually resets the entire
reset domain will cancel all those pending redundant resets.

This is in line with what we already do for redundant TDRs
in scheduler code.


Mhm, why exactly do you need the extra linked list then?

Let's talk about that on our call today.



Going to miss it as you know, and also this is the place to 
discuss technical questions anyway so -


Good point.

It's needed because those other resets are not time out 
handlers that are governed by the scheduler
but rather external resets that are triggered by such clients 
as KFD, RAS and sysfs. Scheduler has no
knowledge of them (and should not have) but they are serialized 
into same wq as the TO handlers
from the scheduler. It just happens that TO triggered reset 
causes in turn another reset (from KFD in
this case) and we want to prevent this second reset from taking 
place just as we want to avoid multiple

TO resets to take place in scheduler code.


Yeah, but why do you need multiple workers?

You have a single worker for the GPU reset not triggered by the 
scheduler in your adev and cancel that at the end of the reset 
procedure.


If anybody thinks it needs to trigger another reset while in 
reset (which is actually a small design bug separately) the 
reset will just be canceled in the same way we cancel the 
scheduler resets.


Christian.



Had this in mind at first but then I realized that each client 
(RAS, KFD and sysfs) will want to fill his own data for the work 
(see amdgpu_device_gpu_recover) - for XGMI hive each will want to 
set his own adev (which is fine if you set a work per adev as you 
suggest) but also each client might want (they all put NULL there 
but in theory in the future) also set his own bad job value and 
here you might have a collision.


Yeah, but that is intentional. See when we have a job that needs 
to be consumed by the reset handler and not overwritten or something.



I am not sure why this is a requirement, multiple clients can 
decide concurrently to trigger a reset for some reason (possibly 
independent reasons) hence they cannot share same work struct to 
pass to it their data.


If those concurrent clients could detect that a reset was already in 
progress, you wouldn't need the complexity of multiple work structs 
being scheduled. You could simply return without triggering another 
reset.



In my view the main problem here with a single work struct, either at 
the reset domain level or even the adev level, is that in some cases 
we optimize resets and don't really perform an ASIC HW reset (see 
amdgpu_job_timedout with soft recovery, and skip_hw_reset in 
amdgpu_device_gpu_recover_imp for the case where the bad job gets 
signaled just before we start the HW reset and we just skip it).


That's one of the reasons why we should have multiple work items for job 
based reset and other reset sources.


See the whole idea is the following:
1. We have one single queued work queue for each reset domain which 
makes sure that all reset requests execute in order.
2. We have one delayed work item for each scheduler which fires when a 
timeout on a scheduler occurs and eventually calls the reset procedure 
with the last running job.

3. We have one work item for each necessary hard reset.

The delayed work item from the scheduler first tries a soft recovery and 
checks if a hard reset is really necessary. If it's not necessary and we 
can cancel the offending job we skip the hard reset.


The hard reset work item doesn't do any of those checks and just does a 
reset no matter what.


When we really do a reset, independent of whether it's triggered by a job 
or another source, we cancel all sources at the end of the reset procedure.


This makes sure that a) We only do one reset even when multiple sources 
fire at the same time and b) when any source bails out and only does a 
soft recovery we do a full reset anyway when necessary.


That design was outlined multiple times now on the mailing list and 
looks totally clear to me. We should probably document that somewhere.


Regards,
Christian.

You can see that if many different reset sources share the same work 
struct, what can happen is that the first to obtain the lock you 
describe below might opt out from a full HW reset because its bad job 
did signal for 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-06 Thread Lazar, Lijo




On 5/6/2022 3:17 AM, Andrey Grodzovsky wrote:


On 2022-05-05 15:49, Felix Kuehling wrote:


Am 2022-05-05 um 14:57 schrieb Andrey Grodzovsky:


On 2022-05-05 11:06, Christian König wrote:

Am 05.05.22 um 15:54 schrieb Andrey Grodzovsky:


On 2022-05-05 09:23, Christian König wrote:

Am 05.05.22 um 15:15 schrieb Andrey Grodzovsky:

On 2022-05-05 06:09, Christian König wrote:


Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky:

Problem:
During hive reset caused by a command timing out on a ring,
extra resets are triggered by KFD, which is
unable to access registers on the resetting ASIC.

Fix: Rework GPU reset to use a list of pending reset jobs
such that the first reset job that actually resets the entire
reset domain will cancel all those pending redundant resets.

This is in line with what we already do for redundant TDRs
in scheduler code.


Mhm, why exactly do you need the extra linked list then?

Let's talk about that on our call today.



Going to miss it as you know, and also this is the place to 
discuss technical questions anyway so -


Good point.

It's needed because those other resets are not time out handlers 
that are governed by the scheduler
but rather external resets that are triggered by such clients as 
KFD, RAS and sysfs. Scheduler has no
knowledge of them (and should not have) but they are serialized 
into same wq as the TO handlers
from the scheduler. It just happens that TO triggered reset 
causes in turn another reset (from KFD in
this case) and we want to prevent this second reset from taking 
place just as we want to avoid multiple

TO resets to take place in scheduler code.


Yeah, but why do you need multiple workers?

You have a single worker for the GPU reset not triggered by the 
scheduler in your adev and cancel that at the end of the reset 
procedure.


If anybody thinks it needs to trigger another reset while in reset 
(which is actually a small design bug separately) the reset will 
just be canceled in the same way we cancel the scheduler resets.


Christian.



Had this in mind at first but then I realized that each client 
(RAS, KFD and sysfs) will want to fill his own data for the work 
(see amdgpu_device_gpu_recover) - for XGMI hive each will want to 
set his own adev (which is fine if you set a work per adev as you 
suggest) but also each client might want (they all put NULL there 
but in theory in the future) also set his own bad job value and 
here you might have a collision.


Yeah, but that is intentional. See when we have a job that needs to 
be consumed by the reset handler and not overwritten or something.



I am not sure why this is a requirement, multiple clients can decide 
concurrently to trigger a reset for some reason (possibly independent 
reasons) hence they cannot share same work struct to pass to it their 
data.


If those concurrent clients could detect that a reset was already in 
progress, you wouldn't need the complexity of multiple work structs 
being scheduled. You could simply return without triggering another 
reset.



In my view the main problem here with a single work struct, either at the 
reset domain level or even the adev level, is that in some cases we optimize 
resets and don't really perform an ASIC HW reset (see amdgpu_job_timedout 
with soft recovery, and skip_hw_reset in amdgpu_device_gpu_recover_imp for 
the case where the bad job gets signaled just before we start the HW reset 
and we just skip it). You can see that if many different reset sources share 
the same work struct, what can happen is that the first to obtain the lock 
you describe below might opt out from a full HW reset, because its bad job 
did signal for example, or because its hung IP block was able to recover 
through a SW reset. But in the meantime another reset source who needed an 
actual HW reset just silently returned, and we end up with an unhandled 
reset request. True that today this happens only to job timeout reset 
sources that are handled from within the scheduler and won't use this 
single work struct, but no one prevents a future case for this to happen. 
Also, if we actually want to unify scheduler timeout handlers within the 
reset domain (which seems to me the right design approach), we won't be 
able to use just one work struct for this reason anyway.




Just to add to this point - a reset domain is a co-operative domain. In 
addition to sharing a set of clients from various reset sources for one 
device, it will also have a set of devices, like in an XGMI hive. The job 
timeout on one device may not eventually result in a reset, but a RAS 
error happening on another device at the same time would need a reset. 
The second device's RAS error cannot return on seeing that a reset work 
has already started, or ignore the reset work given that another device 
has filled in the reset data.


When there is a reset domain, it should take care of the scheduled work, 
and keeping it at the device or any other level doesn't sound good.


Thanks,
Lijo


Andrey




I'd put the reset work struct into 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Luben Tuikov



On 2022-05-05 17:47, Andrey Grodzovsky wrote:
> 
> On 2022-05-05 15:49, Felix Kuehling wrote:
>>
>> Am 2022-05-05 um 14:57 schrieb Andrey Grodzovsky:
>>>
>>> On 2022-05-05 11:06, Christian König wrote:
 Am 05.05.22 um 15:54 schrieb Andrey Grodzovsky:
>
> On 2022-05-05 09:23, Christian König wrote:
>> Am 05.05.22 um 15:15 schrieb Andrey Grodzovsky:
>>> On 2022-05-05 06:09, Christian König wrote:
>>>
 Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky:
> Problem:
> During hive reset caused by a command timing out on a ring,
> extra resets are triggered by KFD, which is
> unable to access registers on the resetting ASIC.
>
> Fix: Rework GPU reset to use a list of pending reset jobs
> such that the first reset job that actually resets the entire
> reset domain will cancel all those pending redundant resets.
>
> This is in line with what we already do for redundant TDRs
> in scheduler code.

 Mhm, why exactly do you need the extra linked list then?

 Let's talk about that on our call today.
>>>
>>>
>>> Going to miss it as you know, and also this is the place to 
>>> discuss technical questions anyway so -
>>
>> Good point.
>>
>>> It's needed because those other resets are not time out handlers 
>>> that are governed by the scheduler
>>> but rather external resets that are triggered by such clients as 
>>> KFD, RAS and sysfs. Scheduler has no
>>> knowledge of them (and should not have) but they are serialized 
>>> into same wq as the TO handlers
>>> from the scheduler. It just happens that TO triggered reset 
>>> causes in turn another reset (from KFD in
>>> this case) and we want to prevent this second reset from taking 
>>> place just as we want to avoid multiple
>>> TO resets to take place in scheduler code.
>>
>> Yeah, but why do you need multiple workers?
>>
>> You have a single worker for the GPU reset not triggered by the 
>> scheduler in your adev and cancel that at the end of the reset 
>> procedure.
>>
>> If anybody thinks it needs to trigger another reset while in reset 
>> (which is actually a small design bug separately) the reset will 
>> just be canceled in the same way we cancel the scheduler resets.
>>
>> Christian.
>
>
> Had this in mind at first but then I realized that each client 
> (RAS, KFD and sysfs) will want to fill his own data for the work 
> (see amdgpu_device_gpu_recover) - for XGMI hive each will want to 
> set his own adev (which is fine if you set a work per adev as you 
> suggest) but also each client might want (they all put NULL there 
> but in theory in the future) also set his own bad job value and 
> here you might have a collision.

 Yeah, but that is intentional. See when we have a job that needs to 
 be consumed by the reset handler and not overwritten or something.
>>>
>>>
>>> I am not sure why this is a requirement, multiple clients can decide 
>>> concurrently to trigger a reset for some reason (possibly independent 
>>> reasons) hence they cannot share same work struct to pass to it their 
>>> data.
>>
>> If those concurrent clients could detect that a reset was already in 
>> progress, you wouldn't need the complexity of multiple work structs 
>> being scheduled. You could simply return without triggering another 
>> reset.
> 
> 
> In my view the main problem here with a single work struct, either at the 
> reset domain level or even the adev level, is that in some cases we optimize 
> resets and don't really perform an ASIC HW reset (see amdgpu_job_timedout 
> with soft recovery, and skip_hw_reset in amdgpu_device_gpu_recover_imp for 
> the case where the bad job gets signaled just before we start the HW reset 
> and we just skip it). You can see that if many different reset sources share 
> the same work struct, what can happen is that the first to obtain the lock 
> you describe below might opt out from a full HW reset, because its bad job 

The problem is this "opting out" of reset--meaning that the client didn't do
the complete recovery work, and was too quick to call for a GPU reset. So
they need to fix this locally in their code.

Generally when a GPU reset is scheduled, it should proceed regardless,
and there shouldn't be any opting out.

> did signal for example, or because its hung IP block was able to 
> recover through a SW reset. But in the meantime another reset source who 
> needed an actual HW reset just silently returned, and we end up with 
> an unhandled reset request. True that today this happens only to job 
> timeout reset sources that are handled from within the scheduler and 
> won't use this single work struct, but no one prevents a future case for 
> this to happen. Also, if we actually want to unify scheduler timeout 
> handlers within the reset 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Andrey Grodzovsky



On 2022-05-05 15:49, Felix Kuehling wrote:


Am 2022-05-05 um 14:57 schrieb Andrey Grodzovsky:


On 2022-05-05 11:06, Christian König wrote:

Am 05.05.22 um 15:54 schrieb Andrey Grodzovsky:


On 2022-05-05 09:23, Christian König wrote:

Am 05.05.22 um 15:15 schrieb Andrey Grodzovsky:

On 2022-05-05 06:09, Christian König wrote:


Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky:

Problem:
During hive reset caused by a command timing out on a ring,
extra resets are triggered by KFD, which is
unable to access registers on the resetting ASIC.

Fix: Rework GPU reset to use a list of pending reset jobs
such that the first reset job that actually resets the entire
reset domain will cancel all those pending redundant resets.

This is in line with what we already do for redundant TDRs
in scheduler code.


Mhm, why exactly do you need the extra linked list then?

Let's talk about that on our call today.



Going to miss it as you know, and also this is the place to 
discuss technical questions anyway so -


Good point.

It's needed because those other resets are not time out handlers 
that are governed by the scheduler
but rather external resets that are triggered by such clients as 
KFD, RAS and sysfs. Scheduler has no
knowledge of them (and should not have) but they are serialized 
into same wq as the TO handlers
from the scheduler. It just happens that TO triggered reset 
causes in turn another reset (from KFD in
this case) and we want to prevent this second reset from taking 
place just as we want to avoid multiple

TO resets to take place in scheduler code.


Yeah, but why do you need multiple workers?

You have a single worker for the GPU reset not triggered by the 
scheduler in your adev and cancel that at the end of the reset 
procedure.


If anybody thinks it needs to trigger another reset while in reset 
(which is actually a small design bug separately) the reset will 
just be canceled in the same way we cancel the scheduler resets.


Christian.



Had this in mind at first but then I realized that each client 
(RAS, KFD and sysfs) will want to fill his own data for the work 
(see amdgpu_device_gpu_recover) - for XGMI hive each will want to 
set his own adev (which is fine if you set a work per adev as you 
suggest) but also each client might want (they all put NULL there 
but in theory in the future) also set his own bad job value and 
here you might have a collision.


Yeah, but that is intentional. See when we have a job that needs to 
be consumed by the reset handler and not overwritten or something.



I am not sure why this is a requirement, multiple clients can decide 
concurrently to trigger a reset for some reason (possibly independent 
reasons) hence they cannot share same work struct to pass to it their 
data.


If those concurrent clients could detect that a reset was already in 
progress, you wouldn't need the complexity of multiple work structs 
being scheduled. You could simply return without triggering another 
reset.



In my view the main problem here with a single work struct, either at the 
reset domain level or even the adev level, is that in some cases we optimize 
resets and don't really perform an ASIC HW reset (see amdgpu_job_timedout 
with soft recovery, and skip_hw_reset in amdgpu_device_gpu_recover_imp for 
the case where the bad job gets signaled just before we start the HW reset 
and we just skip it). You can see that if many different reset sources share 
the same work struct, what can happen is that the first to obtain the lock 
you describe below might opt out from a full HW reset, because its bad job 
did signal for example, or because its hung IP block was able to recover 
through a SW reset. But in the meantime another reset source who needed an 
actual HW reset just silently returned, and we end up with an unhandled 
reset request. True that today this happens only to job timeout reset 
sources that are handled from within the scheduler and won't use this 
single work struct, but no one prevents a future case for this to happen. 
Also, if we actually want to unify scheduler timeout handlers within the 
reset domain (which seems to me the right design approach), we won't be 
able to use just one work struct for this reason anyway.


Andrey




I'd put the reset work struct into the reset_domain struct. That way 
you'd have exactly one worker for the reset domain. You could 
implement a lock-less scheme to decide whether you need to schedule a 
reset, e.g. using an atomic counter in the shared work struct that 
gets incremented when a client wants to trigger a reset 
(atomic_add_return). If that counter is exactly 1 after incrementing, 
you need to fill in the rest of the work struct and schedule the work. 
Otherwise, it's already scheduled (or another client is in the process 
of scheduling it) and you just return. When the worker finishes (after 
confirming a successful reset), it resets the counter to 0, so the 
next client requesting a reset will schedule the worker again.


Regards,
  Felix

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Felix Kuehling



Am 2022-05-05 um 14:57 schrieb Andrey Grodzovsky:


On 2022-05-05 11:06, Christian König wrote:

Am 05.05.22 um 15:54 schrieb Andrey Grodzovsky:


On 2022-05-05 09:23, Christian König wrote:

Am 05.05.22 um 15:15 schrieb Andrey Grodzovsky:

On 2022-05-05 06:09, Christian König wrote:


Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky:

Problem:
During hive reset caused by a command timing out on a ring,
extra resets are triggered by KFD, which is
unable to access registers on the resetting ASIC.

Fix: Rework GPU reset to use a list of pending reset jobs
such that the first reset job that actually resets the entire
reset domain will cancel all those pending redundant resets.

This is in line with what we already do for redundant TDRs
in scheduler code.


Mhm, why exactly do you need the extra linked list then?

Let's talk about that on our call today.



Going to miss it as you know, and also this is the place to 
discuss technical questions anyway so -


Good point.

It's needed because those other resets are not time out handlers 
that are governed by the scheduler
but rather external resets that are triggered by such clients as 
KFD, RAS and sysfs. Scheduler has no
knowledge of them (and should not have) but they are serialized 
into same wq as the TO handlers
from the scheduler. It just happens that TO triggered reset causes 
in turn another reset (from KFD in
this case) and we want to prevent this second reset from taking 
place just as we want to avoid multiple

TO resets to take place in scheduler code.


Yeah, but why do you need multiple workers?

You have a single worker for the GPU reset not triggered by the 
scheduler in your adev and cancel that at the end of the reset 
procedure.


If anybody thinks it needs to trigger another reset while in reset 
(which is actually a small design bug separately) the reset will 
just be canceled in the same way we cancel the scheduler resets.


Christian.



Had this in mind at first but then I realized that each client (RAS, 
KFD and sysfs) will want to fill his own data for the work (see 
amdgpu_device_gpu_recover) - for XGMI hive each will want to set his 
own adev (which is fine if you set a work per adev as you suggest) 
but also each client might want (they all put NULL there but in 
theory in the future) also set his own bad job value and here you 
might have a collision.


Yeah, but that is intentional. See when we have a job that needs to 
be consumed by the reset handler and not overwritten or something.



I am not sure why this is a requirement, multiple clients can decide 
concurrently to trigger a reset for some reason (possibly independent 
reasons) hence they cannot share same work struct to pass to it their 
data.


If those concurrent clients could detect that a reset was already in 
progress, you wouldn't need the complexity of multiple work structs 
being scheduled. You could simply return without triggering another reset.


I'd put the reset work struct into the reset_domain struct. That way 
you'd have exactly one worker for the reset domain. You could implement 
a lock-less scheme to decide whether you need to schedule a reset, e.g. 
using an atomic counter in the shared work struct that gets incremented 
when a client wants to trigger a reset (atomic_add_return). If that 
counter is exactly 1 after incrementing, you need to fill in the rest of 
the work struct and schedule the work. Otherwise, it's already scheduled 
(or another client is in the process of scheduling it) and you just 
return. When the worker finishes (after confirming a successful reset), 
it resets the counter to 0, so the next client requesting a reset will 
schedule the worker again.
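In code, the scheme described here could look roughly like the sketch below. 
All struct and function names (shared_reset_work, request_reset, reset_worker) 
are made up for illustration; only the workqueue and atomic APIs are real.

#include <linux/atomic.h>
#include <linux/workqueue.h>

/* Sketch only: one shared work struct per reset domain. */
struct shared_reset_work {
    struct work_struct work;
    atomic_t pending;   /* 0 = idle, >0 = a reset is already scheduled */
    /* reset parameters (adev, bad job, ...) would be filled in here */
};

/* Called by any client (RAS, KFD, sysfs, scheduler) that wants a reset. */
static void request_reset(struct shared_reset_work *r)
{
    /* Only the caller that moves the counter from 0 to 1 schedules it. */
    if (atomic_add_return(1, &r->pending) != 1)
        return;

    /* Fill in the rest of the work data here, then schedule. */
    schedule_work(&r->work);
}

/* Worker: perform the reset, then re-arm the scheme. */
static void reset_worker(struct work_struct *work)
{
    struct shared_reset_work *r =
        container_of(work, struct shared_reset_work, work);

    /* ... perform and confirm the actual reset ... */

    /* Reset the counter so the next request schedules the worker again. */
    atomic_set(&r->pending, 0);
}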


Regards,
  Felix








Additional to that keep in mind that you can't allocate any memory 
before or during the GPU reset nor wait for the reset to complete (so 
you can't allocate anything on the stack either).



There is no dynamic allocation here, regarding stack allocations - we 
do it all the time when we call functions, even during GPU resets, how 
on stack allocation of work struct in amdgpu_device_gpu_recover is 
different from any other local variable we allocate in any function we 
call ?


I am also not sure why it's not allowed to wait for reset to complete 
? Also, see in amdgpu_ras_do_recovery and gpu_recover_get (debugfs) - 
the caller expects the reset to complete before he returns. I can 
probably work around it in RAS code by calling 
atomic_set(>in_recovery, 0)  from some callback within actual 
reset function but regarding sysfs it actually expects a result 
returned indicating whether the call was successful or not.
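A minimal sketch of the synchronous usage described here (RAS recovery and 
the sysfs/debugfs path waiting for a result), assuming the caller may block 
and queues an on-stack item on the domain's ordered queue; struct and 
function names are placeholders, not the driver's API:

#include <linux/errno.h>
#include <linux/workqueue.h>

/* Placeholder: a recover request a synchronous caller waits on. */
struct recover_work {
    struct work_struct base;    /* initialized with INIT_WORK() + handler */
    int ret;                    /* filled in by the reset worker */
};

static int gpu_recover_sync(struct workqueue_struct *domain_wq,
                            struct recover_work *w)
{
    if (!queue_work(domain_wq, &w->base))
        return -EBUSY;          /* this exact item is already queued */

    /*
     * Wait for the queued reset to run; the caller must not hold any
     * lock that the reset path itself needs.
     */
    flush_work(&w->base);
    return w->ret;
}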


Andrey




I don't think that concept you try here will work.

Regards,
Christian.

Also in general seems to me it's cleaner approach where this logic 
(the work items) are held and handled in reset_domain and are not 
split in each adev or any other entity. We might want in the future 
to even move the scheduler handling into reset 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Andrey Grodzovsky



On 2022-05-05 11:06, Christian König wrote:

Am 05.05.22 um 15:54 schrieb Andrey Grodzovsky:


On 2022-05-05 09:23, Christian König wrote:

Am 05.05.22 um 15:15 schrieb Andrey Grodzovsky:

On 2022-05-05 06:09, Christian König wrote:


Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky:

Problem:
During hive reset caused by a command timing out on a ring,
extra resets are triggered by KFD, which is
unable to access registers on the resetting ASIC.

Fix: Rework GPU reset to use a list of pending reset jobs
such that the first reset job that actually resets the entire
reset domain will cancel all those pending redundant resets.

This is in line with what we already do for redundant TDRs
in scheduler code.


Mhm, why exactly do you need the extra linked list then?

Let's talk about that on our call today.



Going to miss it as you know, and also this is the place to discuss 
technical questions anyway so -


Good point.

It's needed because those other resets are not time out handlers 
that are governed by the scheduler
but rather external resets that are triggered by such clients as 
KFD, RAS and sysfs. Scheduler has no
knowledge of them (and should not have) but they are serialized 
into same wq as the TO handlers
from the scheduler. It just happens that TO triggered reset causes 
in turn another reset (from KFD in
this case) and we want to prevent this second reset from taking 
place just as we want to avoid multiple

TO resets to take place in scheduler code.


Yeah, but why do you need multiple workers?

You have a single worker for the GPU reset not triggered by the 
scheduler in your adev and cancel that at the end of the reset 
procedure.


If anybody thinks it needs to trigger another reset while in reset 
(which is actually a small design bug separately) the reset will 
just be canceled in the same way we cancel the scheduler resets.


Christian.



Had this in mind at first but then I realized that each client (RAS, 
KFD and sysfs) will want to fill his own data for the work (see 
amdgpu_device_gpu_recover) - for XGMI hive each will want to set his 
own adev (which is fine if you set a work per adev as you suggest) 
but also each client might want (they all put NULL there but in 
theory in the future) also set his own bad job value and here you 
might have a collision.


Yeah, but that is intentional. See when we have a job that needs to be 
consumed by the reset handler and not overwritten or something.



I am not sure why this is a requirement, multiple clients can decide 
concurrently to trigger a reset for some reason (possibly independent 
reasons) hence they cannot share same work struct to pass to it their data.






Additional to that keep in mind that you can't allocate any memory 
before or during the GPU reset nor wait for the reset to complete (so 
you can't allocate anything on the stack either).



There is no dynamic allocation here, regarding stack allocations - we do 
it all the time when we call functions, even during GPU resets, how on 
stack allocation of work struct in amdgpu_device_gpu_recover is 
different from any other local variable we allocate in any function we 
call ?


I am also not sure why it's not allowed to wait for reset to complete ? 
Also, see in amdgpu_ras_do_recovery and gpu_recover_get (debugfs) - the 
caller expects the reset to complete before he returns. I can probably 
work around it in RAS code by calling atomic_set(>in_recovery, 0)  
from some callback within actual reset function but regarding sysfs it 
actually expects a result returned indicating whether the call was 
successful or not.


Andrey




I don't think that concept you try here will work.

Regards,
Christian.

Also in general seems to me it's cleaner approach where this logic 
(the work items) are held and handled in reset_domain and are not 
split in each adev or any other entity. We might want in the future 
to even move the scheduler handling into the reset domain, since the reset 
domain is supposed to be a generic thing and not only for AMD.


Andrey






Andrey




Regards,
Christian.



Signed-off-by: Andrey Grodzovsky 
Tested-by: Bai Zoy 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h    | 11 +---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 +++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  3 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 73 
+-

  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  3 +-
  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c  |  7 ++-
  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c  |  7 ++-
  drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c  |  7 ++-
  8 files changed, 104 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h

index 4264abc5604d..99efd8317547 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -109,6 +109,7 @@
  #include "amdgpu_fdinfo.h"
  #include "amdgpu_mca.h"
  #include "amdgpu_ras.h"
+#include "amdgpu_reset.h"
  

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Christian König

Am 05.05.22 um 15:54 schrieb Andrey Grodzovsky:


On 2022-05-05 09:23, Christian König wrote:

Am 05.05.22 um 15:15 schrieb Andrey Grodzovsky:

On 2022-05-05 06:09, Christian König wrote:


Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky:

Problem:
During hive reset caused by a command timing out on a ring,
extra resets are triggered by KFD, which is
unable to access registers on the resetting ASIC.

Fix: Rework GPU reset to use a list of pending reset jobs
such that the first reset job that actually resets the entire
reset domain will cancel all those pending redundant resets.

This is in line with what we already do for redundant TDRs
in scheduler code.


Mhm, why exactly do you need the extra linked list then?

Let's talk about that on our call today.



Going to miss it as you know, and also this is the place to discuss 
technical questions anyway so -


Good point.

It's needed because those other resets are not time out handlers 
that are governed by the scheduler
but rather external resets that are triggered by such clients as 
KFD, RAS and sysfs. Scheduler has no
knowledge of them (and should not have) but they are serialized into 
same wq as the TO handlers
from the scheduler. It just happens that TO triggered reset causes 
in turn another reset (from KFD in
this case) and we want to prevent this second reset from taking 
place just as we want to avoid multiple

TO resets to take place in scheduler code.


Yeah, but why do you need multiple workers?

You have a single worker for the GPU reset not triggered by the 
scheduler in your adev and cancel that at the end of the reset procedure.


If anybody thinks it needs to trigger another reset while in reset 
(which is actually a small design bug separately) the reset will just 
be canceled in the same way we cancel the scheduler resets.


Christian.



Had this in mind at first but then I realized that each client (RAS, 
KFD and sysfs) will want to fill his own data for the work (see 
amdgpu_device_gpu_recover) - for XGMI hive each will want to set his 
own adev (which is fine if you set a work per adev as you suggest) but 
also each client might want (they all put NULL there but in theory in 
the future) also set his own bad job value and here you might have a 
collision.


Yeah, but that is intentional. See when we have a job that needs to be 
consumed by the reset handler and not overwritten or something.


Additional to that keep in mind that you can't allocate any memory 
before or during the GPU reset nor wait for the reset to complete (so 
you can't allocate anything on the stack either).


I don't think that concept you try here will work.

Regards,
Christian.

Also in general seems to me it's cleaner approach where this logic 
(the work items) are held and handled in reset_domain and are not 
split in each adev or any other entity. We might want in the future to 
even move the scheduler handling into the reset domain, since the reset 
domain is supposed to be a generic thing and not only for AMD.


Andrey






Andrey




Regards,
Christian.



Signed-off-by: Andrey Grodzovsky 
Tested-by: Bai Zoy 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h    | 11 +---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 +++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  3 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 73 
+-

  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  3 +-
  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c  |  7 ++-
  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c  |  7 ++-
  drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c  |  7 ++-
  8 files changed, 104 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h

index 4264abc5604d..99efd8317547 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -109,6 +109,7 @@
  #include "amdgpu_fdinfo.h"
  #include "amdgpu_mca.h"
  #include "amdgpu_ras.h"
+#include "amdgpu_reset.h"
    #define MAX_GPU_INSTANCE    16
  @@ -509,16 +510,6 @@ struct amdgpu_allowed_register_entry {
  bool grbm_indexed;
  };
  -enum amd_reset_method {
-    AMD_RESET_METHOD_NONE = -1,
-    AMD_RESET_METHOD_LEGACY = 0,
-    AMD_RESET_METHOD_MODE0,
-    AMD_RESET_METHOD_MODE1,
-    AMD_RESET_METHOD_MODE2,
-    AMD_RESET_METHOD_BACO,
-    AMD_RESET_METHOD_PCI,
-};
-
  struct amdgpu_video_codec_info {
  u32 codec_type;
  u32 max_width;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index e582f1044c0f..7fa82269c30f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5201,6 +5201,12 @@ int amdgpu_device_gpu_recover_imp(struct 
amdgpu_device *adev,

  }
    tmp_vram_lost_counter = 
atomic_read(&((adev)->vram_lost_counter));

+
+    /* Drop all pending resets since we will reset now anyway */
+    tmp_adev = list_first_entry(device_list_handle, struct 
amdgpu_device,

+   

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Andrey Grodzovsky



On 2022-05-05 09:23, Christian König wrote:

Am 05.05.22 um 15:15 schrieb Andrey Grodzovsky:

On 2022-05-05 06:09, Christian König wrote:


Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky:

Problem:
During hive reset caused by a command timing out on a ring,
extra resets are triggered by KFD, which is
unable to access registers on the resetting ASIC.

Fix: Rework GPU reset to use a list of pending reset jobs
such that the first reset job that actually resets the entire
reset domain will cancel all those pending redundant resets.

This is in line with what we already do for redundant TDRs
in scheduler code.


Mhm, why exactly do you need the extra linked list then?

Let's talk about that on our call today.



Going to miss it as you know, and also this is the place to discuss 
technical questions anyway so -


Good point.

It's needed because those other resets are not time out handlers that 
are governed by the scheduler
but rather external resets that are triggered by such clients as KFD, 
RAS and sysfs. Scheduler has no
knowledge of them (and should not have) but they are serialized into 
same wq as the TO handlers
from the scheduler. It just happens that TO triggered reset causes in 
turn another reset (from KFD in
this case) and we want to prevent this second reset from taking place 
just as we want to avoid multiple

TO resets to take place in scheduler code.


Yeah, but why do you need multiple workers?

You have a single worker for the GPU reset not triggered by the 
scheduler in your adev and cancel that at the end of the reset procedure.


If anybody thinks it needs to trigger another reset while in reset 
(which is actually a small design bug separately) the reset will just 
be canceled in the same way we cancel the scheduler resets.


Christian.



Had this in mind at first but then I realized that each client (RAS, KFD 
and sysfs) will want to fill his own data for the work (see 
amdgpu_device_gpu_recover) - for XGMI hive each will want to set his own 
adev (which is fine if you set a work per adev as you suggest) but also 
each client might want (they all put NULL there but in theory in the 
future) also set his own bad job value and here you might have a collision.
Also in general seems to me it's cleaner approach where this logic (the 
work items) are held and handled in reset_domain and are not split in 
each adev or any other entity. We might want in the future to even move 
the scheduler handling into the reset domain, since the reset domain is 
supposed to be a generic thing and not only for AMD.


Andrey






Andrey




Regards,
Christian.



Signed-off-by: Andrey Grodzovsky 
Tested-by: Bai Zoy 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h    | 11 +---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 +++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  3 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 73 
+-

  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  3 +-
  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c  |  7 ++-
  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c  |  7 ++-
  drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c  |  7 ++-
  8 files changed, 104 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h

index 4264abc5604d..99efd8317547 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -109,6 +109,7 @@
  #include "amdgpu_fdinfo.h"
  #include "amdgpu_mca.h"
  #include "amdgpu_ras.h"
+#include "amdgpu_reset.h"
    #define MAX_GPU_INSTANCE    16
  @@ -509,16 +510,6 @@ struct amdgpu_allowed_register_entry {
  bool grbm_indexed;
  };
  -enum amd_reset_method {
-    AMD_RESET_METHOD_NONE = -1,
-    AMD_RESET_METHOD_LEGACY = 0,
-    AMD_RESET_METHOD_MODE0,
-    AMD_RESET_METHOD_MODE1,
-    AMD_RESET_METHOD_MODE2,
-    AMD_RESET_METHOD_BACO,
-    AMD_RESET_METHOD_PCI,
-};
-
  struct amdgpu_video_codec_info {
  u32 codec_type;
  u32 max_width;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index e582f1044c0f..7fa82269c30f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5201,6 +5201,12 @@ int amdgpu_device_gpu_recover_imp(struct 
amdgpu_device *adev,

  }
    tmp_vram_lost_counter = 
atomic_read(&((adev)->vram_lost_counter));

+
+    /* Drop all pending resets since we will reset now anyway */
+    tmp_adev = list_first_entry(device_list_handle, struct 
amdgpu_device,

+    reset_list);
+    amdgpu_reset_pending_list(tmp_adev->reset_domain);
+
  /* Actual ASIC resets if needed.*/
  /* Host driver will handle XGMI hive reset for SRIOV */
  if (amdgpu_sriov_vf(adev)) {
@@ -5296,7 +5302,7 @@ int amdgpu_device_gpu_recover_imp(struct 
amdgpu_device *adev,

  }
    struct amdgpu_recover_work_struct {
-    struct work_struct base;
+    struct amdgpu_reset_work_struct base;
  struct amdgpu_device 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Christian König

Am 05.05.22 um 15:15 schrieb Andrey Grodzovsky:

On 2022-05-05 06:09, Christian König wrote:


Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky:

Problem:
During hive reset caused by a command timing out on a ring,
extra resets are triggered by KFD, which is
unable to access registers on the resetting ASIC.

Fix: Rework GPU reset to use a list of pending reset jobs
such that the first reset job that actually resets the entire
reset domain will cancel all those pending redundant resets.

This is in line with what we already do for redundant TDRs
in scheduler code.


Mhm, why exactly do you need the extra linked list then?

Let's talk about that on our call today.



Going to miss it as you know, and also this is the place to discuss 
technical questions anyway so -


Good point.

It's needed because those other resets are not time out handlers that 
are governed by the scheduler
but rather external resets that are triggered by such clients as KFD, 
RAS and sysfs. Scheduler has no
knowledge of them (and should not have) but they are serialized into 
same wq as the TO handlers
from the scheduler. It just happens that TO triggered reset causes in 
turn another reset (from KFD in
this case) and we want to prevent this second reset from taking place 
just as we want to avoid multiple

TO resets to take place in scheduler code.


Yeah, but why do you need multiple workers?

You have a single worker for the GPU reset not triggered by the 
scheduler in your adev and cancel that at the end of the reset procedure.


If anybody thinks it needs to trigger another reset while in reset 
(which is actually a small design bug separately) the reset will just be 
canceled in the same way we cancel the scheduler resets.


Christian.



Andrey




Regards,
Christian.



Signed-off-by: Andrey Grodzovsky 
Tested-by: Bai Zoy 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h    | 11 +---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 +++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  3 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 73 
+-

  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  3 +-
  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c  |  7 ++-
  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c  |  7 ++-
  drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c  |  7 ++-
  8 files changed, 104 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h

index 4264abc5604d..99efd8317547 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -109,6 +109,7 @@
  #include "amdgpu_fdinfo.h"
  #include "amdgpu_mca.h"
  #include "amdgpu_ras.h"
+#include "amdgpu_reset.h"
    #define MAX_GPU_INSTANCE    16
  @@ -509,16 +510,6 @@ struct amdgpu_allowed_register_entry {
  bool grbm_indexed;
  };
  -enum amd_reset_method {
-    AMD_RESET_METHOD_NONE = -1,
-    AMD_RESET_METHOD_LEGACY = 0,
-    AMD_RESET_METHOD_MODE0,
-    AMD_RESET_METHOD_MODE1,
-    AMD_RESET_METHOD_MODE2,
-    AMD_RESET_METHOD_BACO,
-    AMD_RESET_METHOD_PCI,
-};
-
  struct amdgpu_video_codec_info {
  u32 codec_type;
  u32 max_width;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index e582f1044c0f..7fa82269c30f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5201,6 +5201,12 @@ int amdgpu_device_gpu_recover_imp(struct 
amdgpu_device *adev,

  }
    tmp_vram_lost_counter = 
atomic_read(&((adev)->vram_lost_counter));

+
+    /* Drop all pending resets since we will reset now anyway */
+    tmp_adev = list_first_entry(device_list_handle, struct 
amdgpu_device,

+    reset_list);
+    amdgpu_reset_pending_list(tmp_adev->reset_domain);
+
  /* Actual ASIC resets if needed.*/
  /* Host driver will handle XGMI hive reset for SRIOV */
  if (amdgpu_sriov_vf(adev)) {
@@ -5296,7 +5302,7 @@ int amdgpu_device_gpu_recover_imp(struct 
amdgpu_device *adev,

  }
    struct amdgpu_recover_work_struct {
-    struct work_struct base;
+    struct amdgpu_reset_work_struct base;
  struct amdgpu_device *adev;
  struct amdgpu_job *job;
  int ret;
@@ -5304,7 +5310,7 @@ struct amdgpu_recover_work_struct {
    static void amdgpu_device_queue_gpu_recover_work(struct 
work_struct *work)

  {
-    struct amdgpu_recover_work_struct *recover_work = 
container_of(work, struct amdgpu_recover_work_struct, base);
+    struct amdgpu_recover_work_struct *recover_work = 
container_of(work, struct amdgpu_recover_work_struct, base.base.work);
    recover_work->ret = 
amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);

  }
@@ -5316,12 +5322,15 @@ int amdgpu_device_gpu_recover(struct 
amdgpu_device *adev,

  {
  struct amdgpu_recover_work_struct work = {.adev = adev, .job = 
job};

  -    INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
+    INIT_DELAYED_WORK(, 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Andrey Grodzovsky

On 2022-05-05 06:09, Christian König wrote:


Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky:

Problem:
During hive reset caused by a command timing out on a ring,
extra resets are triggered by KFD, which is unable to access
registers on the resetting ASIC.

Fix: Rework GPU reset to use a list of pending reset jobs
such that the first reset job that actually resets the entire
reset domain will cancel all those pending redundant resets.

This is in line with what we already do for redundant TDRs
in scheduler code.


Mhm, why exactly do you need the extra linked list then?

Let's talk about that on our call today.



Going to miss it as you know, and also this is the place to discuss 
technical questions anyway so -
It's needed because those other resets are not timeout handlers governed 
by the scheduler, but rather external resets triggered by clients such as 
KFD, RAS and sysfs. The scheduler has no knowledge of them (and should not 
have), but they are serialized into the same wq as the TO handlers from 
the scheduler. It just happens that a TO-triggered reset causes in turn 
another reset (from KFD in this case), and we want to prevent this second 
reset from taking place, just as we want to avoid multiple TO resets 
taking place in the scheduler code.

Andrey




Regards,
Christian.



Signed-off-by: Andrey Grodzovsky 
Tested-by: Bai Zoy 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h    | 11 +---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 +++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  3 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 73 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  3 +-
  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c  |  7 ++-
  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c  |  7 ++-
  drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c  |  7 ++-
  8 files changed, 104 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h

index 4264abc5604d..99efd8317547 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -109,6 +109,7 @@
  #include "amdgpu_fdinfo.h"
  #include "amdgpu_mca.h"
  #include "amdgpu_ras.h"
+#include "amdgpu_reset.h"
    #define MAX_GPU_INSTANCE    16
  @@ -509,16 +510,6 @@ struct amdgpu_allowed_register_entry {
  bool grbm_indexed;
  };
  -enum amd_reset_method {
-    AMD_RESET_METHOD_NONE = -1,
-    AMD_RESET_METHOD_LEGACY = 0,
-    AMD_RESET_METHOD_MODE0,
-    AMD_RESET_METHOD_MODE1,
-    AMD_RESET_METHOD_MODE2,
-    AMD_RESET_METHOD_BACO,
-    AMD_RESET_METHOD_PCI,
-};
-
  struct amdgpu_video_codec_info {
  u32 codec_type;
  u32 max_width;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index e582f1044c0f..7fa82269c30f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5201,6 +5201,12 @@ int amdgpu_device_gpu_recover_imp(struct 
amdgpu_device *adev,

  }
    tmp_vram_lost_counter = 
atomic_read(&((adev)->vram_lost_counter));

+
+    /* Drop all pending resets since we will reset now anyway */
+    tmp_adev = list_first_entry(device_list_handle, struct 
amdgpu_device,

+    reset_list);
+    amdgpu_reset_pending_list(tmp_adev->reset_domain);
+
  /* Actual ASIC resets if needed.*/
  /* Host driver will handle XGMI hive reset for SRIOV */
  if (amdgpu_sriov_vf(adev)) {
@@ -5296,7 +5302,7 @@ int amdgpu_device_gpu_recover_imp(struct 
amdgpu_device *adev,

  }
    struct amdgpu_recover_work_struct {
-    struct work_struct base;
+    struct amdgpu_reset_work_struct base;
  struct amdgpu_device *adev;
  struct amdgpu_job *job;
  int ret;
@@ -5304,7 +5310,7 @@ struct amdgpu_recover_work_struct {
    static void amdgpu_device_queue_gpu_recover_work(struct 
work_struct *work)

  {
-    struct amdgpu_recover_work_struct *recover_work = 
container_of(work, struct amdgpu_recover_work_struct, base);
+    struct amdgpu_recover_work_struct *recover_work = 
container_of(work, struct amdgpu_recover_work_struct, base.base.work);
    recover_work->ret = 
amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);

  }
@@ -5316,12 +5322,15 @@ int amdgpu_device_gpu_recover(struct 
amdgpu_device *adev,

  {
  struct amdgpu_recover_work_struct work = {.adev = adev, .job = 
job};

  -    INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
+    INIT_DELAYED_WORK(&work.base.base, amdgpu_device_queue_gpu_recover_work);
+    INIT_LIST_HEAD();
    if (!amdgpu_reset_domain_schedule(adev->reset_domain, &work.base))
  return -EAGAIN;
  -    flush_work(&work.base);
+    flush_delayed_work(&work.base.base);
+
+    amdgpu_reset_domain_del_pendning_work(adev->reset_domain, &work.base);
    return work.ret;
  }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c

index c80af0889773..ffddd419c351 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ 

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Christian König

Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky:

Problem:
During hive reset caused by a command timing out on a ring,
extra resets are triggered by KFD, which is unable to access
registers on the resetting ASIC.

Fix: Rework GPU reset to use a list of pending reset jobs
such that the first reset job that actually resets the entire
reset domain will cancel all those pending redundant resets.

This is in line with what we already do for redundant TDRs
in scheduler code.


Mhm, why exactly do you need the extra linked list then?

Let's talk about that on our call today.

Regards,
Christian.



Signed-off-by: Andrey Grodzovsky 
Tested-by: Bai Zoy 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h| 11 +---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 +++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  3 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 73 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  3 +-
  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c  |  7 ++-
  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c  |  7 ++-
  drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c  |  7 ++-
  8 files changed, 104 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 4264abc5604d..99efd8317547 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -109,6 +109,7 @@
  #include "amdgpu_fdinfo.h"
  #include "amdgpu_mca.h"
  #include "amdgpu_ras.h"
+#include "amdgpu_reset.h"
  
  #define MAX_GPU_INSTANCE		16
  
@@ -509,16 +510,6 @@ struct amdgpu_allowed_register_entry {

bool grbm_indexed;
  };
  
-enum amd_reset_method {

-   AMD_RESET_METHOD_NONE = -1,
-   AMD_RESET_METHOD_LEGACY = 0,
-   AMD_RESET_METHOD_MODE0,
-   AMD_RESET_METHOD_MODE1,
-   AMD_RESET_METHOD_MODE2,
-   AMD_RESET_METHOD_BACO,
-   AMD_RESET_METHOD_PCI,
-};
-
  struct amdgpu_video_codec_info {
u32 codec_type;
u32 max_width;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e582f1044c0f..7fa82269c30f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5201,6 +5201,12 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device 
*adev,
}
  
  	tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));

+
+   /* Drop all pending resets since we will reset now anyway */
+   tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
+   reset_list);
+   amdgpu_reset_pending_list(tmp_adev->reset_domain);
+
/* Actual ASIC resets if needed.*/
/* Host driver will handle XGMI hive reset for SRIOV */
if (amdgpu_sriov_vf(adev)) {
@@ -5296,7 +5302,7 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device 
*adev,
  }
  
  struct amdgpu_recover_work_struct {

-   struct work_struct base;
+   struct amdgpu_reset_work_struct base;
struct amdgpu_device *adev;
struct amdgpu_job *job;
int ret;
@@ -5304,7 +5310,7 @@ struct amdgpu_recover_work_struct {
  
  static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)

  {
-   struct amdgpu_recover_work_struct *recover_work = container_of(work, 
struct amdgpu_recover_work_struct, base);
+   struct amdgpu_recover_work_struct *recover_work = container_of(work, 
struct amdgpu_recover_work_struct, base.base.work);
  
  	recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);

  }
@@ -5316,12 +5322,15 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
*adev,
  {
struct amdgpu_recover_work_struct work = {.adev = adev, .job = job};
  
-	INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
+	INIT_DELAYED_WORK(&work.base.base, amdgpu_device_queue_gpu_recover_work);
+	INIT_LIST_HEAD();
  
  	if (!amdgpu_reset_domain_schedule(adev->reset_domain, &work.base))
 		return -EAGAIN;
  
-	flush_work(&work.base);
+	flush_delayed_work(&work.base.base);
+
+	amdgpu_reset_domain_del_pendning_work(adev->reset_domain, &work.base);
  
  	return work.ret;

  }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index c80af0889773..ffddd419c351 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -134,6 +134,9 @@ struct amdgpu_reset_domain 
*amdgpu_reset_create_reset_domain(enum amdgpu_reset_d
 	atomic_set(&reset_domain->in_gpu_reset, 0);
 	init_rwsem(&reset_domain->sem);
   
+	INIT_LIST_HEAD(&reset_domain->pending_works);
+	mutex_init(&reset_domain->reset_lock);
+
return reset_domain;
  }
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h

index 1949dbe28a86..863ec5720fc1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -24,7 +24,18 @@
  #ifndef __AMDGPU_RESET_H__
  #define 

[PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-04 Thread Andrey Grodzovsky
Problem:
During hive reset caused by a command timing out on a ring,
extra resets are triggered by KFD, which is unable to access
registers on the resetting ASIC.

Fix: Rework GPU reset to use a list of pending reset jobs
such that the first reset job that actually resets the entire
reset domain will cancel all those pending redundant resets.

This is in line with what we already do for redundant TDRs
in scheduler code.

Signed-off-by: Andrey Grodzovsky 
Tested-by: Bai Zoy 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h| 11 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 73 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  3 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c  |  7 ++-
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c  |  7 ++-
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c  |  7 ++-
 8 files changed, 104 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 4264abc5604d..99efd8317547 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -109,6 +109,7 @@
 #include "amdgpu_fdinfo.h"
 #include "amdgpu_mca.h"
 #include "amdgpu_ras.h"
+#include "amdgpu_reset.h"
 
 #define MAX_GPU_INSTANCE   16
 
@@ -509,16 +510,6 @@ struct amdgpu_allowed_register_entry {
bool grbm_indexed;
 };
 
-enum amd_reset_method {
-   AMD_RESET_METHOD_NONE = -1,
-   AMD_RESET_METHOD_LEGACY = 0,
-   AMD_RESET_METHOD_MODE0,
-   AMD_RESET_METHOD_MODE1,
-   AMD_RESET_METHOD_MODE2,
-   AMD_RESET_METHOD_BACO,
-   AMD_RESET_METHOD_PCI,
-};
-
 struct amdgpu_video_codec_info {
u32 codec_type;
u32 max_width;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e582f1044c0f..7fa82269c30f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5201,6 +5201,12 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device 
*adev,
}
 
tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
+
+   /* Drop all pending resets since we will reset now anyway */
+   tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
+   reset_list);
+   amdgpu_reset_pending_list(tmp_adev->reset_domain);
+
/* Actual ASIC resets if needed.*/
/* Host driver will handle XGMI hive reset for SRIOV */
if (amdgpu_sriov_vf(adev)) {
@@ -5296,7 +5302,7 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device 
*adev,
 }
 
 struct amdgpu_recover_work_struct {
-   struct work_struct base;
+   struct amdgpu_reset_work_struct base;
struct amdgpu_device *adev;
struct amdgpu_job *job;
int ret;
@@ -5304,7 +5310,7 @@ struct amdgpu_recover_work_struct {
 
 static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
 {
-   struct amdgpu_recover_work_struct *recover_work = container_of(work, 
struct amdgpu_recover_work_struct, base);
+   struct amdgpu_recover_work_struct *recover_work = container_of(work, 
struct amdgpu_recover_work_struct, base.base.work);
 
recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, 
recover_work->job);
 }
@@ -5316,12 +5322,15 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
*adev,
 {
struct amdgpu_recover_work_struct work = {.adev = adev, .job = job};
 
-   INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
+   INIT_DELAYED_WORK(&work.base.base, amdgpu_device_queue_gpu_recover_work);
+   INIT_LIST_HEAD();
 
	if (!amdgpu_reset_domain_schedule(adev->reset_domain, &work.base))
		return -EAGAIN;
 
-   flush_work(&work.base);
+   flush_delayed_work(&work.base.base);
+
+   amdgpu_reset_domain_del_pendning_work(adev->reset_domain, &work.base);
 
return work.ret;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index c80af0889773..ffddd419c351 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -134,6 +134,9 @@ struct amdgpu_reset_domain 
*amdgpu_reset_create_reset_domain(enum amdgpu_reset_d
	atomic_set(&reset_domain->in_gpu_reset, 0);
	init_rwsem(&reset_domain->sem);
 
+   INIT_LIST_HEAD(&reset_domain->pending_works);
+   mutex_init(&reset_domain->reset_lock);
+
return reset_domain;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index 1949dbe28a86..863ec5720fc1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -24,7 +24,18 @@
 #ifndef __AMDGPU_RESET_H__
 #define __AMDGPU_RESET_H__
 
-#include "amdgpu.h"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct amdgpu_device;
+struct amdgpu_job;
+struct