amdgpu: add concurrent baco reset support for XGMI

Ma, Le Wed, 11 Dec 2019 04:18:31 -0800

[AMD Official Use Only - Internal Distribution Only]

I ran BACO for about 10 loops with your new patches and the result looks 
positive: the enter/exit BACO message failure did not occur again.


The time interval between BACO entries or exits in my environment was mostly 
less than 10 us (max 36 us, min 2 us). I think it's safe enough according to 
the sample data we collected on both sides.

It also no longer looks necessary to use system_highpri_wq, because we require 
all the nodes to enter or exit at the same time but do not mind how long the 
interval between enter and exit is. system_unbound_wq satisfies our requirement 
here since it wakes up different CPUs to work at the same time.

Regards,
Ma Le

From: Grodzovsky, Andrey <andrey.grodzov...@amd.com>
Sent: Wednesday, December 11, 2019 3:56 AM
To: Ma, Le <le...@amd.com>; amd-gfx@lists.freedesktop.org; Zhou1, Tao 
<tao.zh...@amd.com>; Deucher, Alexander <alexander.deuc...@amd.com>; Li, Dennis 
<dennis...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>
Cc: Chen, Guchun <guchun.c...@amd.com>
Subject: Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset support for 
XGMI


I switched the workqueue we were using for xgmi_reset_work from 
system_highpri_wq to system_unbound_wq - the difference is that workers 
servicing system_unbound_wq are not bound to a specific CPU, so the reset jobs 
for each XGMI node get scheduled on different CPUs, while system_highpri_wq is 
a bound workqueue. I traced it as below for 10 consecutive runs and didn't see 
errors any more. Also, the time difference between BACO entries or exits was 
never more than around 2 us.

Please give this updated patchset a try.

   kworker/u16:2-57    [004] ...1   243.276312: trace_code: func: vega20_baco_set_state, line 91 <----- Before BACO enter
           <...>-60    [007] ...1   243.276312: trace_code: func: vega20_baco_set_state, line 91 <----- Before BACO enter
   kworker/u16:2-57    [004] ...1   243.276384: trace_code: func: vega20_baco_set_state, line 105 <----- After BACO enter done
           <...>-60    [007] ...1   243.276392: trace_code: func: vega20_baco_set_state, line 105 <----- After BACO enter done
   kworker/u16:3-60    [007] ...1   243.276397: trace_code: func: vega20_baco_set_state, line 108 <----- Before BACO exit
   kworker/u16:2-57    [004] ...1   243.276399: trace_code: func: vega20_baco_set_state, line 108 <----- Before BACO exit
   kworker/u16:3-60    [007] ...1   243.288067: trace_code: func: vega20_baco_set_state, line 114 <----- After BACO exit done
   kworker/u16:2-57    [004] ...1   243.295624: trace_code: func: vega20_baco_set_state, line 114 <----- After BACO exit done
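
The skew between the two nodes can be read off the trace above with a few lines of Python. This is only a sketch for checking the numbers: the `TRACE` string is the trace rows with wrapped lines rejoined and the annotations dropped, and `skew_us` and `EVENT` are hypothetical names, not part of the patchset.

```python
import re
from collections import defaultdict

# Rows copied from the trace above, with wrapped lines rejoined.
TRACE = """\
kworker/u16:2-57 [004] ...1 243.276312: trace_code: func: vega20_baco_set_state, line 91
<...>-60 [007] ...1 243.276312: trace_code: func: vega20_baco_set_state, line 91
kworker/u16:2-57 [004] ...1 243.276384: trace_code: func: vega20_baco_set_state, line 105
<...>-60 [007] ...1 243.276392: trace_code: func: vega20_baco_set_state, line 105
kworker/u16:3-60 [007] ...1 243.276397: trace_code: func: vega20_baco_set_state, line 108
kworker/u16:2-57 [004] ...1 243.276399: trace_code: func: vega20_baco_set_state, line 108
kworker/u16:3-60 [007] ...1 243.288067: trace_code: func: vega20_baco_set_state, line 114
kworker/u16:2-57 [004] ...1 243.295624: trace_code: func: vega20_baco_set_state, line 114
"""

# Capture seconds, microseconds, and the source line number of each event.
EVENT = re.compile(r"(\d+)\.(\d{6}): trace_code: .*line (\d+)")


def skew_us(trace):
    """Return {source line: max - min timestamp in microseconds}."""
    stamps = defaultdict(list)
    for row in trace.splitlines():
        m = EVENT.search(row)
        if m:
            sec, usec, line = map(int, m.groups())
            # Integer microseconds avoid floating-point rounding.
            stamps[line].append(sec * 1_000_000 + usec)
    return {line: max(ts) - min(ts) for line, ts in stamps.items()}


for line, skew in sorted(skew_us(TRACE).items()):
    print(f"line {line}: {skew} us between the two nodes")
```

For this run the "before enter" (line 91) and "before exit" (line 108) events are 0 us and 2 us apart; the completion events (lines 105 and 114) are 8 us and 7557 us apart.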

Andrey
On 12/9/19 9:45 PM, Ma, Le wrote:

I'm fine with your solution if the synchronization time interval satisfies the 
BACO requirements and the loop test passes on an XGMI system.

Regards,
Ma Le

From: Grodzovsky, Andrey <andrey.grodzov...@amd.com>
Sent: Monday, December 9, 2019 11:52 PM
To: Ma, Le <le...@amd.com>; amd-gfx@lists.freedesktop.org; Zhou1, Tao 
<tao.zh...@amd.com>; Deucher, Alexander <alexander.deuc...@amd.com>; Li, Dennis 
<dennis...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>
Cc: Chen, Guchun <guchun.c...@amd.com>
Subject: Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset support for 
XGMI


Thanks a lot Ma for trying - I think I have to have my own system to debug this, 
so I will keep trying to enable XGMI - I still think this is the right and 
generic solution for multi-node reset synchronization, and in fact the 
barrier should also be used for synchronizing PSP mode 1 XGMI reset too.
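
As a rough userspace illustration of the barrier idea (not the attached kernel patch), the sketch below uses Python's `threading.Barrier` so that no node starts a phase until every node has finished the previous one. The names `reset_hive` and `node_reset` and the event log are made up for the example; a real implementation would live in the driver's reset path.

```python
import threading


def reset_hive(num_nodes):
    """Run a two-phase 'enter then exit' sequence on num_nodes threads."""
    barrier = threading.Barrier(num_nodes)
    log = []
    log_lock = threading.Lock()

    def node_reset(node):
        # Phase 1: wait until every node is ready, then "enter BACO".
        barrier.wait()
        with log_lock:
            log.append(("enter", node))
        # Phase 2: wait until every node has entered, then "exit BACO".
        barrier.wait()
        with log_lock:
            log.append(("exit", node))

    threads = [threading.Thread(target=node_reset, args=(n,))
               for n in range(num_nodes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


events = reset_hive(2)  # 2 nodes, like the 2P platform in this thread
print(events)
```

The order of nodes within a phase may vary between runs, but every "enter" is logged before any "exit", which is the property the barrier is meant to give.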

Andrey
On 12/9/19 6:34 AM, Ma, Le wrote:

Hi Andrey,

I tried your patches on my 2P XGMI platform. BACO works most of the time, but 
I randomly get the following error:
[ 1701.542298] amdgpu: [powerplay] Failed to send message 0x25, response 0x0

This error usually means a sync issue exists in the XGMI BACO case. Feel free 
to debug your patches on my XGMI platform.

Regards,
Ma Le

From: Grodzovsky, Andrey <andrey.grodzov...@amd.com>
Sent: Saturday, December 7, 2019 5:51 AM
To: Ma, Le <le...@amd.com>; amd-gfx@lists.freedesktop.org; Zhou1, Tao 
<tao.zh...@amd.com>; Deucher, Alexander <alexander.deuc...@amd.com>; Li, Dennis 
<dennis...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>
Cc: Chen, Guchun <guchun.c...@amd.com>
Subject: Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset support for 
XGMI


Hey Ma, I've attached a solution - it's only compile-tested, as I still can't 
get my XGMI setup to work (with the bridge connected, only one device is 
visible to the system while the other is not). Please try it on your system if 
you have a chance.

Andrey
On 12/4/19 10:14 PM, Ma, Le wrote:

AFAIK it's enough for even a single node in the hive to fail to enter the 
BACO state on time to fail the entire hive reset procedure, no?
[Le]: Yeah, agreed. My thinking is that making all nodes enter BACO 
simultaneously reduces the risk of a node failing to enter/exit BACO. For 
example, in an XGMI hive with 8 nodes, the total time for 8 nodes to 
enter/exit BACO on 8 CPUs is less than the time for 8 nodes to enter BACO 
serially and exit BACO serially on one CPU with yield capability. The BACO 
feature itself is usually strict about this interval. Anyway, we need more 
loop testing later on whichever method we choose.

Anyway - I see our discussion blocks your entire patch set - I think you can 
go ahead and commit it your way (I think you got an RB from Hawking), and I 
will then look and see if I can implement my method; if it works, I will just 
revert your patch.

[Le]: OK, fine.

Andrey

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
