[gem5-users] Re: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

2022-04-06 Thread David Fong via gem5-users
[truncated stats.txt histogram omitted: delay distribution for stores (Unspecified)]

Thanks,

David


[quoted earlier messages trimmed; see the individual archive entries below]

[gem5-users] Re: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

2022-04-06 Thread Bharadwaj, Srikant via gem5-users
[preview contains only quoted stats output and earlier messages; trimmed. See the individual archive entries below.]

[gem5-users] Re: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

2022-04-06 Thread David Fong via gem5-users
[duplicate stats output and quoted messages trimmed; see the 2022-04-06 entry above and the individual archive entries below]

[gem5-users] Re: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

2022-04-01 Thread David Fong via gem5-users
Thanks, Srikant, for your reply.
In most of the tests, the value of storeLatencyDist::mean improved (decreased) with the reduced latency settings.
A few tests showed slightly increased latency, though I would expect all tests to show improvement.

Is there an explanation for that?
Are there inaccuracies in the model or driver?

David


[quoted earlier messages trimmed; see the individual archive entries below]

[gem5-users] Re: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

2022-03-30 Thread Bharadwaj, Srikant via gem5-users
Hi David,
loadLatencyDist and storeLatencyDist are good stats for looking at the average
latency experienced by GPU loads and stores, respectively.

Thanks,
Srikant
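
For example, a quick way to pull those means out of stats.txt (an illustrative Python snippet; "m5out/stats.txt" is gem5's default output location, and the exact stat names depend on your configuration):

  import re

  # Print the mean of each load/store/all latency distribution in stats.txt.
  pattern = re.compile(r"(load|store|all)LatencyDist::mean")
  with open("m5out/stats.txt") as stats:
      for line in stats:
          if pattern.search(line):
              print(line.strip())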

[quoted earlier messages trimmed; see the individual archive entries below]

[gem5-users] Re: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

2022-03-30 Thread David Fong via gem5-users
Hi,

Matt P has not replied in over a week and may be on vacation.
Can anyone else answer my question about which stats to examine for reduced latency in stats.txt?

Thanks,

David




[quoted earlier messages trimmed; see the individual archive entries below]

[gem5-users] Re: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

2022-03-23 Thread David Fong via gem5-users
Hi Matt P,

Any feedback on my question below regarding which stats (in stats.txt) to check for
overall improvement due to reduced latency?

Thanks,

David

[quoted earlier messages trimmed; see the individual archive entries below]

[gem5-users] Re: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

2022-03-21 Thread David Fong via gem5-users
Hi Matt P,

When I tried
--reg-alloc-policy=dynamic
a few runs did not improve and in fact got worse, so I will not use this option for now.
Maybe the driver is not optimized for this release.

I did update my runs to use

--gpu-to-dir-latency 100  (instead of 120)
--TCC_latency 12 (instead of 16)
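
For reference, the full command looks roughly like this (a sketch only: gem5.opt under build/GCN3_X86 and configs/example/apu_se.py are the usual GPU-model paths, while -n and the benchmark command are placeholders for my setup):

  build/GCN3_X86/gem5.opt configs/example/apu_se.py \
      -n 3 \
      --gpu-to-dir-latency 100 --TCC_latency 12 \
      -c <benchmark-binary>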

Some runs improved and some regressed, but overall the results were positive.

To determine the improvement, I used stats.txt and picked "allLatencyDist".
I was told not to use the individual per-CU latencies since they are too focused on a
single CU; one should look at the big picture.

system.cpu3.allLatencyDist::mean 91121342.881356

I chose "allLatencyDist::mean" because its percentage change was similar to that of
"storeLatency" and "loadLatency".
The simulations did not finish sooner even with the shorter latencies, so I decided to
look at the overall latency instead of simulated time.

Which stats do you think should show overall improvement?

Thanks,

David


[quoted earlier messages trimmed; see the individual archive entries below]

[gem5-users] Re: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

2022-03-17 Thread Poremba, Matthew via gem5-users
[AMD Official Use Only]

These would be valid for both (APU and dGPU), as both use the same cache protocol files.
I'm not very familiar with how the dGPU is hacked up in SE mode to look like a
dGPU...


-Matt

[quoted earlier messages trimmed; see the individual archive entries below]

[gem5-users] Re: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

2022-03-17 Thread David Fong via gem5-users
Hi Matt P,

Thanks for the tip on latency parameters.

Are these parameters valid only for a dGPU with VRAM, or do they apply to both dGPU
and APU?

David

[quoted earlier messages trimmed; see the individual archive entries below]

[gem5-users] Re: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

2022-03-17 Thread Poremba, Matthew via gem5-users
[AMD Official Use Only]

Hi David,


I don't think these are the parameters you want to be changing if you are
trying to change the VRAM memory latency, which it seems you are, based on
the GDDR5 comment.  Those parameters are for the latency between the CUs seeing a
memory request and the request leaving the global memory pipeline, I believe.
It doesn't really have anything to do with the interconnect or the latency to VRAM
memory.

I think the parameters you probably want are the latencies defined in the 
GPU_VIPER slicc files:

  *   l2_request_latency / l2_response_latency in GPU_VIPER-TCC.sm

It looks like in configs/ruby/GPU_VIPER.py there are some command line 
parameters for this which correspond to:

  *   --gpu-to-dir-latency / --TCC_latency
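
(Roughly, the wiring in configs/ruby/GPU_VIPER.py looks like the sketch below; treat it as illustrative, since the exact code varies by gem5 version:)

  # sketch: the command line options above feed the TCC controller's
  # SLICC latencies when the Ruby system is built
  tcc_cntrl = TCCCntrl(l2_request_latency = options.gpu_to_dir_latency,
                       l2_response_latency = options.TCC_latency)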


-Matt

[quoted earlier messages trimmed; see the individual archive entries below]

[gem5-users] Re: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

2022-03-16 Thread Matt Sinclair via gem5-users
Matt P or Srikant: can you please help David with the latency question?  You 
know the answers better than I do here.

Matt

[quoted earlier messages trimmed; see the individual archive entries below]

[gem5-users] Re: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

2022-03-16 Thread David Fong via gem5-users
Hi Matt S,

Thanks again for your quick reply with useful information.
I will rerun with --reg-alloc-policy=dynamic
in my mini regression to see if it makes a difference.

As for LRN, I won't make modifications to lrn_config.dnnmark
unless it's required to run additional DNN tests.
The 4 tests: test_fwd_softmax, test_bwd_softmax, test_fwd_pool, and
test_bwd_bn are good enough for now.

For Matt S and Matt P,
Are the "mem_req_latency" and "mem_resp_latency" parameters valid for
both the APU (Carrizo) and the GPU (VEGA)?
gem5/src/gpu-compute/GPU.py
mem_req_latency = Param.Int(40, "Latency for request from the cu to ruby. "\
                                "Represents the pipeline to reach the TCP "\
                                "and specified in GPU clock cycles")
mem_resp_latency = Param.Int(40, "Latency for responses from ruby to the "\
                                 "cu. Represents the pipeline between the "\
                                 "TCP and cu as well as TCP data array "\
                                 "access. Specified in GPU clock cycles")
It seems to me that the GPU (VEGA), with dedicated memory (GDDR5), should be
using a different parameter for its memory access latencies.
My company's IP could be used to reduce interconnect latencies for the APU and
GPU, and we would like to quantify this at the system level with benchmarks.
We would like to determine whether the GPU can get a performance boost from reduced
memory access latencies.
Please confirm which memory latency parameters to modify for the GPU
(VEGA).

Thanks,

David


From: Matt Sinclair 
Sent: Tuesday, March 15, 2022 1:08 PM
To: David Fong ; gem5 users mailing list 

Cc: Kyle Roarty ; Poremba, Matthew 
Subject: RE: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

Hi David,

The dynamic register allocation policy allows the GPU to schedule as many 
wavefronts as there is register space on a CU.  By default, the original 
register allocator released with this GPU model ("simple") only allowed 1 
wavefront per CU at a time because the publicly available dependence modeling 
was fairly primitive.  However, this was not very realistic relative to how a 
real GPU performs, so my group has added better dependence tracking support 
(more could probably still be done, but it reduced stalls by up to 42% relative 
to simple) and a register allocation scheme that allows multiple wavefronts to 
run concurrently per CU ("dynamic").

By default, the GPU model assumes that the simple policy is used unless 
otherwise specified.  I have a patch in progress to change that though: 
https://gem5-review.googlesource.com/c/public/gem5/+/57537.

Regardless, if applications are failing with the simple register allocation 
scheme, I wouldn't expect a more complex scheme to fix the issue.  But I do 
strongly recommend you use the dynamic policy for all experiments - otherwise 
you are using a very simple, less realistic GPU model.
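As a sketch, the flag simply gets added to the usual apu_se.py invocation
(the paths below are the ones from your earlier command; the mem-size and CU
count are placeholders, so substitute your own):

    gem5/build/GCN3_X86/gem5.opt gem5/configs/example/apu_se.py \
        --reg-alloc-policy=dynamic \
        --mem-size 512GB --num-compute-units 4 -n3 \
        --benchmark-root=gem5/gem5-resources/src/gpu/DNNMark/build/benchmarks/test_fwd_lrn \
        -c dnnmark_test_fwd_lrn \
        --options="-config gem5/gem5-resources/src/gpu/DNNMark/config_example/lrn_config.dnnmark -mmap gem5/gem5-resources/src/gpu/DNNMark/mmap.bin"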

Setting all of that aside, I looked up the perror message you sent last night 
and it appears that happens when your physical machine has run out of memory 
(which means we can't do much to fix gem5, since the machine itself wouldn't 
allocate as much memory as you requested).  So, if you want to run LRN and 
can't run on a machine with more memory, one thing you could do is change the 
LRN config file to use smaller NCHW values (e.g., reduce the batch size, N, 
from 100 to something smaller that fits on your machine): 
https://gem5.googlesource.com/public/gem5-resources/+/refs/heads/develop/src/gpu/DNNMark/config_example/lrn_config.dnnmark#6.
  If you do this though, you will likely need to re-run the generate_cachefile 
to generate the MIOpen binaries for this different sized LRN.

Hope this helps,
Matt

From: David Fong mailto:da...@chronostech.com>>
Sent: Tuesday, March 15, 2022 2:58 PM
To: Matt Sinclair mailto:sincl...@cs.wisc.edu>>; gem5 
users mailing list mailto:gem5-users@gem5.org>>
Cc: Kyle Roarty mailto:kroa...@wisc.edu>>; Poremba, Matthew 
mailto:matthew.pore...@amd.com>>
Subject: RE: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

Hi Matt S.,

Thanks for the detailed reply.

I looked at the link you sent me for the weekly run.

I see an additional parameter which I didn't use:

--reg-alloc-policy=dynamic

What does this do?

I was able to run the two other tests you use in your weekly runs
(test_fwd_pool and test_bwd_bn) for CUs=4.

[gem5-users] Re: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

2022-03-15 Thread Matt Sinclair via gem5-users
Hi David,

The dynamic register allocation policy allows the GPU to schedule as many
wavefronts as there is register space for on a CU.  By default, the original
register allocator released with this GPU model ("simple") only allowed 1 
wavefront per CU at a time because the publicly available dependence modeling 
was fairly primitive.  However, this was not very realistic relative to how a 
real GPU performs, so my group has added better dependence tracking support 
(more could probably still be done, but it reduced stalls by up to 42% relative 
to simple) and a register allocation scheme that allows multiple wavefronts to 
run concurrently per CU ("dynamic").

By default, the GPU model assumes that the simple policy is used unless 
otherwise specified.  I have a patch in progress to change that though: 
https://gem5-review.googlesource.com/c/public/gem5/+/57537.

Regardless, if applications are failing with the simple register allocation 
scheme, I wouldn't expect a more complex scheme to fix the issue.  But I do 
strongly recommend you use the dynamic policy for all experiments - otherwise 
you are using a very simple, less realistic GPU model.

Setting all of that aside, I looked up the perror message you sent last night 
and it appears that happens when your physical machine has run out of memory 
(which means we can't do much to fix gem5, since the machine itself wouldn't 
allocate as much memory as you requested).  So, if you want to run LRN and 
can't run on a machine with more memory, one thing you could do is change the 
LRN config file to use smaller NCHW values (e.g., reduce the batch size, N, 
from 100 to something smaller that fits on your machine): 
https://gem5.googlesource.com/public/gem5-resources/+/refs/heads/develop/src/gpu/DNNMark/config_example/lrn_config.dnnmark#6.
  If you do this though, you will likely need to re-run the generate_cachefile 
to generate the MIOpen binaries for this different sized LRN.

Hope this helps,
Matt

From: David Fong 
Sent: Tuesday, March 15, 2022 2:58 PM
To: Matt Sinclair ; gem5 users mailing list 

Cc: Kyle Roarty ; Poremba, Matthew 
Subject: RE: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

Hi Matt S.,

Thanks for the detailed reply.

I looked at the link you sent me for the weekly run.

I see an additional parameter which I didn't use:

--reg-alloc-policy=dynamic

What does this do?

I was able to run the two other tests you use in your weekly runs
(test_fwd_pool and test_bwd_bn) for CUs=4.

David


From: Matt Sinclair mailto:sincl...@cs.wisc.edu>>
Sent: Monday, March 14, 2022 7:41 PM
To: gem5 users mailing list mailto:gem5-users@gem5.org>>
Cc: David Fong mailto:da...@chronostech.com>>; Kyle 
Roarty mailto:kroa...@wisc.edu>>; Poremba, Matthew 
mailto:matthew.pore...@amd.com>>
Subject: RE: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

Hi David,

I have not seen this mmap error before, and my initial guess was the mmap error 
is happening because you are trying to allocate more memory than we created 
when mmap'ing the inputs for the applications (we do this to speed up SE mode, 
because otherwise initializing arrays can take several hours).  However, the 
fact that it is failing in physical.cc and not in the application itself is 
throwing me off there.  Looking at where the failure is occurring, it seems the 
backing store code itself is failing here (from such a large allocation).  
Since the failure is with a C++ mmap call itself, that is perhaps more 
problematic - is "Cannot allocate memory" the failure from the perror() call on 
the line above the fatal() print?
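A few host-side checks can confirm that theory (standard Linux utilities, not
gem5 tools; mmap() fails with ENOMEM, i.e. "Cannot allocate memory", when the
host cannot back the requested reservation):

    free -g                             # RAM and swap actually available
    cat /proc/sys/vm/overcommit_memory  # 2 = strict accounting; very large
                                        #     reservations fail immediately
    ulimit -v                           # per-process virtual-memory cap (KiB)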

Regarding the other question, and the failures more generally: we have never 
tested with > 64 CUs before, so certainly you are stressing the system and 
encountering different kinds of failures than we have seen previously.

In terms of applications, I had thought most/all of them passed previously, but 
we do not test each and every one all the time because this would make our 
weekly regressions run for a very long time.  You can see here: 
https://gem5.googlesource.com/public/gem5/+/refs/heads/develop/tests/weekly.sh#176
 which ones we run on a weekly basis.  I expect all of those to pass (although 
your comment seems to indicate that is not always true?).  Your issues are 
exposing that perhaps we need to test more of them beyond these 3 - perhaps on 
a quarterly basis or something though to avoid inflating the weekly runtime.  
Having said that, I have not run LRN in a long time, as some ML people told me 
that LRN was not widely used anymore.  But when I did run it, I do remember it 
requiring a large amount of memory - which squares with what you are seeing 
here.  I thought LRN needed --mem-size=32GB to run, but based on your message
it seems that is not the case.

[gem5-users] Re: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

2022-03-15 Thread David Fong via gem5-users
Hi Matt S.,

Thanks for the detailed reply.

I looked at the link you sent me for the weekly run.

I see an additional parameter which I didn't use:

--reg-alloc-policy=dynamic

What does this do?

I was able to run the two other tests you use in your weekly runs
(test_fwd_pool and test_bwd_bn) for CUs=4.

David


From: Matt Sinclair 
Sent: Monday, March 14, 2022 7:41 PM
To: gem5 users mailing list 
Cc: David Fong ; Kyle Roarty ; 
Poremba, Matthew 
Subject: RE: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

Hi David,

I have not seen this mmap error before, and my initial guess was the mmap error 
is happening because you are trying to allocate more memory than we created 
when mmap'ing the inputs for the applications (we do this to speed up SE mode, 
because otherwise initializing arrays can take several hours).  However, the 
fact that it is failing in physical.cc and not in the application itself is 
throwing me off there.  Looking at where the failure is occurring, it seems the 
backing store code itself is failing here (from such a large allocation).  
Since the failure is with a C++ mmap call itself, that is perhaps more 
problematic - is "Cannot allocate memory" the failure from the perror() call on 
the line above the fatal() print?

Regarding the other question, and the failures more generally: we have never 
tested with > 64 CUs before, so certainly you are stressing the system and 
encountering different kinds of failures than we have seen previously.

In terms of applications, I had thought most/all of them passed previously, but 
we do not test each and every one all the time because this would make our 
weekly regressions run for a very long time.  You can see here: 
https://gem5.googlesource.com/public/gem5/+/refs/heads/develop/tests/weekly.sh#176
 which ones we run on a weekly basis.  I expect all of those to pass (although 
your comment seems to indicate that is not always true?).  Your issues are 
exposing that perhaps we need to test more of them beyond these 3 - perhaps on
a quarterly basis, though, to avoid inflating the weekly runtime.
Having said that, I have not run LRN in a long time, as some ML people told me 
that LRN was not widely used anymore.  But when I did run it, I do remember it 
requiring a large amount of memory - which squares with what you are seeing 
here.  I thought LRN needed --mem-size=32GB to run, but based on your message
it seems that is not the case.

@Matt P: have you tried LRN lately?  If so, have you run into the same 
OOM/backing store failures?

I know Kyle R. is looking into your other failure, so this one may have to wait 
behind it from our end, unless Matt P knows of a fix.

Thanks,
Matt

From: David Fong via gem5-users 
mailto:gem5-users@gem5.org>>
Sent: Monday, March 14, 2022 4:38 PM
To: David Fong via gem5-users mailto:gem5-users@gem5.org>>
Cc: David Fong mailto:da...@chronostech.com>>
Subject: [gem5-users] gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

Hi,

I'm getting an error related to memory for test_fwd_lrn.
When I increased the memory size from 4GB to 512GB, I got a memory size issue:
"out of memory".

build/GCN3_X86/gpu-compute/gpu_compute_driver.cc:599: warn: unimplemented 
ioctl: AMDKFD_IOC_SET_SCRATCH_BACKING_VA
build/GCN3_X86/gpu-compute/gpu_compute_driver.cc:609: warn: unimplemented 
ioctl: AMDKFD_IOC_SET_TRAP_HANDLER
build/GCN3_X86/sim/mem_pool.cc:120: fatal: fatal condition freePages() <= 0 
occurred: Out of memory, please increase size of physical memory.

But once I increased the mem size to 1024GB, 1536GB, or 2048GB, I got this
DRAM device capacity issue.

docker run --rm -v ${PWD}:${PWD} -v 
${PWD}/gem5/gem5-resources/src/gpu/DNNMark/cachefiles:/root/.cache/miopen/2.9.0 
-w ${PWD} gcr.io/gem5-test/gcn-gpu:v21-2 gem5/build/GCN3_X86/gem5.opt 
gem5/configs/example/apu_se.py --mem-size 1536GB --num-compute-units 256 -n3 
--benchmark-root=gem5/gem5-resources/src/gpu/DNNMark/build/benchmarks/test_fwd_lrn
 -cdnnmark_test_fwd_lrn --options="-config 
gem5/gem5-resources/src/gpu/DNNMark/config_example/lrn_config.dnnmark -mmap 
gem5/gem5-resources/src/gpu/DNNMark/mmap.bin" |& tee 
gem5_gpu_cu256_run_dnnmark_test_fwd_lrn_50latency.log
Global frequency set at 1 ticks per second
build/GCN3_X86/mem/mem_interface.cc:791: warn: DRAM device capacity (8192 
Mbytes) does not match the address range assigned (2097152 Mbytes)
mmap: Cannot allocate memory
build/GCN3_X86/mem/physical.cc:231: fatal: Could not mmap 1649267441664 bytes 
for range [0:0x180]!
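As a quick cross-check, the byte count in that fatal message is exactly the
1536GB passed via --mem-size:

    # python: 1536 GiB expressed in bytes
    print(1536 * 2**30)   # -> 1649267441664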


Smaller numbers of CUs, like 4, hit the same type of error as well.

Is there a regression script or regression log for DNNMark that shows mem-size
or configurations known to work for DNNMark tests, so I can use the same setup
to run a few of them?

[gem5-users] Re: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

2022-03-14 Thread Matt Sinclair via gem5-users
Hi David,

I have not seen this mmap error before, and my initial guess was the mmap error 
is happening because you are trying to allocate more memory than we created 
when mmap'ing the inputs for the applications (we do this to speed up SE mode, 
because otherwise initializing arrays can take several hours).  However, the 
fact that it is failing in physical.cc and not in the application itself is 
throwing me off there.  Looking at where the failure is occurring, it seems the 
backing store code itself is failing here (from such a large allocation).  
Since the failure is with a C++ mmap call itself, that is perhaps more 
problematic - is "Cannot allocate memory" the failure from the perror() call on 
the line above the fatal() print?

Regarding the other question, and the failures more generally: we have never 
tested with > 64 CUs before, so certainly you are stressing the system and 
encountering different kinds of failures than we have seen previously.

In terms of applications, I had thought most/all of them passed previously, but 
we do not test each and every one all the time because this would make our 
weekly regressions run for a very long time.  You can see here: 
https://gem5.googlesource.com/public/gem5/+/refs/heads/develop/tests/weekly.sh#176
 which ones we run on a weekly basis.  I expect all of those to pass (although 
your comment seems to indicate that is not always true?).  Your issues are 
exposing that perhaps we need to test more of them beyond these 3 - perhaps on
a quarterly basis, though, to avoid inflating the weekly runtime.
Having said that, I have not run LRN in a long time, as some ML people told me 
that LRN was not widely used anymore.  But when I did run it, I do remember it 
requiring a large amount of memory - which squares with what you are seeing 
here.  I thought LRN needed --mem-size=32GB to run, but based on your message
it seems that is not the case.

@Matt P: have you tried LRN lately?  If so, have you run into the same 
OOM/backing store failures?

I know Kyle R. is looking into your other failure, so this one may have to wait 
behind it from our end, unless Matt P knows of a fix.

Thanks,
Matt

From: David Fong via gem5-users 
Sent: Monday, March 14, 2022 4:38 PM
To: David Fong via gem5-users 
Cc: David Fong 
Subject: [gem5-users] gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

Hi,

I'm getting an error related to memory for test_fwd_lrn.
When I increased the memory size from 4GB to 512GB, I got a memory size issue:
"out of memory".

build/GCN3_X86/gpu-compute/gpu_compute_driver.cc:599: warn: unimplemented 
ioctl: AMDKFD_IOC_SET_SCRATCH_BACKING_VA
build/GCN3_X86/gpu-compute/gpu_compute_driver.cc:609: warn: unimplemented 
ioctl: AMDKFD_IOC_SET_TRAP_HANDLER
build/GCN3_X86/sim/mem_pool.cc:120: fatal: fatal condition freePages() <= 0 
occurred: Out of memory, please increase size of physical memory.

But once I increased the mem size to 1024GB, 1536GB, or 2048GB, I got this
DRAM device capacity issue.

docker run --rm -v ${PWD}:${PWD} -v 
${PWD}/gem5/gem5-resources/src/gpu/DNNMark/cachefiles:/root/.cache/miopen/2.9.0 
-w ${PWD} gcr.io/gem5-test/gcn-gpu:v21-2 gem5/build/GCN3_X86/gem5.opt 
gem5/configs/example/apu_se.py --mem-size 1536GB --num-compute-units 256 -n3 
--benchmark-root=gem5/gem5-resources/src/gpu/DNNMark/build/benchmarks/test_fwd_lrn
 -cdnnmark_test_fwd_lrn --options="-config 
gem5/gem5-resources/src/gpu/DNNMark/config_example/lrn_config.dnnmark -mmap 
gem5/gem5-resources/src/gpu/DNNMark/mmap.bin" |& tee 
gem5_gpu_cu256_run_dnnmark_test_fwd_lrn_50latency.log
Global frequency set at 1 ticks per second
build/GCN3_X86/mem/mem_interface.cc:791: warn: DRAM device capacity (8192 
Mbytes) does not match the address range assigned (2097152 Mbytes)
mmap: Cannot allocate memory
build/GCN3_X86/mem/physical.cc:231: fatal: Could not mmap 1649267441664 bytes 
for range [0:0x180]!


Smaller numbers of CUs, like 4, hit the same type of error as well.

Is there a regression script or regression log for DNNMark that shows mem-size
or configurations known to work for DNNMark tests, so I can use the same setup
to run a few of them?
Only test_fwd_softmax and test_bwd_softmax work for CU counts in
{4, 8, 16, 32, 64, 128, 256}.

Thanks,

David

___
gem5-users mailing list -- gem5-users@gem5.org
To unsubscribe send an email to gem5-users-le...@gem5.org