GCN: '--param=gcn-preferred-vector-lane-width=[default,32,64]' (was: [committed] amdgcn: Prefer V32 on RDNA devices)

2024-04-08 Thread Thomas Schwinge
Hi!

On 2024-03-28T08:00:50+0100, I wrote:
> On 2024-03-22T15:54:48+, Andrew Stubbs  wrote:
>> This patch alters the default (preferred) vector size to 32 on RDNA devices 
>> to
>> better match the actual hardware.  64-lane vectors will continue to be
>> used where they are hard-coded (such as function prologues).
>>
>> We run these devices in wavefrontsize64 for compatibility, but they actually
>> only have 32-lane vectors, natively.  If the upper part of a V64 is masked
>> off (as it is in V32) then RDNA devices will skip execution of the upper part
>> for most operations, so this adjustment shouldn't leave too much performance 
>> on
>> the table.  One exception is memory instructions, so full wavefrontsize32
>> support would be better.
>>
>> The advantage is that we avoid the missing V64 operations (such as permute 
>> and
>> vec_extract).
>>
>> Committed to mainline.
>
> In my GCN target '-march=gfx1100' testing, this commit
> "amdgcn: Prefer V32 on RDNA devices" does resolve (or, make latent?) a
> number of execution test FAILs (that is, regressions compared to earlier
> '-march=gfx90a' etc. testing).
>
> This commit also resolves (for my '-march=gfx1100' testing) one
> pre-existing FAIL (that is, already seen in '-march=gfx90a' earlier
> etc. testing):
>
> PASS: gcc.dg/tree-ssa/scev-14.c (test for excess errors)
> [-FAIL:-]{+PASS:+} gcc.dg/tree-ssa/scev-14.c scan-tree-dump ivopts 
> "Overflowness wrto loop niter:\tNo-overflow"
>
> That means, this test case specifically (or, just its 'scan-tree-dump'?)
> needs to be adjusted for GCN V64 testing?
>
> This commit, as you'd also mentioned elsewhere, however also causes a
> number of regressions in 'gcc.target/gcn/gcn.exp', see list below.
>
> Those can be "fixed" with 'dg-additional-options -march=gfx90a' (or
> similar) in the affected test cases (let me know if you'd like me to
> 'git push' that), but I suppose something more elaborate may be in order?
> (Conditionalize those on 'target { ! gcn_rdna }', and add respective
> scanning for 'target gcn_rdna'?  I can help with effective-target
> 'gcn_rdna' (or similar), if you'd like me to.)
>
> And/or, have a '-mpreferred-simd-mode=v64' (or similar) to be used for
> such test cases, to override 'if (TARGET_RDNA2_PLUS)' etc. in
> 'gcn_vectorize_preferred_simd_mode'?

The latter I have quickly implemented, see attached
"GCN: '--param=gcn-preferred-vector-lane-width=[default,32,64]'".  OK to
push to trunk branch?

(This '--param' will also be useful for another bug/regression I'm about
to file.)

> Best, probably, both these things, to properly test both V32 and V64?

That part remains to be done, but is best done by someone who actually
knowns "GCN" assembly/GCC back end -- that is, not me.


Grüße
 Thomas


> PASS: gcc.target/gcn/cond_fmaxnm_1.c (test for excess errors)
> [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_1.c scan-assembler-not 
> \\tv_writelane_b32\\tv[0-9]+, vcc_..
> [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_1.c scan-assembler-times 
> smaxv64df3_exec 3
> [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_1.c scan-assembler-times 
> smaxv64sf3_exec 3
> PASS: gcc.target/gcn/cond_fmaxnm_1_run.c (test for excess errors)
> PASS: gcc.target/gcn/cond_fmaxnm_1_run.c execution test
>
> PASS: gcc.target/gcn/cond_fmaxnm_2.c (test for excess errors)
> [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_2.c scan-assembler-not 
> \\tv_writelane_b32\\tv[0-9]+, vcc_..
> [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_2.c scan-assembler-times 
> smaxv64df3_exec 3
> [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_2.c scan-assembler-times 
> smaxv64sf3_exec 3
> PASS: gcc.target/gcn/cond_fmaxnm_2_run.c (test for excess errors)
> PASS: gcc.target/gcn/cond_fmaxnm_2_run.c execution test
>
> PASS: gcc.target/gcn/cond_fmaxnm_3.c (test for excess errors)
> [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_3.c scan-assembler-not 
> \\tv_writelane_b32\\tv[0-9]+, vcc_..
> [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_3.c scan-assembler-times 
> movv64df_exec 3
> [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_3.c scan-assembler-times 
> movv64sf_exec 3
> [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_3.c scan-assembler-times 
> smaxv64sf3 3
> [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_3.c scan-assembler-times 
> smaxv64sf3 3
> PASS: gcc.target/gcn/cond_fmaxnm_3_run.c (test for excess errors)
> PASS: gcc.target/gcn/cond_fmaxnm_3_run.c execution test
>
> PASS: gcc.target/gcn/cond_fmaxnm_4.c (test for excess errors)
> [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_4.c scan-assembler-not 
> \\tv_writelane_b32\\tv[0-9]+, vcc_..
> [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_4.c scan-assembler-times 
> movv64df_exec 3
> [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_4.c scan-assembler-times 
> movv64sf_exec 3
> [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_4.c scan-assembler-times 
> smaxv64sf3 3
> [-PASS:-]{+FAIL:+} 

Re: [committed] amdgcn: Prefer V32 on RDNA devices

2024-03-28 Thread Thomas Schwinge
Hi Andrew!

On 2024-03-22T15:54:48+, Andrew Stubbs  wrote:
> This patch alters the default (preferred) vector size to 32 on RDNA devices to
> better match the actual hardware.  64-lane vectors will continue to be
> used where they are hard-coded (such as function prologues).
>
> We run these devices in wavefrontsize64 for compatibility, but they actually
> only have 32-lane vectors, natively.  If the upper part of a V64 is masked
> off (as it is in V32) then RDNA devices will skip execution of the upper part
> for most operations, so this adjustment shouldn't leave too much performance 
> on
> the table.  One exception is memory instructions, so full wavefrontsize32
> support would be better.
>
> The advantage is that we avoid the missing V64 operations (such as permute and
> vec_extract).
>
> Committed to mainline.

In my GCN target '-march=gfx1100' testing, this commit
"amdgcn: Prefer V32 on RDNA devices" does resolve (or, make latent?) a
number of execution test FAILs (that is, regressions compared to earlier
'-march=gfx90a' etc. testing).

This commit also resolves (for my '-march=gfx1100' testing) one
pre-existing FAIL (that is, already seen in '-march=gfx90a' earlier
etc. testing):

PASS: gcc.dg/tree-ssa/scev-14.c (test for excess errors)
[-FAIL:-]{+PASS:+} gcc.dg/tree-ssa/scev-14.c scan-tree-dump ivopts 
"Overflowness wrto loop niter:\tNo-overflow"

That means, this test case specifically (or, just its 'scan-tree-dump'?)
needs to be adjusted for GCN V64 testing?

This commit, as you'd also mentioned elsewhere, however also causes a
number of regressions in 'gcc.target/gcn/gcn.exp', see list below.

Those can be "fixed" with 'dg-additional-options -march=gfx90a' (or
similar) in the affected test cases (let me know if you'd like me to
'git push' that), but I suppose something more elaborate may be in order?
(Conditionalize those on 'target { ! gcn_rdna }', and add respective
scanning for 'target gcn_rdna'?  I can help with effective-target
'gcn_rdna' (or similar), if you'd like me to.)

And/or, have a '-mpreferred-simd-mode=v64' (or similar) to be used for
such test cases, to override 'if (TARGET_RDNA2_PLUS)' etc. in
'gcn_vectorize_preferred_simd_mode'?

Best, probably, both these things, to properly test both V32 and V64?

PASS: gcc.target/gcn/cond_fmaxnm_1.c (test for excess errors)
[-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_1.c scan-assembler-not 
\\tv_writelane_b32\\tv[0-9]+, vcc_..
[-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_1.c scan-assembler-times 
smaxv64df3_exec 3
[-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_1.c scan-assembler-times 
smaxv64sf3_exec 3
PASS: gcc.target/gcn/cond_fmaxnm_1_run.c (test for excess errors)
PASS: gcc.target/gcn/cond_fmaxnm_1_run.c execution test

PASS: gcc.target/gcn/cond_fmaxnm_2.c (test for excess errors)
[-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_2.c scan-assembler-not 
\\tv_writelane_b32\\tv[0-9]+, vcc_..
[-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_2.c scan-assembler-times 
smaxv64df3_exec 3
[-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_2.c scan-assembler-times 
smaxv64sf3_exec 3
PASS: gcc.target/gcn/cond_fmaxnm_2_run.c (test for excess errors)
PASS: gcc.target/gcn/cond_fmaxnm_2_run.c execution test

PASS: gcc.target/gcn/cond_fmaxnm_3.c (test for excess errors)
[-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_3.c scan-assembler-not 
\\tv_writelane_b32\\tv[0-9]+, vcc_..
[-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_3.c scan-assembler-times 
movv64df_exec 3
[-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_3.c scan-assembler-times 
movv64sf_exec 3
[-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_3.c scan-assembler-times 
smaxv64sf3 3
[-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_3.c scan-assembler-times 
smaxv64sf3 3
PASS: gcc.target/gcn/cond_fmaxnm_3_run.c (test for excess errors)
PASS: gcc.target/gcn/cond_fmaxnm_3_run.c execution test

PASS: gcc.target/gcn/cond_fmaxnm_4.c (test for excess errors)
[-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_4.c scan-assembler-not 
\\tv_writelane_b32\\tv[0-9]+, vcc_..
[-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_4.c scan-assembler-times 
movv64df_exec 3
[-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_4.c scan-assembler-times 
movv64sf_exec 3
[-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_4.c scan-assembler-times 
smaxv64sf3 3
[-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_4.c scan-assembler-times 
smaxv64sf3 3
PASS: gcc.target/gcn/cond_fmaxnm_4_run.c (test for excess errors)
PASS: gcc.target/gcn/cond_fmaxnm_4_run.c execution test

PASS: gcc.target/gcn/cond_fmaxnm_5.c (test for excess errors)
[-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_5.c scan-assembler-not 
\\tv_writelane_b32\\tv[0-9]+, vcc_..
[-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_5.c scan-assembler-times 
smaxv64df3_exec 3
[-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_5.c scan-assembler-times 
smaxv64sf3_exec 3
PASS: 

[committed] amdgcn: Prefer V32 on RDNA devices

2024-03-22 Thread Andrew Stubbs
This patch alters the default (preferred) vector size to 32 on RDNA devices to
better match the actual hardware.  64-lane vectors will continue to be
used where they are hard-coded (such as function prologues).

We run these devices in wavefrontsize64 for compatibility, but they actually
only have 32-lane vectors, natively.  If the upper part of a V64 is masked
off (as it is in V32) then RDNA devices will skip execution of the upper part
for most operations, so this adjustment shouldn't leave too much performance on
the table.  One exception is memory instructions, so full wavefrontsize32
support would be better.

The advantage is that we avoid the missing V64 operations (such as permute and
vec_extract).

Committed to mainline.

Andrew

gcc/ChangeLog:

* config/gcn/gcn.cc (gcn_vectorize_preferred_simd_mode): Prefer V32 on
RDNA devices.
---
 gcc/config/gcn/gcn.cc | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index 498146dcde9..efb73af50c4 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -5226,6 +5226,32 @@ gcn_vector_mode_supported_p (machine_mode mode)
 static machine_mode
 gcn_vectorize_preferred_simd_mode (scalar_mode mode)
 {
+  /* RDNA devices have 32-lane vectors with limited support for 64-bit vectors
+ (in particular, permute operations are only available for cases that don't
+ span the 32-lane boundary).
+
+ From the RDNA3 manual: "Hardware may choose to skip either half if the
+ EXEC mask for that half is all zeros...". This means that preferring
+ 32-lanes is a good stop-gap until we have proper wave32 support.  */
+  if (TARGET_RDNA2_PLUS)
+switch (mode)
+  {
+  case E_QImode:
+   return V32QImode;
+  case E_HImode:
+   return V32HImode;
+  case E_SImode:
+   return V32SImode;
+  case E_DImode:
+   return V32DImode;
+  case E_SFmode:
+   return V32SFmode;
+  case E_DFmode:
+   return V32DFmode;
+  default:
+   return word_mode;
+  }
+
   switch (mode)
 {
 case E_QImode:
-- 
2.41.0