Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-15 Thread Andrew Stubbs

On 15/02/2024 10:23, Thomas Schwinge wrote:

Hi!

On 2024-02-15T08:49:17+0100, Richard Biener  wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:

On 14/02/2024 13:43, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:

On 14/02/2024 13:27, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:

On 13/02/2024 08:26, Richard Biener wrote:

On Mon, 12 Feb 2024, Thomas Schwinge wrote:

On 2023-10-20T12:51:03+0100, Andrew Stubbs 
wrote:

I've committed this patch


... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
"amdgcn: add -march=gfx1030 EXPERIMENTAL".

The RDNA2 ISA variant doesn't support certain instructions previously
implemented in GCC/GCN, so a number of patterns etc. had to be
disabled:


[...] Vector
reductions will need to be reworked for RDNA2.  [...]



* config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
(addc3): Add RDNA2 syntax variant.
(subc3): Likewise.
(2_exec): Add RDNA2 alternatives.
(vec_cmpdi): Likewise.
(vec_cmpdi): Likewise.
(vec_cmpdi_exec): Likewise.
(vec_cmpdi_exec): Likewise.
(vec_cmpdi_dup): Likewise.
(vec_cmpdi_dup_exec): Likewise.
(reduc__scal_): Disable for RDNA2.
(*_dpp_shr_): Likewise.
(*plus_carry_dpp_shr_): Likewise.
(*plus_carry_in_dpp_shr_): Likewise.


Etc.  The expectation being that GCC middle end copes with this, and
synthesizes some less ideal yet still functional vector code, I presume.

The later RDNA3/gfx1100 support builds on top of this, and that's what
I'm currently working on getting proper GCC/GCN target (not offloading)
results for.

I'm seeing a good number of execution test FAILs (regressions compared to
my earlier non-gfx1100 testing), and I've now tracked down where one
large class of those comes into existence -- [...]



With the following hack applied to 'gcc/tree-vect-loop.cc':

@@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
   reduce_with_shift = have_whole_vector_shift (mode1);
   if (!VECTOR_MODE_P (mode1)
       || !directly_supported_p (code, vectype1))
     reduce_with_shift = false;
+  reduce_with_shift = false;

..., I'm able to work around those regressions: by means of forcing
"Reduce using scalar code" instead of "Reduce using vector shifts".



The attached not-well-tested patch should allow only valid permutations.
Hopefully we go back to working code, but there'll be things that won't
vectorize. That said, the new "dump" output code has fewer and probably
cheaper instructions, so hmmm.
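
(The patch itself isn't quoted in this thread excerpt.  Purely as a sketch of
the idea -- not the actual back-end change -- a "valid permutation" check of
the kind described might have roughly this shape; the 32-lane-group rule
(based on RDNA treating a 64-lane vector as a pair of 32-lane hardware
registers, as discussed further down the thread) and the helper name are
assumptions for illustration only:)

  /* Hypothetical helper: accept only permutation selectors in which every
     element stays within its own 32-lane group.  */
  static bool
  rdna_permutation_valid_p (const unsigned int *sel, unsigned int nelt)
  {
    for (unsigned int i = 0; i < nelt; i++)
      if (sel[i] / 32 != i / 32)
        return false;
    return true;
  }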


This fixes the reduced builtin-bitops-1.c on RDNA2.


I confirm that "amdgcn: Disallow unsupported permute on RDNA devices"
also obsoletes my 'reduce_with_shift = false;' hack -- and also cures a
good number of additional FAILs (regressions), where presumably we
permute via different code paths.  Thanks!

There also are a few regressions, but only minor:

 PASS: gcc.dg/vect/no-vfa-vect-depend-3.c (test for excess errors)
 PASS: gcc.dg/vect/no-vfa-vect-depend-3.c execution test
 PASS: gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "vectorized 1 loops" 4
 [-PASS:-]{+FAIL:+} gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "dependence distance negative" 4

..., because:

 gcc.dg/vect/no-vfa-vect-depend-3.c: pattern found 6 times
 FAIL: gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "dependence distance negative" 4

 PASS: gcc.dg/vect/vect-119.c (test for excess errors)
 [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-119.c scan-tree-dump-times vect "Detected interleaving load of size 2" 1
 PASS: gcc.dg/vect/vect-119.c scan-tree-dump-not optimized "Invalid sum"

..., because:

 gcc.dg/vect/vect-119.c: pattern found 3 times
 FAIL: gcc.dg/vect/vect-119.c scan-tree-dump-times vect "Detected interleaving load of size 2" 1

 PASS: gcc.dg/vect/vect-reduc-mul_1.c (test for excess errors)
 PASS: gcc.dg/vect/vect-reduc-mul_1.c execution test
 [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-mul_1.c scan-tree-dump vect "Reduce using vector shifts"

 PASS: gcc.dg/vect/vect-reduc-mul_2.c (test for excess errors)
 PASS: gcc.dg/vect/vect-reduc-mul_2.c execution test
 [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-mul_2.c scan-tree-dump vect "Reduce using vector shifts"

..., plus the following, in combination with the earlier changes
disabling patterns:

 PASS: gcc.dg/vect/vect-reduc-or_1.c (test for excess errors)
 PASS: gcc.dg/vect/vect-reduc-or_1.c execution test
 [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-or_1.c scan-tree-dump vect "Reduce using direct vector reduction"

 PASS: gcc.dg/vect/vect-reduc-or_2.c (test for excess errors)
 PASS: gcc.dg/vect/vect-reduc-or_2.c execution test
 [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-or_2.c scan-tree-dump vect "Reduce using direct vector reduction"

Such test cases will need conditionalization on specific 

Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-15 Thread Richard Biener
On Thu, 15 Feb 2024, Andrew Stubbs wrote:

> On 15/02/2024 10:21, Richard Biener wrote:
> [snip]
> >>> I suppose if RDNA really only has 32 lane vectors (it sounds like it,
> >>> even if it can "simulate" 64 lane ones?) then it might make sense to
> >>> vectorize for 32 lanes?  That said, with variable-length it likely
> >>> doesn't matter but I'd not expose fixed-size modes with 64 lanes then?
> >>
> >> For most operations, wavefrontsize=64 works just fine; the GPU runs each
> >> instruction twice and presents a pair of hardware registers as a logical
> >> 64-lane register. This breaks down for permutations and reductions, and is
> >> obviously inefficient when the vectors are not fully utilized, but is
> >> otherwise compatible with the GCN/CDNA compiler.
> >>
> >> I didn't want to invest all the effort it would take to support
> >> wavefrontsize=32, which would be the natural mode for these devices; the
> >> number of places that have "64" hard-coded is just too big. Not only that,
> >> but
> >> the EXEC and VCC registers change from DImode to SImode and that's going to
> >> break a lot of stuff. (And we have no paying customer for this.)
> >>
> >> I'm open to patch submissions. :)
> > 
> > OK, I see ;)  As said for fully masked that's a good answer.  I'd
> > probably still not expose V64mode modes in the RTL expanders for the
> > vect_* patterns?  Or, what happens if you change
> > gcn_vectorize_preferred_simd_mode to return 32 lane modes for RDNA
> > and omit 64 lane modes from gcn_autovectorize_vector_modes for RDNA?
> 
> Changing the preferred mode probably would fix permute.
> 
> > Does that possibly leave performance on the plate? (not sure if there's
> > any documents about choosing wavefrontsize=64 vs 32 with regard to
> > performance)
> > 
> > Note it wouldn't entirely forbid the vectorizer from using larger modes,
> > it just makes it prefer the smaller ones.  OTOH if you then run
> > wavefrontsize=64 on top of it it's probably wasting the 2nd instruction
> > by always masking it?
> 
> Right, the GPU will continue to process the "top half" of the vector as an
> additional step, regardless of whether you put anything useful there, or not.
> 
> > So yeah.  Guess a s/64/wavefrontsize/ would be a first step towards
> > allowing 32 there ...
> 
> I think the DImode to SImode change is the most difficult fix. Unless you know
> of a cunning trick, that's going to mean a lot of changes to a lot of the
> machine description; substitutions, duplications, iterators, indirections,
> etc., etc., etc.

Hmm, maybe just leave it at DImode in the patterns?  OTOH mode
iterators to do both SImode and DImode might work as well, but yeah,
a lot of churn.

Richard.


Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-15 Thread Andrew Stubbs

On 15/02/2024 10:21, Richard Biener wrote:
[snip]

I suppose if RDNA really only has 32 lane vectors (it sounds like it,
even if it can "simulate" 64 lane ones?) then it might make sense to
vectorize for 32 lanes?  That said, with variable-length it likely
doesn't matter but I'd not expose fixed-size modes with 64 lanes then?


For most operations, wavefrontsize=64 works just fine; the GPU runs each
instruction twice and presents a pair of hardware registers as a logical
64-lane register. This breaks down for permutations and reductions, and is
obviously inefficient when the vectors are not fully utilized, but is
otherwise compatible with the GCN/CDNA compiler.
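
(As a purely conceptual model of the above -- an editorial sketch, not GCC
or ISA code, with all names invented for illustration -- a wave64
element-wise operation can be pictured as two 32-lane passes over the halves
of a logical register:)

  #include <stdint.h>

  typedef struct { uint32_t lane[32]; } hwreg;    /* one 32-lane hardware register */
  typedef struct { hwreg lo, hi; } wave64reg;     /* logical 64-lane register = a pair */

  /* Element-wise operations split cleanly into two per-half passes...  */
  static wave64reg
  wave64_add (wave64reg a, wave64reg b)
  {
    wave64reg r;
    for (int i = 0; i < 32; i++)
      r.lo.lane[i] = a.lo.lane[i] + b.lo.lane[i]; /* pass 1: low half */
    for (int i = 0; i < 32; i++)
      r.hi.lane[i] = a.hi.lane[i] + b.hi.lane[i]; /* pass 2: high half */
    return r;
  }
  /* ...whereas a permutation or reduction step whose lanes must read across
     the 32-lane boundary has no such per-half decomposition, which is where
     this scheme breaks down.  */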

I didn't want to invest all the effort it would take to support
wavefrontsize=32, which would be the natural mode for these devices; the
number of places that have "64" hard-coded is just too big. Not only that, but
the EXEC and VCC registers change from DImode to SImode and that's going to
break a lot of stuff. (And we have no paying customer for this.)

I'm open to patch submissions. :)


OK, I see ;)  As said for fully masked that's a good answer.  I'd
probably still not expose V64mode modes in the RTL expanders for the
vect_* patterns?  Or, what happens if you change
gcn_vectorize_preferred_simd_mode to return 32 lane modes for RDNA
and omit 64 lane modes from gcn_autovectorize_vector_modes for RDNA?


Changing the preferred mode probably would fix permute.
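
(For concreteness, a change along those lines might look roughly like the
sketch below.  This is an editorial guess rather than the actual gcn.cc
hook: TARGET_RDNA2_PLUS is a placeholder for whatever RDNA condition the
back end really uses, the sketch assumes the 32-lane variants of the vector
modes the back end already provides, and element modes other than
SI/DI/SF/DF are omitted.)

  /* Sketch only: prefer 32-lane vector modes on RDNA devices, 64-lane
     modes elsewhere.  */
  static machine_mode
  gcn_vectorize_preferred_simd_mode (scalar_mode mode)
  {
    bool rdna = TARGET_RDNA2_PLUS;  /* assumed placeholder condition */
    switch (mode)
      {
      case E_SImode: return rdna ? V32SImode : V64SImode;
      case E_DImode: return rdna ? V32DImode : V64DImode;
      case E_SFmode: return rdna ? V32SFmode : V64SFmode;
      case E_DFmode: return rdna ? V32DFmode : V64DFmode;
      default:       return word_mode;
      }
  }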


Does that possibly leave performance on the plate? (not sure if there's
any documents about choosing wavefrontsize=64 vs 32 with regard to
performance)

Note it wouldn't entirely forbid the vectorizer from using larger modes,
it just makes it prefer the smaller ones.  OTOH if you then run
wavefrontsize=64 on top of it it's probably wasting the 2nd instruction
by always masking it?


Right, the GPU will continue to process the "top half" of the vector as 
an additional step, regardless of whether you put anything useful there, or
not.



So yeah.  Guess a s/64/wavefrontsize/ would be a first step towards
allowing 32 there ...


I think the DImode to SImode change is the most difficult fix. Unless 
you know of a cunning trick, that's going to mean a lot of changes to a 
lot of the machine description; substitutions, duplications, iterators, 
indirections, etc., etc., etc.


The "64" substitution would be tedious but less hairy. I did a lot of 
those when I created the fake vector sizes.



Anyway, the fix works, so that's the most important thing ;)


:)

Andrew


Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-15 Thread Thomas Schwinge
Hi!

On 2024-02-15T08:49:17+0100, Richard Biener  wrote:
> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>> On 14/02/2024 13:43, Richard Biener wrote:
>> > On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>> >> On 14/02/2024 13:27, Richard Biener wrote:
>> >>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>>  On 13/02/2024 08:26, Richard Biener wrote:
>> > On Mon, 12 Feb 2024, Thomas Schwinge wrote:
>> >> On 2023-10-20T12:51:03+0100, Andrew Stubbs 
>> >> wrote:
>> >>> I've committed this patch
>> >>
>> >> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>> >> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
>> >>
>> >> The RDNA2 ISA variant doesn't support certain instructions previously
>> >> implemented in GCC/GCN, so a number of patterns etc. had to be
>> >> disabled:
>> >>
>> >>> [...] Vector
>> >>> reductions will need to be reworked for RDNA2.  [...]
>> >>
>> >>>* config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
>> >>>(addc3): Add RDNA2 syntax variant.
>> >>>(subc3): Likewise.
>> >>>(2_exec): Add RDNA2 alternatives.
>> >>>(vec_cmpdi): Likewise.
>> >>>(vec_cmpdi): Likewise.
>> >>>(vec_cmpdi_exec): Likewise.
>> >>>(vec_cmpdi_exec): Likewise.
>> >>>(vec_cmpdi_dup): Likewise.
>> >>>(vec_cmpdi_dup_exec): Likewise.
>> >>>(reduc__scal_): Disable for RDNA2.
>> >>>(*_dpp_shr_): Likewise.
>> >>>(*plus_carry_dpp_shr_): Likewise.
>> >>>(*plus_carry_in_dpp_shr_): Likewise.
>> >>
>> >> Etc.  The expectation being that GCC middle end copes with this, and
>> >> synthesizes some less ideal yet still functional vector code, I 
>> >> presume.
>> >>
>> >> The later RDNA3/gfx1100 support builds on top of this, and that's what
>> >> I'm currently working on getting proper GCC/GCN target (not 
>> >> offloading)
>> >> results for.
>> >>
>> >> I'm seeing a good number of execution test FAILs (regressions 
>> >> compared to
>> >> my earlier non-gfx1100 testing), and I've now tracked down where one
>> >> large class of those comes into existence -- [...]

>> >> With the following hack applied to 'gcc/tree-vect-loop.cc':
>> >>
>> >>@@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
>> >>   reduce_with_shift = have_whole_vector_shift (mode1);
>> >>   if (!VECTOR_MODE_P (mode1)
>> >>       || !directly_supported_p (code, vectype1))
>> >>     reduce_with_shift = false;
>> >>+  reduce_with_shift = false;
>> >>
>> >> ..., I'm able to work around those regressions: by means of forcing
>> >> "Reduce using scalar code" instead of "Reduce using vector shifts".

>> The attached not-well-tested patch should allow only valid permutations.
>> Hopefully we go back to working code, but there'll be things that won't
>> vectorize. That said, the new "dump" output code has fewer and probably
>> cheaper instructions, so hmmm.
>
> This fixes the reduced builtin-bitops-1.c on RDNA2.

I confirm that "amdgcn: Disallow unsupported permute on RDNA devices"
also obsoletes my 'reduce_with_shift = false;' hack -- and also cures a
good number of additional FAILs (regressions), where presumably we
permute via different code paths.  Thanks!

There also are a few regressions, but only minor:

PASS: gcc.dg/vect/no-vfa-vect-depend-3.c (test for excess errors)
PASS: gcc.dg/vect/no-vfa-vect-depend-3.c execution test
PASS: gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "vectorized 1 loops" 4
[-PASS:-]{+FAIL:+} gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "dependence distance negative" 4

..., because:

gcc.dg/vect/no-vfa-vect-depend-3.c: pattern found 6 times
FAIL: gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "dependence distance negative" 4

PASS: gcc.dg/vect/vect-119.c (test for excess errors)
[-PASS:-]{+FAIL:+} gcc.dg/vect/vect-119.c scan-tree-dump-times vect "Detected interleaving load of size 2" 1
PASS: gcc.dg/vect/vect-119.c scan-tree-dump-not optimized "Invalid sum"

..., because:

gcc.dg/vect/vect-119.c: pattern found 3 times
FAIL: gcc.dg/vect/vect-119.c scan-tree-dump-times vect "Detected interleaving load of size 2" 1

PASS: gcc.dg/vect/vect-reduc-mul_1.c (test for excess errors)
PASS: gcc.dg/vect/vect-reduc-mul_1.c execution test
[-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-mul_1.c scan-tree-dump vect "Reduce using vector shifts"

PASS: gcc.dg/vect/vect-reduc-mul_2.c (test for excess errors)
PASS: gcc.dg/vect/vect-reduc-mul_2.c execution test
[-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-mul_2.c scan-tree-dump vect "Reduce using vector shifts"

..., plus the following, in combination with the earlier changes
disabling patterns:

PASS: 

Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-15 Thread Richard Biener
On Thu, 15 Feb 2024, Andrew Stubbs wrote:

> On 15/02/2024 07:49, Richard Biener wrote:
> > On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> > 
> >> On 14/02/2024 13:43, Richard Biener wrote:
> >>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> >>>
>  On 14/02/2024 13:27, Richard Biener wrote:
> > On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> >
> >> On 13/02/2024 08:26, Richard Biener wrote:
> >>> On Mon, 12 Feb 2024, Thomas Schwinge wrote:
> >>>
>  Hi!
> 
>  On 2023-10-20T12:51:03+0100, Andrew Stubbs 
>  wrote:
> > I've committed this patch
> 
>  ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>  "amdgcn: add -march=gfx1030 EXPERIMENTAL".
> 
>  The RDNA2 ISA variant doesn't support certain instructions previously
>  implemented in GCC/GCN, so a number of patterns etc. had to be
>  disabled:
> 
> > [...] Vector
> > reductions will need to be reworked for RDNA2.  [...]
> 
> > * config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
> > (addc3): Add RDNA2 syntax variant.
> > (subc3): Likewise.
> > (2_exec): Add RDNA2 alternatives.
> > (vec_cmpdi): Likewise.
> > (vec_cmpdi): Likewise.
> > (vec_cmpdi_exec): Likewise.
> > (vec_cmpdi_exec): Likewise.
> > (vec_cmpdi_dup): Likewise.
> > (vec_cmpdi_dup_exec): Likewise.
> > (reduc__scal_): Disable for RDNA2.
> > (*_dpp_shr_): Likewise.
> > (*plus_carry_dpp_shr_): Likewise.
> > (*plus_carry_in_dpp_shr_): Likewise.
> 
>  Etc.  The expectation being that GCC middle end copes with this, and
>  synthesizes some less ideal yet still functional vector code, I
>  presume.
> 
>  The later RDNA3/gfx1100 support builds on top of this, and that's
>  what
>  I'm currently working on getting proper GCC/GCN target (not
>  offloading)
>  results for.
> 
>  I'm seeing a good number of execution test FAILs (regressions
>  compared
>  to
>  my earlier non-gfx1100 testing), and I've now tracked down where one
>  large class of those comes into existence -- not yet how to resolve,
>  unfortunately.  But maybe, with you guys' combined vectorizer and
>  back
>  end experience, the latter will be done quickly?
> 
>  Richard, I don't know if you've ever run actual GCC/GCN target (not
>  offloading) testing; let me know if you have any questions about
>  that.
> >>>
> >>> I've only done offload testing - in the x86_64 build tree run
> >>> check-target-libgomp.  If you can tell me how to do GCN target testing
> >>> (maybe document it on the wiki even!) I can try to do that as well.
> >>>
>  Given that (at least largely?) the same patterns etc. are disabled as
>  in
>  my gfx1100 configuration, I suppose your gfx1030 one would exhibit
>  the
>  same issues.  You can build GCC/GCN target like you build the
>  offloading
>  one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you
>  can
>  even use an offloading GCC/GCN build to reproduce the issue below.
> 
>  One example is the attached 'builtin-bitops-1.c', reduced from
>  'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
>  miscompiled as soon as '-ftree-vectorize' is effective:
> 
>  $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
>  -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
>  -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
>  -fdump-ipa-all-all -fdump-rtl-all-all -save-temps
>  -march=gfx1100
>  -O1
>  -ftree-vectorize
> 
>  In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
>  '-march=gfx90a' vs. '-march=gfx1100', we see:
> 
>  +builtin-bitops-1.c:7:17: missed:   reduc op not supported by
>  target.
> 
>  ..., and therefore:
> 
>  -builtin-bitops-1.c:7:17: note:  Reduce using direct vector
>  reduction.
>  +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
>  +builtin-bitops-1.c:7:17: note:  extract scalar result
> 
>  That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we
>  build
>  a
>  chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
>  generated:
> 
>  $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>  i=1, ints[i]=0x1 a=1, b=2
>  i=2, 

Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-15 Thread Andrew Stubbs

On 15/02/2024 07:49, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:


On 14/02/2024 13:43, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:


On 14/02/2024 13:27, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:


On 13/02/2024 08:26, Richard Biener wrote:

On Mon, 12 Feb 2024, Thomas Schwinge wrote:


Hi!

On 2023-10-20T12:51:03+0100, Andrew Stubbs 
wrote:

I've committed this patch


... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
"amdgcn: add -march=gfx1030 EXPERIMENTAL".

The RDNA2 ISA variant doesn't support certain instructions previously
implemented in GCC/GCN, so a number of patterns etc. had to be
disabled:


[...] Vector
reductions will need to be reworked for RDNA2.  [...]



* config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
(addc3): Add RDNA2 syntax variant.
(subc3): Likewise.
(2_exec): Add RDNA2 alternatives.
(vec_cmpdi): Likewise.
(vec_cmpdi): Likewise.
(vec_cmpdi_exec): Likewise.
(vec_cmpdi_exec): Likewise.
(vec_cmpdi_dup): Likewise.
(vec_cmpdi_dup_exec): Likewise.
(reduc__scal_): Disable for RDNA2.
(*_dpp_shr_): Likewise.
(*plus_carry_dpp_shr_): Likewise.
(*plus_carry_in_dpp_shr_): Likewise.


Etc.  The expectation being that GCC middle end copes with this, and
synthesizes some less ideal yet still functional vector code, I
presume.

The later RDNA3/gfx1100 support builds on top of this, and that's what
I'm currently working on getting proper GCC/GCN target (not offloading)
results for.

I'm seeing a good number of execution test FAILs (regressions compared
to
my earlier non-gfx1100 testing), and I've now tracked down where one
large class of those comes into existence -- not yet how to resolve,
unfortunately.  But maybe, with you guys' combined vectorizer and back
end experience, the latter will be done quickly?

Richard, I don't know if you've ever run actual GCC/GCN target (not
offloading) testing; let me know if you have any questions about that.


I've only done offload testing - in the x86_64 build tree run
check-target-libgomp.  If you can tell me how to do GCN target testing
(maybe document it on the wiki even!) I can try to do that as well.


Given that (at least largely?) the same patterns etc. are disabled as
in
my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
same issues.  You can build GCC/GCN target like you build the
offloading
one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
even use an offloading GCC/GCN build to reproduce the issue below.

One example is the attached 'builtin-bitops-1.c', reduced from
'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
miscompiled as soon as '-ftree-vectorize' is effective:

$ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
-Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
-Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
-fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100
-O1
-ftree-vectorize

In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
'-march=gfx90a' vs. '-march=gfx1100', we see:

+builtin-bitops-1.c:7:17: missed:   reduc op not supported by
target.

..., and therefore:

-builtin-bitops-1.c:7:17: note:  Reduce using direct vector
reduction.
+builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
+builtin-bitops-1.c:7:17: note:  extract scalar result

That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build
a
chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
generated:

$ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
i=1, ints[i]=0x1 a=1, b=2
i=2, ints[i]=0x8000 a=1, b=2
i=3, ints[i]=0x2 a=1, b=2
i=4, ints[i]=0x4000 a=1, b=2
i=5, ints[i]=0x1 a=1, b=2
i=6, ints[i]=0x8000 a=1, b=2
i=7, ints[i]=0xa5a5a5a5 a=16, b=32
i=8, ints[i]=0x5a5a5a5a a=16, b=32
i=9, ints[i]=0xcafe a=11, b=22
i=10, ints[i]=0xcafe00 a=11, b=22
i=11, ints[i]=0xcafe a=11, b=22
i=12, ints[i]=0x a=32, b=64

(I can't tell if the 'b = 2 * a' pattern is purely coincidental?)

I don't speak enough "vectorization" to fully understand the generic
vectorized algorithm and its implementation.  It appears that the
"Reduce using vector shifts" code has been around for a very long time,
but also has gone through a number of changes.  I can't tell which GCC
targets/configurations it's actually used for (in the same way as for
GCN gfx1100), and thus whether there's an issue in that vectorizer
code,
or rather in the GCN back end, or GCN back end parameterizing the
generic
code?


The "shift" reduction is basically doing reduction by repeatedly
adding the upper to the lower half of the vector (each time halving
the vector size).


Manually working through the 'a-builtin-bitops-1.c.265t.optimized'
code:


Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-14 Thread Richard Biener
On Wed, 14 Feb 2024, Andrew Stubbs wrote:

> On 14/02/2024 13:43, Richard Biener wrote:
> > On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> > 
> >> On 14/02/2024 13:27, Richard Biener wrote:
> >>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> >>>
>  On 13/02/2024 08:26, Richard Biener wrote:
> > On Mon, 12 Feb 2024, Thomas Schwinge wrote:
> >
> >> Hi!
> >>
> >> On 2023-10-20T12:51:03+0100, Andrew Stubbs 
> >> wrote:
> >>> I've committed this patch
> >>
> >> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> >> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
> >>
> >> The RDNA2 ISA variant doesn't support certain instructions previously
> >> implemented in GCC/GCN, so a number of patterns etc. had to be
> >> disabled:
> >>
> >>> [...] Vector
> >>> reductions will need to be reworked for RDNA2.  [...]
> >>
> >>>* config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
> >>>(addc3): Add RDNA2 syntax variant.
> >>>(subc3): Likewise.
> >>>(2_exec): Add RDNA2 alternatives.
> >>>(vec_cmpdi): Likewise.
> >>>(vec_cmpdi): Likewise.
> >>>(vec_cmpdi_exec): Likewise.
> >>>(vec_cmpdi_exec): Likewise.
> >>>(vec_cmpdi_dup): Likewise.
> >>>(vec_cmpdi_dup_exec): Likewise.
> >>>(reduc__scal_): Disable for RDNA2.
> >>>(*_dpp_shr_): Likewise.
> >>>(*plus_carry_dpp_shr_): Likewise.
> >>>(*plus_carry_in_dpp_shr_): Likewise.
> >>
> >> Etc.  The expectation being that GCC middle end copes with this, and
> >> synthesizes some less ideal yet still functional vector code, I
> >> presume.
> >>
> >> The later RDNA3/gfx1100 support builds on top of this, and that's what
> >> I'm currently working on getting proper GCC/GCN target (not offloading)
> >> results for.
> >>
> >> I'm seeing a good number of execution test FAILs (regressions compared
> >> to
> >> my earlier non-gfx1100 testing), and I've now tracked down where one
> >> large class of those comes into existence -- not yet how to resolve,
> >> unfortunately.  But maybe, with you guys' combined vectorizer and back
> >> end experience, the latter will be done quickly?
> >>
> >> Richard, I don't know if you've ever run actual GCC/GCN target (not
> >> offloading) testing; let me know if you have any questions about that.
> >
> > I've only done offload testing - in the x86_64 build tree run
> > check-target-libgomp.  If you can tell me how to do GCN target testing
> > (maybe document it on the wiki even!) I can try to do that as well.
> >
> >> Given that (at least largely?) the same patterns etc. are disabled as
> >> in
> >> my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
> >> same issues.  You can build GCC/GCN target like you build the
> >> offloading
> >> one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
> >> even use an offloading GCC/GCN build to reproduce the issue below.
> >>
> >> One example is the attached 'builtin-bitops-1.c', reduced from
> >> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
> >> miscompiled as soon as '-ftree-vectorize' is effective:
> >>
> >>$ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
> >>-Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
> >>-Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
> >>-fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100
> >>-O1
> >>-ftree-vectorize
> >>
> >> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
> >> '-march=gfx90a' vs. '-march=gfx1100', we see:
> >>
> >>+builtin-bitops-1.c:7:17: missed:   reduc op not supported by
> >>target.
> >>
> >> ..., and therefore:
> >>
> >>-builtin-bitops-1.c:7:17: note:  Reduce using direct vector
> >>reduction.
> >>+builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
> >>+builtin-bitops-1.c:7:17: note:  extract scalar result
> >>
> >> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build
> >> a
> >> chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
> >> generated:
> >>
> >>$ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> >>i=1, ints[i]=0x1 a=1, b=2
> >>i=2, ints[i]=0x8000 a=1, b=2
> >>i=3, ints[i]=0x2 a=1, b=2
> >>i=4, ints[i]=0x4000 a=1, b=2
> >>i=5, ints[i]=0x1 a=1, b=2
> >>i=6, ints[i]=0x8000 a=1, b=2
> >>i=7, ints[i]=0xa5a5a5a5 a=16, b=32
> >>i=8, ints[i]=0x5a5a5a5a a=16, b=32
> >>i=9, ints[i]=0xcafe a=11, b=22
> >>i=10, ints[i]=0xcafe00 a=11, b=22
> >>i=11, ints[i]=0xcafe 

Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-14 Thread Andrew Stubbs

On 14/02/2024 13:43, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:


On 14/02/2024 13:27, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:


On 13/02/2024 08:26, Richard Biener wrote:

On Mon, 12 Feb 2024, Thomas Schwinge wrote:


Hi!

On 2023-10-20T12:51:03+0100, Andrew Stubbs  wrote:

I've committed this patch


... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
"amdgcn: add -march=gfx1030 EXPERIMENTAL".

The RDNA2 ISA variant doesn't support certain instructions previously
implemented in GCC/GCN, so a number of patterns etc. had to be disabled:


[...] Vector
reductions will need to be reworked for RDNA2.  [...]



   * config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
   (addc3): Add RDNA2 syntax variant.
   (subc3): Likewise.
   (2_exec): Add RDNA2 alternatives.
   (vec_cmpdi): Likewise.
   (vec_cmpdi): Likewise.
   (vec_cmpdi_exec): Likewise.
   (vec_cmpdi_exec): Likewise.
   (vec_cmpdi_dup): Likewise.
   (vec_cmpdi_dup_exec): Likewise.
   (reduc__scal_): Disable for RDNA2.
   (*_dpp_shr_): Likewise.
   (*plus_carry_dpp_shr_): Likewise.
   (*plus_carry_in_dpp_shr_): Likewise.


Etc.  The expectation being that GCC middle end copes with this, and
synthesizes some less ideal yet still functional vector code, I presume.

The later RDNA3/gfx1100 support builds on top of this, and that's what
I'm currently working on getting proper GCC/GCN target (not offloading)
results for.

I'm seeing a good number of execution test FAILs (regressions compared to
my earlier non-gfx1100 testing), and I've now tracked down where one
large class of those comes into existence -- not yet how to resolve,
unfortunately.  But maybe, with you guys' combined vectorizer and back
end experience, the latter will be done quickly?

Richard, I don't know if you've ever run actual GCC/GCN target (not
offloading) testing; let me know if you have any questions about that.


I've only done offload testing - in the x86_64 build tree run
check-target-libgomp.  If you can tell me how to do GCN target testing
(maybe document it on the wiki even!) I can try to do that as well.


Given that (at least largely?) the same patterns etc. are disabled as in
my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
same issues.  You can build GCC/GCN target like you build the offloading
one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
even use an offloading GCC/GCN build to reproduce the issue below.

One example is the attached 'builtin-bitops-1.c', reduced from
'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
miscompiled as soon as '-ftree-vectorize' is effective:

   $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
   -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
   -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
   -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100
   -O1
   -ftree-vectorize

In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
'-march=gfx90a' vs. '-march=gfx1100', we see:

   +builtin-bitops-1.c:7:17: missed:   reduc op not supported by
   target.

..., and therefore:

   -builtin-bitops-1.c:7:17: note:  Reduce using direct vector
   reduction.
   +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
   +builtin-bitops-1.c:7:17: note:  extract scalar result

That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
generated:

   $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
   i=1, ints[i]=0x1 a=1, b=2
   i=2, ints[i]=0x8000 a=1, b=2
   i=3, ints[i]=0x2 a=1, b=2
   i=4, ints[i]=0x4000 a=1, b=2
   i=5, ints[i]=0x1 a=1, b=2
   i=6, ints[i]=0x8000 a=1, b=2
   i=7, ints[i]=0xa5a5a5a5 a=16, b=32
   i=8, ints[i]=0x5a5a5a5a a=16, b=32
   i=9, ints[i]=0xcafe a=11, b=22
   i=10, ints[i]=0xcafe00 a=11, b=22
   i=11, ints[i]=0xcafe a=11, b=22
   i=12, ints[i]=0x a=32, b=64

(I can't tell if the 'b = 2 * a' pattern is purely coincidental?)

I don't speak enough "vectorization" to fully understand the generic
vectorized algorithm and its implementation.  It appears that the
"Reduce using vector shifts" code has been around for a very long time,
but also has gone through a number of changes.  I can't tell which GCC
targets/configurations it's actually used for (in the same way as for
GCN gfx1100), and thus whether there's an issue in that vectorizer code,
or rather in the GCN back end, or GCN back end parameterizing the generic
code?


The "shift" reduction is basically doing reduction by repeatedly
adding the upper to the lower half of the vector (each time halving
the vector size).


Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:

   int my_popcount (unsigned int x)
   {
 int stmp__12.12;
 vector(64) int vect__12.11;
 vector(64) 

Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-14 Thread Richard Biener
On Wed, 14 Feb 2024, Andrew Stubbs wrote:

> On 14/02/2024 13:27, Richard Biener wrote:
> > On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> > 
> >> On 13/02/2024 08:26, Richard Biener wrote:
> >>> On Mon, 12 Feb 2024, Thomas Schwinge wrote:
> >>>
>  Hi!
> 
>  On 2023-10-20T12:51:03+0100, Andrew Stubbs  wrote:
> > I've committed this patch
> 
>  ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>  "amdgcn: add -march=gfx1030 EXPERIMENTAL".
> 
>  The RDNA2 ISA variant doesn't support certain instructions previously
>  implemented in GCC/GCN, so a number of patterns etc. had to be disabled:
> 
> > [...] Vector
> > reductions will need to be reworked for RDNA2.  [...]
> 
> >   * config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
> >   (addc3): Add RDNA2 syntax variant.
> >   (subc3): Likewise.
> >   (2_exec): Add RDNA2 alternatives.
> >   (vec_cmpdi): Likewise.
> >   (vec_cmpdi): Likewise.
> >   (vec_cmpdi_exec): Likewise.
> >   (vec_cmpdi_exec): Likewise.
> >   (vec_cmpdi_dup): Likewise.
> >   (vec_cmpdi_dup_exec): Likewise.
> >   (reduc__scal_): Disable for RDNA2.
> >   (*_dpp_shr_): Likewise.
> >   (*plus_carry_dpp_shr_): Likewise.
> >   (*plus_carry_in_dpp_shr_): Likewise.
> 
>  Etc.  The expectation being that GCC middle end copes with this, and
>  synthesizes some less ideal yet still functional vector code, I presume.
> 
>  The later RDNA3/gfx1100 support builds on top of this, and that's what
>  I'm currently working on getting proper GCC/GCN target (not offloading)
>  results for.
> 
>  I'm seeing a good number of execution test FAILs (regressions compared to
>  my earlier non-gfx1100 testing), and I've now tracked down where one
>  large class of those comes into existence -- not yet how to resolve,
>  unfortunately.  But maybe, with you guys' combined vectorizer and back
>  end experience, the latter will be done quickly?
> 
>  Richard, I don't know if you've ever run actual GCC/GCN target (not
>  offloading) testing; let me know if you have any questions about that.
> >>>
> >>> I've only done offload testing - in the x86_64 build tree run
> >>> check-target-libgomp.  If you can tell me how to do GCN target testing
> >>> (maybe document it on the wiki even!) I can try to do that as well.
> >>>
>  Given that (at least largely?) the same patterns etc. are disabled as in
>  my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
>  same issues.  You can build GCC/GCN target like you build the offloading
>  one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
>  even use an offloading GCC/GCN build to reproduce the issue below.
> 
>  One example is the attached 'builtin-bitops-1.c', reduced from
>  'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
>  miscompiled as soon as '-ftree-vectorize' is effective:
> 
>    $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
>    -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
>    -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
>    -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100
>    -O1
>    -ftree-vectorize
> 
>  In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
>  '-march=gfx90a' vs. '-march=gfx1100', we see:
> 
>    +builtin-bitops-1.c:7:17: missed:   reduc op not supported by
>    target.
> 
>  ..., and therefore:
> 
>    -builtin-bitops-1.c:7:17: note:  Reduce using direct vector
>    reduction.
>    +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
>    +builtin-bitops-1.c:7:17: note:  extract scalar result
> 
>  That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
>  chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
>  generated:
> 
>    $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>    i=1, ints[i]=0x1 a=1, b=2
>    i=2, ints[i]=0x8000 a=1, b=2
>    i=3, ints[i]=0x2 a=1, b=2
>    i=4, ints[i]=0x4000 a=1, b=2
>    i=5, ints[i]=0x1 a=1, b=2
>    i=6, ints[i]=0x8000 a=1, b=2
>    i=7, ints[i]=0xa5a5a5a5 a=16, b=32
>    i=8, ints[i]=0x5a5a5a5a a=16, b=32
>    i=9, ints[i]=0xcafe a=11, b=22
>    i=10, ints[i]=0xcafe00 a=11, b=22
>    i=11, ints[i]=0xcafe a=11, b=22
>    i=12, ints[i]=0x a=32, b=64
> 
>  (I can't tell if the 'b = 2 * a' pattern is purely coincidental?)
> 
>  I don't speak enough "vectorization" to fully understand the generic
>  vectorized algorithm and its implementation.  It appears that the
>  "Reduce using vector shifts" code has been around for a very long time,
>  but also has gone 

Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-14 Thread Andrew Stubbs

On 14/02/2024 13:27, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:


On 13/02/2024 08:26, Richard Biener wrote:

On Mon, 12 Feb 2024, Thomas Schwinge wrote:


Hi!

On 2023-10-20T12:51:03+0100, Andrew Stubbs  wrote:

I've committed this patch


... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
"amdgcn: add -march=gfx1030 EXPERIMENTAL".

The RDNA2 ISA variant doesn't support certain instructions previously
implemented in GCC/GCN, so a number of patterns etc. had to be disabled:


[...] Vector
reductions will need to be reworked for RDNA2.  [...]



  * config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
  (addc3): Add RDNA2 syntax variant.
  (subc3): Likewise.
  (2_exec): Add RDNA2 alternatives.
  (vec_cmpdi): Likewise.
  (vec_cmpdi): Likewise.
  (vec_cmpdi_exec): Likewise.
  (vec_cmpdi_exec): Likewise.
  (vec_cmpdi_dup): Likewise.
  (vec_cmpdi_dup_exec): Likewise.
  (reduc__scal_): Disable for RDNA2.
  (*_dpp_shr_): Likewise.
  (*plus_carry_dpp_shr_): Likewise.
  (*plus_carry_in_dpp_shr_): Likewise.


Etc.  The expectation being that GCC middle end copes with this, and
synthesizes some less ideal yet still functional vector code, I presume.

The later RDNA3/gfx1100 support builds on top of this, and that's what
I'm currently working on getting proper GCC/GCN target (not offloading)
results for.

I'm seeing a good number of execution test FAILs (regressions compared to
my earlier non-gfx1100 testing), and I've now tracked down where one
large class of those comes into existence -- not yet how to resolve,
unfortunately.  But maybe, with you guys' combined vectorizer and back
end experience, the latter will be done quickly?

Richard, I don't know if you've ever run actual GCC/GCN target (not
offloading) testing; let me know if you have any questions about that.


I've only done offload testing - in the x86_64 build tree run
check-target-libgomp.  If you can tell me how to do GCN target testing
(maybe document it on the wiki even!) I can try to do that as well.


Given that (at least largely?) the same patterns etc. are disabled as in
my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
same issues.  You can build GCC/GCN target like you build the offloading
one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
even use an offloading GCC/GCN build to reproduce the issue below.

One example is the attached 'builtin-bitops-1.c', reduced from
'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
miscompiled as soon as '-ftree-vectorize' is effective:

  $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
  -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
  -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
  -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100 -O1
  -ftree-vectorize

In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
'-march=gfx90a' vs. '-march=gfx1100', we see:

  +builtin-bitops-1.c:7:17: missed:   reduc op not supported by target.

..., and therefore:

  -builtin-bitops-1.c:7:17: note:  Reduce using direct vector reduction.
  +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
  +builtin-bitops-1.c:7:17: note:  extract scalar result

That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
generated:

  $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
  i=1, ints[i]=0x1 a=1, b=2
  i=2, ints[i]=0x8000 a=1, b=2
  i=3, ints[i]=0x2 a=1, b=2
  i=4, ints[i]=0x4000 a=1, b=2
  i=5, ints[i]=0x1 a=1, b=2
  i=6, ints[i]=0x8000 a=1, b=2
  i=7, ints[i]=0xa5a5a5a5 a=16, b=32
  i=8, ints[i]=0x5a5a5a5a a=16, b=32
  i=9, ints[i]=0xcafe a=11, b=22
  i=10, ints[i]=0xcafe00 a=11, b=22
  i=11, ints[i]=0xcafe a=11, b=22
  i=12, ints[i]=0x a=32, b=64

(I can't tell if the 'b = 2 * a' pattern is purely coincidental?)

I don't speak enough "vectorization" to fully understand the generic
vectorized algorithm and its implementation.  It appears that the
"Reduce using vector shifts" code has been around for a very long time,
but also has gone through a number of changes.  I can't tell which GCC
targets/configurations it's actually used for (in the same way as for
GCN gfx1100), and thus whether there's an issue in that vectorizer code,
or rather in the GCN back end, or GCN back end parameterizing the generic
code?


The "shift" reduction is basically doing reduction by repeatedly
adding the upper to the lower half of the vector (each time halving
the vector size).


Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:

  int my_popcount (unsigned int x)
  {
int stmp__12.12;
vector(64) int vect__12.11;
vector(64) unsigned int vect__1.8;
vector(64) unsigned int _13;
vector(64) unsigned int vect_cst__18;
vector(64) int [all others];
  

Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-14 Thread Richard Biener
On Wed, 14 Feb 2024, Andrew Stubbs wrote:

> On 13/02/2024 08:26, Richard Biener wrote:
> > On Mon, 12 Feb 2024, Thomas Schwinge wrote:
> > 
> >> Hi!
> >>
> >> On 2023-10-20T12:51:03+0100, Andrew Stubbs  wrote:
> >>> I've committed this patch
> >>
> >> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> >> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
> >>
> >> The RDNA2 ISA variant doesn't support certain instructions previously
> >> implemented in GCC/GCN, so a number of patterns etc. had to be disabled:
> >>
> >>> [...] Vector
> >>> reductions will need to be reworked for RDNA2.  [...]
> >>
> >>>  * config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
> >>>  (addc3): Add RDNA2 syntax variant.
> >>>  (subc3): Likewise.
> >>>  (2_exec): Add RDNA2 alternatives.
> >>>  (vec_cmpdi): Likewise.
> >>>  (vec_cmpdi): Likewise.
> >>>  (vec_cmpdi_exec): Likewise.
> >>>  (vec_cmpdi_exec): Likewise.
> >>>  (vec_cmpdi_dup): Likewise.
> >>>  (vec_cmpdi_dup_exec): Likewise.
> >>>  (reduc__scal_): Disable for RDNA2.
> >>>  (*_dpp_shr_): Likewise.
> >>>  (*plus_carry_dpp_shr_): Likewise.
> >>>  (*plus_carry_in_dpp_shr_): Likewise.
> >>
> >> Etc.  The expectation being that GCC middle end copes with this, and
> >> synthesizes some less ideal yet still functional vector code, I presume.
> >>
> >> The later RDNA3/gfx1100 support builds on top of this, and that's what
> >> I'm currently working on getting proper GCC/GCN target (not offloading)
> >> results for.
> >>
> >> I'm seeing a good number of execution test FAILs (regressions compared to
> >> my earlier non-gfx1100 testing), and I've now tracked down where one
> >> large class of those comes into existence -- not yet how to resolve,
> >> unfortunately.  But maybe, with you guys' combined vectorizer and back
> >> end experience, the latter will be done quickly?
> >>
> >> Richard, I don't know if you've ever run actual GCC/GCN target (not
> >> offloading) testing; let me know if you have any questions about that.
> > 
> > I've only done offload testing - in the x86_64 build tree run
> > check-target-libgomp.  If you can tell me how to do GCN target testing
> > (maybe document it on the wiki even!) I can try to do that as well.
> > 
> >> Given that (at least largely?) the same patterns etc. are disabled as in
> >> my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
> >> same issues.  You can build GCC/GCN target like you build the offloading
> >> one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
> >> even use an offloading GCC/GCN build to reproduce the issue below.
> >>
> >> One example is the attached 'builtin-bitops-1.c', reduced from
> >> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
> >> miscompiled as soon as '-ftree-vectorize' is effective:
> >>
> >>  $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
> >>  -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
> >>  -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
> >>  -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100 -O1
> >>  -ftree-vectorize
> >>
> >> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
> >> '-march=gfx90a' vs. '-march=gfx1100', we see:
> >>
> >>  +builtin-bitops-1.c:7:17: missed:   reduc op not supported by target.
> >>
> >> ..., and therefore:
> >>
> >>  -builtin-bitops-1.c:7:17: note:  Reduce using direct vector reduction.
> >>  +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
> >>  +builtin-bitops-1.c:7:17: note:  extract scalar result
> >>
> >> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
> >> chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
> >> generated:
> >>
> >>  $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> >>  i=1, ints[i]=0x1 a=1, b=2
> >>  i=2, ints[i]=0x8000 a=1, b=2
> >>  i=3, ints[i]=0x2 a=1, b=2
> >>  i=4, ints[i]=0x4000 a=1, b=2
> >>  i=5, ints[i]=0x1 a=1, b=2
> >>  i=6, ints[i]=0x8000 a=1, b=2
> >>  i=7, ints[i]=0xa5a5a5a5 a=16, b=32
> >>  i=8, ints[i]=0x5a5a5a5a a=16, b=32
> >>  i=9, ints[i]=0xcafe a=11, b=22
> >>  i=10, ints[i]=0xcafe00 a=11, b=22
> >>  i=11, ints[i]=0xcafe a=11, b=22
> >>  i=12, ints[i]=0x a=32, b=64
> >>
> >> (I can't tell if the 'b = 2 * a' pattern is purely coincidental?)
> >>
> >> I don't speak enough "vectorization" to fully understand the generic
> >> vectorized algorithm and its implementation.  It appears that the
> >> "Reduce using vector shifts" code has been around for a very long time,
> >> but also has gone through a number of changes.  I can't tell which GCC
> >> targets/configurations it's actually used for (in the same way as for
> >> GCN gfx1100), and thus whether there's an issue in that vectorizer code,
> >> or rather in the GCN back end, or GCN back end parameterizing the generic
> >> code?
> > 
> > The "shift" reduction is basically doing reduction by repeatedly
> > 

Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-14 Thread Andrew Stubbs

On 13/02/2024 08:26, Richard Biener wrote:

On Mon, 12 Feb 2024, Thomas Schwinge wrote:


Hi!

On 2023-10-20T12:51:03+0100, Andrew Stubbs  wrote:

I've committed this patch


... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
"amdgcn: add -march=gfx1030 EXPERIMENTAL".

The RDNA2 ISA variant doesn't support certain instructions previously
implemented in GCC/GCN, so a number of patterns etc. had to be disabled:


[...] Vector
reductions will need to be reworked for RDNA2.  [...]



* config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
(addc3): Add RDNA2 syntax variant.
(subc3): Likewise.
(2_exec): Add RDNA2 alternatives.
(vec_cmpdi): Likewise.
(vec_cmpdi): Likewise.
(vec_cmpdi_exec): Likewise.
(vec_cmpdi_exec): Likewise.
(vec_cmpdi_dup): Likewise.
(vec_cmpdi_dup_exec): Likewise.
(reduc__scal_): Disable for RDNA2.
(*_dpp_shr_): Likewise.
(*plus_carry_dpp_shr_): Likewise.
(*plus_carry_in_dpp_shr_): Likewise.


Etc.  The expectation being that GCC middle end copes with this, and
synthesizes some less ideal yet still functional vector code, I presume.

The later RDNA3/gfx1100 support builds on top of this, and that's what
I'm currently working on getting proper GCC/GCN target (not offloading)
results for.

I'm seeing a good number of execution test FAILs (regressions compared to
my earlier non-gfx1100 testing), and I've now tracked down where one
large class of those comes into existence -- not yet how to resolve,
unfortunately.  But maybe, with you guys' combined vectorizer and back
end experience, the latter will be done quickly?

Richard, I don't know if you've ever run actual GCC/GCN target (not
offloading) testing; let me know if you have any questions about that.


I've only done offload testing - in the x86_64 build tree run
check-target-libgomp.  If you can tell me how to do GCN target testing
(maybe document it on the wiki even!) I can try to do that as well.


Given that (at least largely?) the same patterns etc. are disabled as in
my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
same issues.  You can build GCC/GCN target like you build the offloading
one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
even use an offloading GCC/GCN build to reproduce the issue below.

One example is the attached 'builtin-bitops-1.c', reduced from
'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
miscompiled as soon as '-ftree-vectorize' is effective:

 $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c 
-Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ 
-Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all -fdump-ipa-all-all 
-fdump-rtl-all-all -save-temps -march=gfx1100 -O1 -ftree-vectorize

In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
'-march=gfx90a' vs. '-march=gfx1100', we see:

 +builtin-bitops-1.c:7:17: missed:   reduc op not supported by target.

..., and therefore:

 -builtin-bitops-1.c:7:17: note:  Reduce using direct vector reduction.
 +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
 +builtin-bitops-1.c:7:17: note:  extract scalar result

That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
generated:

 $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
 i=1, ints[i]=0x1 a=1, b=2
 i=2, ints[i]=0x8000 a=1, b=2
 i=3, ints[i]=0x2 a=1, b=2
 i=4, ints[i]=0x4000 a=1, b=2
 i=5, ints[i]=0x1 a=1, b=2
 i=6, ints[i]=0x8000 a=1, b=2
 i=7, ints[i]=0xa5a5a5a5 a=16, b=32
 i=8, ints[i]=0x5a5a5a5a a=16, b=32
 i=9, ints[i]=0xcafe a=11, b=22
 i=10, ints[i]=0xcafe00 a=11, b=22
 i=11, ints[i]=0xcafe a=11, b=22
 i=12, ints[i]=0x a=32, b=64

(I can't tell if the 'b = 2 * a' pattern is purely coincidental?)

I don't speak enough "vectorization" to fully understand the generic
vectorized algorithm and its implementation.  It appears that the
"Reduce using vector shifts" code has been around for a very long time,
but also has gone through a number of changes.  I can't tell which GCC
targets/configurations it's actually used for (in the same way as for
GCN gfx1100), and thus whether there's an issue in that vectorizer code,
or rather in the GCN back end, or GCN back end parameterizing the generic
code?


The "shift" reduction is basically doing reduction by repeatedly
adding the upper to the lower half of the vector (each time halving
the vector size).


Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:

 int my_popcount (unsigned int x)
 {
   int stmp__12.12;
   vector(64) int vect__12.11;
   vector(64) unsigned int vect__1.8;
   vector(64) unsigned int _13;
   vector(64) unsigned int vect_cst__18;
   vector(64) int [all others];
 
[local count: 32534376]:

   vect_cst__18 = 

Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts" (was: [committed] amdgcn: add -march=gfx1030 EXPERIMENTAL)

2024-02-13 Thread Richard Biener
On Mon, 12 Feb 2024, Thomas Schwinge wrote:

> Hi!
> 
> On 2023-10-20T12:51:03+0100, Andrew Stubbs  wrote:
> > I've committed this patch
> 
> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
> 
> The RDNA2 ISA variant doesn't support certain instructions previously
> implemented in GCC/GCN, so a number of patterns etc. had to be disabled:
> 
> > [...] Vector
> > reductions will need to be reworked for RDNA2.  [...]
> 
> > * config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
> > (addc3): Add RDNA2 syntax variant.
> > (subc3): Likewise.
> > (2_exec): Add RDNA2 alternatives.
> > (vec_cmpdi): Likewise.
> > (vec_cmpdi): Likewise.
> > (vec_cmpdi_exec): Likewise.
> > (vec_cmpdi_exec): Likewise.
> > (vec_cmpdi_dup): Likewise.
> > (vec_cmpdi_dup_exec): Likewise.
> > (reduc__scal_): Disable for RDNA2.
> > (*_dpp_shr_): Likewise.
> > (*plus_carry_dpp_shr_): Likewise.
> > (*plus_carry_in_dpp_shr_): Likewise.
> 
> Etc.  The expectation being that GCC middle end copes with this, and
> synthesizes some less ideal yet still functional vector code, I presume.
> 
> The later RDNA3/gfx1100 support builds on top of this, and that's what
> I'm currently working on getting proper GCC/GCN target (not offloading)
> results for.
> 
> I'm seeing a good number of execution test FAILs (regressions compared to
> my earlier non-gfx1100 testing), and I've now tracked down where one
> large class of those comes into existence -- not yet how to resolve,
> unfortunately.  But maybe, with you guys' combined vectorizer and back
> end experience, the latter will be done quickly?
> 
> Richard, I don't know if you've ever run actual GCC/GCN target (not
> offloading) testing; let me know if you have any questions about that.

I've only done offload testing - in the x86_64 build tree run
check-target-libgomp.  If you can tell me how to do GCN target testing
(maybe document it on the wiki even!) I can try to do that as well.

> Given that (at least largely?) the same patterns etc. are disabled as in
> my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
> same issues.  You can build GCC/GCN target like you build the offloading
> one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
> even use an offloading GCC/GCN build to reproduce the issue below.
> 
> One example is the attached 'builtin-bitops-1.c', reduced from
> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
> miscompiled as soon as '-ftree-vectorize' is effective:
> 
> $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c 
> -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ 
> -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all 
> -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100 -O1 
> -ftree-vectorize
> 
> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
> '-march=gfx90a' vs. '-march=gfx1100', we see:
> 
> +builtin-bitops-1.c:7:17: missed:   reduc op not supported by target.
> 
> ..., and therefore:
> 
> -builtin-bitops-1.c:7:17: note:  Reduce using direct vector reduction.
> +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
> +builtin-bitops-1.c:7:17: note:  extract scalar result
> 
> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
> chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
> generated:
> 
> $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> i=1, ints[i]=0x1 a=1, b=2
> i=2, ints[i]=0x8000 a=1, b=2
> i=3, ints[i]=0x2 a=1, b=2
> i=4, ints[i]=0x4000 a=1, b=2
> i=5, ints[i]=0x1 a=1, b=2
> i=6, ints[i]=0x8000 a=1, b=2
> i=7, ints[i]=0xa5a5a5a5 a=16, b=32
> i=8, ints[i]=0x5a5a5a5a a=16, b=32
> i=9, ints[i]=0xcafe a=11, b=22
> i=10, ints[i]=0xcafe00 a=11, b=22
> i=11, ints[i]=0xcafe a=11, b=22
> i=12, ints[i]=0x a=32, b=64
> 
> (I can't tell if the 'b = 2 * a' pattern is purely coincidental?)
> 
> I don't speak enough "vectorization" to fully understand the generic
> vectorized algorithm and its implementation.  It appears that the
> "Reduce using vector shifts" code has been around for a very long time,
> but also has gone through a number of changes.  I can't tell which GCC
> targets/configurations it's actually used for (in the same way as for
> GCN gfx1100), and thus whether there's an issue in that vectorizer code,
> or rather in the GCN back end, or GCN back end parameterizing the generic
> code?

The "shift" reduction is basically doing reduction by repeatedly
adding the upper to the lower half of the vector (each time halving
the vector size).

> Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:
> 
> int my_popcount (unsigned int x)
> {
>   int stmp__12.12;
>   vector(64) int vect__12.11;
>   vector(64) unsigned int vect__1.8;
>   vector(64) unsigned int _13;
>   vector(64) unsigned 

GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts" (was: [committed] amdgcn: add -march=gfx1030 EXPERIMENTAL)

2024-02-12 Thread Thomas Schwinge
Hi!

On 2023-10-20T12:51:03+0100, Andrew Stubbs  wrote:
> I've committed this patch

... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
"amdgcn: add -march=gfx1030 EXPERIMENTAL".

The RDNA2 ISA variant doesn't support certain instructions previously
implemented in GCC/GCN, so a number of patterns etc. had to be disabled:

> [...] Vector
> reductions will need to be reworked for RDNA2.  [...]

>   * config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
>   (addc3): Add RDNA2 syntax variant.
>   (subc3): Likewise.
>   (2_exec): Add RDNA2 alternatives.
>   (vec_cmpdi): Likewise.
>   (vec_cmpdi): Likewise.
>   (vec_cmpdi_exec): Likewise.
>   (vec_cmpdi_exec): Likewise.
>   (vec_cmpdi_dup): Likewise.
>   (vec_cmpdi_dup_exec): Likewise.
>   (reduc__scal_): Disable for RDNA2.
>   (*_dpp_shr_): Likewise.
>   (*plus_carry_dpp_shr_): Likewise.
>   (*plus_carry_in_dpp_shr_): Likewise.

Etc.  The expectation being that GCC middle end copes with this, and
synthesizes some less ideal yet still functional vector code, I presume.

The later RDNA3/gfx1100 support builds on top of this, and that's what
I'm currently working on getting proper GCC/GCN target (not offloading)
results for.

I'm seeing a good number of execution test FAILs (regressions compared to
my earlier non-gfx1100 testing), and I've now tracked down where one
large class of those comes into existence -- not yet how to resolve,
unfortunately.  But maybe, with you guys' combined vectorizer and back
end experience, the latter will be done quickly?

Richard, I don't know if you've ever run actual GCC/GCN target (not
offloading) testing; let me know if you have any questions about that.
Given that (at least largely?) the same patterns etc. are disabled as in
my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
same issues.  You can build GCC/GCN target like you build the offloading
one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
even use an offloading GCC/GCN build to reproduce the issue below.

One example is the attached 'builtin-bitops-1.c', reduced from
'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
miscompiled as soon as '-ftree-vectorize' is effective:

$ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c 
-Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ 
-Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all -fdump-ipa-all-all 
-fdump-rtl-all-all -save-temps -march=gfx1100 -O1 -ftree-vectorize
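
(The attachment isn't included in this archived copy of the thread.  For
orientation, 'my_popcount' in the torture test it was reduced from is
essentially the following bit-by-bit counting loop, which matches the
shift/mask/sum shape visible in the dump excerpt below -- an approximate
reconstruction, not the exact reduced file:)

  /* Approximate reconstruction; the real reduced test case may differ in
     details such as attributes and the surrounding checking harness.  */
  int
  my_popcount (unsigned int x)
  {
    int i, count = 0;
    for (i = 0; i < 8 * (int) sizeof (x); i++)
      if (x & ((unsigned int) 1 << i))
        count++;
    return count;
  }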

In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
'-march=gfx90a' vs. '-march=gfx1100', we see:

+builtin-bitops-1.c:7:17: missed:   reduc op not supported by target.

..., and therefore:

-builtin-bitops-1.c:7:17: note:  Reduce using direct vector reduction.
+builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
+builtin-bitops-1.c:7:17: note:  extract scalar result

That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
generated:

$ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
i=1, ints[i]=0x1 a=1, b=2
i=2, ints[i]=0x8000 a=1, b=2
i=3, ints[i]=0x2 a=1, b=2
i=4, ints[i]=0x4000 a=1, b=2
i=5, ints[i]=0x1 a=1, b=2
i=6, ints[i]=0x8000 a=1, b=2
i=7, ints[i]=0xa5a5a5a5 a=16, b=32
i=8, ints[i]=0x5a5a5a5a a=16, b=32
i=9, ints[i]=0xcafe a=11, b=22
i=10, ints[i]=0xcafe00 a=11, b=22
i=11, ints[i]=0xcafe a=11, b=22
i=12, ints[i]=0x a=32, b=64

(I can't tell if the 'b = 2 * a' pattern is purely coincidental?)

I don't speak enough "vectorization" to fully understand the generic
vectorized algorithm and its implementation.  It appears that the
"Reduce using vector shifts" code has been around for a very long time,
but also has gone through a number of changes.  I can't tell which GCC
targets/configurations it's actually used for (in the same way as for
GCN gfx1100), and thus whether there's an issue in that vectorizer code,
or rather in the GCN back end, or GCN back end parameterizing the generic
code?

Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:

int my_popcount (unsigned int x)
{
  int stmp__12.12;
  vector(64) int vect__12.11;
  vector(64) unsigned int vect__1.8;
  vector(64) unsigned int _13;
  vector(64) unsigned int vect_cst__18;
  vector(64) int [all others];

   [local count: 32534376]:
  vect_cst__18 = { [all 'x_8(D)'] };
  vect__1.8_19 = vect_cst__18 >> { 0, 1, 2, [...], 61, 62, 63 };
  _13 = .COND_AND ({ [32 x '-1'], [32 x '0'] }, vect__1.8_19, { [all '1'] }, { [all '0'] });
  vect__12.11_24 = VIEW_CONVERT_EXPR(_13);
  _26 = VEC_PERM_EXPR ;
  _27 = vect__12.11_24 + _26;
  _28 = VEC_PERM_EXPR <_27, { [all '0'] }, { 16, 17, 18, [...], 77, 78, 79 }>;
  _29 = _27 + _28;
  _30 = VEC_PERM_EXPR <_29, { [all '0'] }, { 8, 9, 10, [...], 69, 70,