Re: [RFC] Teach vectorizer to deal with bitfield reads

2022-10-12 Thread Eric Botcazou via Gcc-patches
> Let me know if you believe this is a good approach? I've run regression
> tests and this hasn't broken anything so far...

Small regression in Ada though, probably a missing guard somewhere:

=== gnat tests ===


Running target unix
FAIL: gnat.dg/loop_optimization23.adb 3 blank line(s) in output
FAIL: gnat.dg/loop_optimization23.adb (test for excess errors)
UNRESOLVED: gnat.dg/loop_optimization23.adb compilation failed to produce executable
FAIL: gnat.dg/loop_optimization23_pkg.adb 3 blank line(s) in output
FAIL: gnat.dg/loop_optimization23_pkg.adb (test for excess errors)

In order to reproduce, configure the compiler with Ada enabled, build it, and 
copy $(srcdir)/gcc/testsuite/gnat.dg/loop_optimization23_pkg.ad[sb] into the 
build directory, then just issue:

gcc/gnat1 -quiet loop_optimization23_pkg.adb -O2 -Igcc/ada/rts

eric@fomalhaut:~/build/gcc/native> gcc/gnat1 -quiet loop_optimization23_pkg.adb -O2 -Igcc/ada/rts
during GIMPLE pass: vect
+===GNAT BUG DETECTED==+
| 13.0.0 20221012 (experimental) [master ca7f7c3f140] (x86_64-suse-linux) GCC error:|
| in exact_div, at poly-int.h:2232 |
| Error detected around loop_optimization23_pkg.adb:5:3|
| Compiling loop_optimization23_pkg.adb

-- 
Eric Botcazou




Re: [RFC] Teach vectorizer to deal with bitfield reads

2022-08-01 Thread Richard Biener via Gcc-patches
On Mon, 1 Aug 2022, Andre Vieira (lists) wrote:

> 
> On 29/07/2022 11:52, Richard Biener wrote:
> > On Fri, 29 Jul 2022, Jakub Jelinek wrote:
> >
> >> On Fri, Jul 29, 2022 at 09:57:29AM +0100, Andre Vieira (lists) via
> >> Gcc-patches wrote:
> >>> The 'only on the vectorized code path' remains the same though as
> >>> vect_recog also only happens on the vectorized code path right?
> >> if conversion (in some cases) duplicates a loop and guards one copy with
> >> an ifn which resolves to true if that particular loop is vectorized and
> >> false otherwise.  So, then changes that shouldn't be done in case of
> >> vectorization failure can be done on the for vectorizer only copy of the
> >> loop.
> > And just to mention, one issue with lowering of bitfield accesses
> > is bitfield inserts which, on some architectures (hello m68k) have
> > instructions operating on memory directly.  For those it's difficult
> > to not regress in code quality if a bitfield store becomes a
> > read-modify-write cycle.  That's one of the things holding this
> > back.  One idea would be to lower to .INSV directly for those targets
> > (but presence of insv isn't necessarily indicating support for
> > memory destinations).
> >
> > Richard.
> Should I account for that when vectorizing though? From what I can tell (no
> TARGET_VECTOR_* hooks implemented) m68k does not have vectorization support.

No.

> So the question is, are there currently any targets that vectorize and have
> vector bitfield-insert/extract support? If they don't exist I suggest we worry
> about it when it comes around, if not just for the fact that we wouldn't be
> able to test it right now.
> 
> If this is about not lowering on the non-vectorized path, see my previous
> reply, I never intended to do that in the vectorizer. I just thought it was
> the plan to do lowering eventually.

Yes, for the vectorized path none of this is an issue - and btw the
advantage with if-conversion is that you get VN (value numbering) of the
result "for free": the RMW cycles of bitfield stores likely have reads to
share (and also redundant stores in the end, but ...).

And yes, the plan was to do lowering generally.  It's just that the
simplistic approaches (my last one was a lowering pass sometime after IPA,
IIRC combined with SRA) run into some issues, like that on m68k, but IIRC
also some others.  So I wouldn't hold my breath, but then somebody just
needs to do the work and think about how to deal with m68k and the
like...

Richard.


Re: [RFC] Teach vectorizer to deal with bitfield reads

2022-08-01 Thread Andre Vieira (lists) via Gcc-patches



On 29/07/2022 11:52, Richard Biener wrote:
> On Fri, 29 Jul 2022, Jakub Jelinek wrote:
>
>> On Fri, Jul 29, 2022 at 09:57:29AM +0100, Andre Vieira (lists) via
>> Gcc-patches wrote:
>>> The 'only on the vectorized code path' remains the same though as
>>> vect_recog also only happens on the vectorized code path right?
>> if conversion (in some cases) duplicates a loop and guards one copy with
>> an ifn which resolves to true if that particular loop is vectorized and
>> false otherwise.  So, then changes that shouldn't be done in case of
>> vectorization failure can be done on the for vectorizer only copy of the
>> loop.
> And just to mention, one issue with lowering of bitfield accesses
> is bitfield inserts which, on some architectures (hello m68k) have
> instructions operating on memory directly.  For those it's difficult
> to not regress in code quality if a bitfield store becomes a
> read-modify-write cycle.  That's one of the things holding this
> back.  One idea would be to lower to .INSV directly for those targets
> (but presence of insv isn't necessarily indicating support for
> memory destinations).
>
> Richard.

Should I account for that when vectorizing though? From what I can tell
(no TARGET_VECTOR_* hooks implemented) m68k does not have vectorization
support. So the question is, are there currently any targets that
vectorize and have vector bitfield-insert/extract support? If they don't
exist I suggest we worry about it when it comes around, if not just for
the fact that we wouldn't be able to test it right now.

If this is about not lowering on the non-vectorized path, see my
previous reply, I never intended to do that in the vectorizer. I just
thought it was the plan to do lowering eventually.




Re: [RFC] Teach vectorizer to deal with bitfield reads

2022-08-01 Thread Andre Vieira (lists) via Gcc-patches



On 29/07/2022 11:31, Jakub Jelinek wrote:
> On Fri, Jul 29, 2022 at 09:57:29AM +0100, Andre Vieira (lists) via
> Gcc-patches wrote:
>> The 'only on the vectorized code path' remains the same though as
>> vect_recog also only happens on the vectorized code path right?
> if conversion (in some cases) duplicates a loop and guards one copy with
> an ifn which resolves to true if that particular loop is vectorized and
> false otherwise.  So, then changes that shouldn't be done in case of
> vectorization failure can be done on the for vectorizer only copy of the
> loop.
>
> Jakub

I'm pretty sure vect_recog patterns have no effect on scalar codegen if
the vectorization fails too. The patterns live as new vect_stmt_info's
and no changes are actually done to the scalar loop. That was the point
I was trying to make, but it doesn't matter that much, as I said I am
happy to do this in if convert.




Re: [RFC] Teach vectorizer to deal with bitfield reads

2022-07-29 Thread Richard Biener via Gcc-patches
On Fri, 29 Jul 2022, Jakub Jelinek wrote:

> On Fri, Jul 29, 2022 at 09:57:29AM +0100, Andre Vieira (lists) via 
> Gcc-patches wrote:
> > The 'only on the vectorized code path' remains the same though as vect_recog
> > also only happens on the vectorized code path right?
> 
> if conversion (in some cases) duplicates a loop and guards one copy with
> an ifn which resolves to true if that particular loop is vectorized and
> false otherwise.  So, then changes that shouldn't be done in case of
> vectorization failure can be done on the for vectorizer only copy of the
> loop.

And just to mention, one issue with lowering of bitfield accesses
is bitfield inserts which, on some architectures (hello m68k) have
instructions operating on memory directly.  For those it's difficult
to not regress in code quality if a bitfield store becomes a
read-modify-write cycle.  That's one of the things holding this
back.  One idea would be to lower to .INSV directly for those targets
(but presence of insv isn't necessarily indicating support for
memory destinations).

Richard.


Re: [RFC] Teach vectorizer to deal with bitfield reads

2022-07-29 Thread Jakub Jelinek via Gcc-patches
On Fri, Jul 29, 2022 at 09:57:29AM +0100, Andre Vieira (lists) via Gcc-patches 
wrote:
> The 'only on the vectorized code path' remains the same though as vect_recog
> also only happens on the vectorized code path right?

if-conversion (in some cases) duplicates a loop and guards one copy with
an ifn (internal function) which resolves to true if that particular loop
is vectorized and false otherwise.  So changes that shouldn't be made if
vectorization fails can be made on the vectorizer-only copy of the
loop.
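
A rough C model of that loop versioning, for illustration only. The
`loop_vectorized` parameter is a hypothetical stand-in for the guard that the
compiler resolves itself, and `read_field` is an assumed accessor for a 31-bit
field in the low bits of a 32-bit word; neither comes from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar accessor standing in for the original bit-field
   read: a 31-bit field assumed to sit in the low bits of a 32-bit word.  */
static uint32_t
read_field (uint32_t word)
{
  return word & UINT32_C (0x7fffffff);
}

/* Two copies of the same loop behind one guard.  In the compiler the
   guard is resolved after vectorization; here it is a plain parameter.  */
static uint32_t
sum_fields (const uint32_t *words, size_t n, bool loop_vectorized)
{
  assert (words != NULL || n == 0);
  uint32_t sum = 0;

  if (loop_vectorized)
    {
      /* Vectorizer-only copy: the access is rewritten into a full-word
         load plus an explicit mask.  */
      for (size_t i = 0; i < n; i++)
        sum += words[i] & UINT32_C (0x7fffffff);
    }
  else
    {
      /* Scalar copy: left in its original form.  */
      for (size_t i = 0; i < n; i++)
        sum += read_field (words[i]);
    }
  return sum;
}
```

Both copies must compute the same result; only the form of the access
differs, so a failed vectorization costs nothing on the scalar side.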

Jakub



Re: [RFC] Teach vectorizer to deal with bitfield reads

2022-07-29 Thread Richard Biener via Gcc-patches
On Fri, 29 Jul 2022, Andre Vieira (lists) wrote:

> Hi Richard,
> 
> Thanks for the review, I don't completely understand all of the below, so I
> added some extra questions to help me understand :)
> 
> On 27/07/2022 12:37, Richard Biener wrote:
> > On Tue, 26 Jul 2022, Andre Vieira (lists) wrote:
> >
> > I don't think this is a good approach for what you gain and how
> > necessarily limited it will be.  Similar to the recent experiment with
> > handling _Complex loads/stores this is much better tackled by lowering
> > things earlier (it will be lowered at RTL expansion time).
> I assume the approach you are referring to here is the lowering of the
> BIT_FIELD_DECL to BIT_FIELD_REF in the vect_recog part of the vectorizer. I am
> all for lowering earlier, the reason I did it there was as a 'temporary'
> approach until we have that earlier lowering.

I understood, but "temporary" things in GCC tend to be still around
10 years later, so ...

> >
> > One place to do this experimentation would be to piggy-back on the
> > if-conversion pass so the lowering would happen only on the
> > vectorized code path.
> This was one of my initial thoughts, though the if-conversion changes are a
> bit more intrusive for a temporary approach and not that much earlier. It does
> however have the added benefit of not having to make any changes to the
> vectorizer itself later if we do do the earlier lowering, assuming the
> lowering results in the same.
> 
> The 'only on the vectorized code path' remains the same though as vect_recog
> also only happens on the vectorized code path right?
> >   Note that the desired lowering would look like
> > the following for reads:
> >
> >    _1 = a.b;
> >
> > to
> >
> >    _2 = a.<representative>;
> >    _1 = BIT_FIELD_REF <_2, ...>; // extract bits
> I don't yet have a well formed idea of what '<representative>' is
> supposed to look like in terms of tree expressions. I understand what it's
> supposed to be representing, the 'larger than bit-field'-load. But is it going
> to be a COMPONENT_REF with a fake 'FIELD_DECL' with the larger size? Like I
> said on IRC, the description of BIT_FIELD_REF makes it sound like this isn't
> how we are supposed to use it, are we intending to make a change to that here?

<representative> is what DECL_BIT_FIELD_REPRESENTATIVE
(FIELD_DECL-for-b) gives you, it is a "fake" FIELD_DECL for the underlying
storage, conveniently made available to you by the middle-end.  For
your 31bit field it would be simply 'int' typed.

The BIT_FIELD_REF then extracts the actual bitfield from the underlying
storage, but it's now no longer operating on memory but on the register
holding the underlying data.  To the vectorizer we'd probably have to
pattern-match this to shifts & masks and hope for the conversion to
combine with a later one.
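
To make the "shifts & masks" form concrete, here is a minimal C sketch,
assuming the underlying storage is a 32-bit word already loaded into a
register. The function name and the `pos`/`width` parameters are
illustrative assumptions, not GCC names, and real bit-field layout is
target-dependent.

```c
#include <assert.h>
#include <stdint.h>

/* Extract WIDTH bits starting at bit POS from WORD, which models the
   underlying storage after it has been loaded in full.  Only the
   unsigned case is modelled; a signed field would also need sign
   extension.  */
static uint32_t
extract_bits (uint32_t word, unsigned pos, unsigned width)
{
  assert (width >= 1 && width <= 32 && pos <= 32 - width);
  /* Avoid shifting by the full type width, which is undefined in C.  */
  uint32_t mask = width == 32 ? UINT32_MAX : (UINT32_C (1) << width) - 1;
  return (word >> pos) & mask;
}
```

For the 31-bit field discussed here, `extract_bits (word, 0, 31)` would be
one plausible instance, assuming the field starts at bit 0.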

> > and for writes:
> >
> >    a.b = _1;
> >
> > to
> >
> >    _2 = a.<representative>;
> >    _3 = BIT_INSERT_EXPR <_2, _1, ...>; // insert bits
> >    a.<representative> = _3;
> I was going to avoid writes for now because they are somewhat more
> complicated, but maybe it's not that bad, I'll add them too.

Handling only loads at the start is probably fine as an experiment, but
handling stores should be straightforward - of course the
BIT_INSERT_EXPR lowering to shifts & masks will be more
complicated.
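
A minimal C sketch of the store side, under the same assumptions as for
reads: a 32-bit underlying storage word, with hypothetical `pos`/`width`
parameters. It shows why a lowered bit-field store becomes a
read-modify-write sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Replace WIDTH bits at bit POS of WORD with the low bits of VALUE.  */
static uint32_t
insert_bits (uint32_t word, uint32_t value, unsigned pos, unsigned width)
{
  assert (width >= 1 && width <= 32 && pos <= 32 - width);
  uint32_t mask = width == 32 ? UINT32_MAX : (UINT32_C (1) << width) - 1;
  /* Clear the destination bits, then OR in the truncated new value.  */
  return (word & ~(mask << pos)) | ((value & mask) << pos);
}

/* A bit-field store expressed as load, insert, store.  */
static void
store_field (uint32_t *storage, uint32_t value, unsigned pos, unsigned width)
{
  uint32_t old_word = *storage;                              /* read   */
  uint32_t new_word = insert_bits (old_word, value, pos, width); /* modify */
  *storage = new_word;                                       /* write  */
}
```

Compared with extraction, the insert needs two masks and two extra logical
operations, which matches the "more complicated" remark above.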

> > so you trade now handled loads/stores with not handled
> > BIT_FIELD_REF / BIT_INSERT_EXPR which you would then need to
> > pattern match to shifts and logical ops in the vectorizer.
> Yeah that vect_recog pattern already exists in my RFC patch, though I can
> probably simplify it by moving the bit-field-ref stuff to ifcvt.
> >
> > There's a separate thing of actually promoting all uses, for
> > example
> >
> > struct { long long x : 33; } a;
> >
> >   a.x = a.x + 1;
> >
> > will get you 33bit precision adds (for bit fields less than 32bits
> > they get promoted to int but not for larger bit fields).  RTL
> > expansion again will rewrite this into larger ops plus masking.
> Not sure I understand why this is relevant here? The current way I am doing
> this would likely lower a bit-field like that to a 64-bit load followed by
> the masking away of the top 31 bits, same would happen with an ifcvt-lowering
> approach.

Yes, but if there's anything besides loading or storing you will have
a conversion from, say int:31 to int in the IL before any arithmetic.
I've not looked but your patch probably handles conversions to/from
bitfield types by masking / extending.  What I've mentioned with the
33bit example is that with that you can have arithmetic in 33 bits
_without_ intermediate conversions, so you'd have to properly truncate
after every such operation (or choose not to vectorize which I think
is what would happen now).
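
A small C sketch of why the truncation matters. For simplicity it models an
unsigned 33-bit field held in a 64-bit register, whereas the example above is
signed, and all names in it are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

enum { FIELD_BITS = 33 };
#define FIELD_MASK ((UINT64_C (1) << FIELD_BITS) - 1)

/* Reduce a 64-bit value to 33-bit precision.  */
static uint64_t
truncate33 (uint64_t v)
{
  return v & FIELD_MASK;
}

/* (field + 1) >> 1 with 33-bit semantics: truncate after each operation.  */
static uint64_t
inc_then_halve (uint64_t field)
{
  assert (field <= FIELD_MASK);
  uint64_t sum = truncate33 (field + 1);
  return truncate33 (sum >> 1);
}

/* The same computation done naively in 64 bits: the carry out of bit 32
   survives the add and leaks into the shifted result.  */
static uint64_t
inc_then_halve_untruncated (uint64_t field)
{
  return (field + 1) >> 1;
}
```

With the field at its maximum value the truncating version wraps to 0, while
the naive 64-bit version yields 2^32.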

> >
> > So I think the time is better spent in working on the lowering of
> > bitfield accesses, if sufficiently separated it could be used
> > from if-conversion by working on loop SEME regions.
> I will start to look at modifying ifcvt to add the lowering there. Will likely
> require two pass though 

Re: [RFC] Teach vectorizer to deal with bitfield reads

2022-07-29 Thread Andre Vieira (lists) via Gcc-patches

Hi Richard,

Thanks for the review, I don't completely understand all of the below,
so I added some extra questions to help me understand :)

On 27/07/2022 12:37, Richard Biener wrote:
> On Tue, 26 Jul 2022, Andre Vieira (lists) wrote:
>
> I don't think this is a good approach for what you gain and how
> necessarily limited it will be.  Similar to the recent experiment with
> handling _Complex loads/stores this is much better tackled by lowering
> things earlier (it will be lowered at RTL expansion time).

I assume the approach you are referring to here is the lowering of the
BIT_FIELD_DECL to BIT_FIELD_REF in the vect_recog part of the
vectorizer. I am all for lowering earlier, the reason I did it there was
as a 'temporary' approach until we have that earlier lowering.

> One place to do this experimentation would be to piggy-back on the
> if-conversion pass so the lowering would happen only on the
> vectorized code path.

This was one of my initial thoughts, though the if-conversion changes
are a bit more intrusive for a temporary approach and not that much
earlier. It does however have the added benefit of not having to make
any changes to the vectorizer itself later if we do do the earlier
lowering, assuming the lowering results in the same.

The 'only on the vectorized code path' remains the same though as
vect_recog also only happens on the vectorized code path right?

> Note that the desired lowering would look like
> the following for reads:
>
>    _1 = a.b;
>
> to
>
>    _2 = a.<representative>;
>    _1 = BIT_FIELD_REF <_2, ...>; // extract bits

I don't yet have a well formed idea of what '<representative>' is
supposed to look like in terms of tree expressions. I understand what
it's supposed to be representing, the 'larger than bit-field'-load. But
is it going to be a COMPONENT_REF with a fake 'FIELD_DECL' with the
larger size? Like I said on IRC, the description of BIT_FIELD_REF makes
it sound like this isn't how we are supposed to use it, are we intending
to make a change to that here?

> and for writes:
>
>    a.b = _1;
>
> to
>
>    _2 = a.<representative>;
>    _3 = BIT_INSERT_EXPR <_2, _1, ...>; // insert bits
>    a.<representative> = _3;

I was going to avoid writes for now because they are somewhat more
complicated, but maybe it's not that bad, I'll add them too.

> so you trade now handled loads/stores with not handled
> BIT_FIELD_REF / BIT_INSERT_EXPR which you would then need to
> pattern match to shifts and logical ops in the vectorizer.

Yeah that vect_recog pattern already exists in my RFC patch, though I
can probably simplify it by moving the bit-field-ref stuff to ifcvt.

> There's a separate thing of actually promoting all uses, for
> example
>
> struct { long long x : 33; } a;
>
>   a.x = a.x + 1;
>
> will get you 33bit precision adds (for bit fields less than 32bits
> they get promoted to int but not for larger bit fields).  RTL
> expansion again will rewrite this into larger ops plus masking.

Not sure I understand why this is relevant here? The current way I am
doing this would likely lower a bit-field like that to a 64-bit load
followed by the masking away of the top 31 bits, same would happen with
an ifcvt-lowering approach.

> So I think the time is better spent in working on the lowering of
> bitfield accesses, if sufficiently separated it could be used
> from if-conversion by working on loop SEME regions.

I will start to look at modifying ifcvt to add the lowering there. It will
likely require two passes though, because we can no longer look at the
number of BBs to determine whether ifcvt is even needed, so we will
first need to look for bit-field-decls, then version the loops and then
look for them again for transformation, but I guess that's fine?

> The patches
> doing previous implementations are probably not too useful anymore
> (I find one from 2011 and one from 2016, both pre-dating BIT_INSERT_EXPR)
>
> Richard.


Re: [RFC] Teach vectorizer to deal with bitfield reads

2022-07-27 Thread Richard Biener via Gcc-patches
On Tue, 26 Jul 2022, Andre Vieira (lists) wrote:

> Hi,
> 
> This is an RFC for my prototype for bitfield read vectorization. This patch
> enables bit-field read vectorization by removing the rejection of bit-field
> reads during DR analysis and by adding two vect patterns. The first one
> transforms TREE_COMPONENTs with BIT_FIELD_DECLs into BIT_FIELD_REFs, this
> is a temporary one as I believe there are plans to do this lowering earlier in
> the compiler. To avoid having to wait for that to happen we decided to
> implement this temporary vect pattern.
> The second one looks for conversions of the result of BIT_FIELD_REFs from a
> 'partial' type to a 'full-type' and transforms it into a 'full-type' load
> followed by the necessary shifting and masking.
> 
> The patch is not perfect, one thing I'd like to change for instance is the way
> the 'full-type' load is represented. I currently abuse the fact that the
> vectorizer transforms the original TREE_COMPONENT with a BIT_FIELD_DECL into a
> full-type vector load, because it uses the smallest mode necessary for that
> precision. The reason I do this is because I was struggling to construct a
> MEM_REF that the vectorizer would accept and this for some reason seemed to
> work ... I'd appreciate some pointers on how to do this properly :)
> 
> Another aspect that I haven't paid much attention to yet is costing, I've
> noticed some testcases fail to vectorize due to costing where I think it might
> be wrong, but like I said, I haven't paid much attention to it.
> 
> Finally another aspect I'd eventually like to tackle is the sinking of the
> masking when possible, for instance in bit-field-read-3.c the 'masking' does
> not need to be inside the loop because we are doing bitwise operations. Though
> I suspect we are better off looking at things like this elsewhere, maybe where
> we do codegen for the reduction... Haven't looked at this at all yet.
> 
> Let me know if you believe this is a good approach? I've run regression tests
> and this hasn't broken anything so far...
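
For reference, the mask-sinking idea from the quoted message can be sketched
in C as follows. This assumes a bitwise-OR reduction over fields that all sit
at the same position; it is not based on the contents of
bit-field-read-3.c, and the function names are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift and mask every element inside the loop.  */
static uint32_t
or_fields_mask_inside (const uint32_t *words, size_t n,
                       unsigned pos, uint32_t mask)
{
  assert (pos < 32);
  uint32_t acc = 0;
  for (size_t i = 0; i < n; i++)
    acc |= (words[i] >> pos) & mask;
  return acc;
}

/* OR the full words in the loop, then shift and mask once afterwards.
   Valid because shifting and masking distribute over bitwise OR.  */
static uint32_t
or_fields_mask_sunk (const uint32_t *words, size_t n,
                     unsigned pos, uint32_t mask)
{
  assert (pos < 32);
  uint32_t acc = 0;
  for (size_t i = 0; i < n; i++)
    acc |= words[i];
  return (acc >> pos) & mask;
}
```

The same rewrite would not be valid for an additive reduction, where carries
from neighbouring bits can reach the field.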

I don't think this is a good approach for what you gain and how 
necessarily limited it will be.  Similar to the recent experiment with
handling _Complex loads/stores this is much better tackled by lowering
things earlier (it will be lowered at RTL expansion time).

One place to do this experimentation would be to piggy-back on the
if-conversion pass so the lowering would happen only on the
vectorized code path.  Note that the desired lowering would look like
the following for reads:

  _1 = a.b;

to

  _2 = a.<representative>;
  _1 = BIT_FIELD_REF <_2, ...>; // extract bits

and for writes:

  a.b = _1;

to

  _2 = a.<representative>;
  _3 = BIT_INSERT_EXPR <_2, _1, ...>; // insert bits
  a.<representative> = _3;

so you trade now handled loads/stores with not handled
BIT_FIELD_REF / BIT_INSERT_EXPR which you would then need to
pattern match to shifts and logical ops in the vectorizer.

There's a separate thing of actually promoting all uses, for
example

struct { long long x : 33; } a;

 a.x = a.x + 1;

will get you 33bit precision adds (for bit fields less than 32bits
they get promoted to int but not for larger bit fields).  RTL
expansion again will rewrite this into larger ops plus masking.

So I think the time is better spent in working on the lowering of
bitfield accesses, if sufficiently separated it could be used
from if-conversion by working on loop SEME regions.  The patches
doing previous implementations are probably not too useful anymore
(I find one from 2011 and one from 2016, both pre-dating BIT_INSERT_EXPR)

Richard.