Re: [PATCH] s390: Fix builtins vec_rli and verll

2023-09-11 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Mon, Aug 28, 2023 at 11:33:37AM +0200, Andreas Krebbel wrote:
> Hi Stefan,
> 
> do you really need to introduce a new flag for U64 given that the type of the 
> builtin is unsigned long?

In function s390_const_operand_ok the immediate is checked for validity
w.r.t. the flag:

  tree_to_uhwi (arg) > ((HOST_WIDE_INT_1U << (bitwidth - 1) << 1) - 1)

Here bitwidth is derived from the flag.
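
For reference, here is the bound check rewritten as a standalone sketch
(hypothetical helper, not GCC source); the two-step shift keeps the
computation well defined even for bitwidth == 64:

  #include <assert.h>
  #include <stdint.h>

  /* Does IMM fit into an unsigned BITWIDTH-bit immediate (1 <= bitwidth <= 64)?  */
  static int
  immediate_fits_p (uint64_t imm, unsigned bitwidth)
  {
    uint64_t max = (UINT64_C (1) << (bitwidth - 1) << 1) - 1;
    return imm <= max;
  }

  int
  main (void)
  {
    assert (immediate_fits_p (0xffffffffffffffffULL, 64)); /* max is ~0 */
    assert (!immediate_fits_p (0x100000000ULL, 32)); /* too wide for U32 */
    return 0;
  }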

Cheers,
Stefan

> 
> Andreas
> 
> On 8/21/23 17:56, Stefan Schulze Frielinghaus wrote:
> > The second argument of these builtins is an unsigned immediate.  For
> > vec_rli the API allows immediates up to 64 bits whereas the instruction
> > verll only allows immediates up to 32 bits.  Since the shift count
> > equals the immediate modulo vector element size, truncating those
> > immediates is fine.
> > 
> > Bootstrapped and regtested on s390.  Ok for mainline?
> > 
> > gcc/ChangeLog:
> > 
> > * config/s390/s390-builtins.def (O_U64): New.
> > (O1_U64): Ditto.
> > (O2_U64): Ditto.
> > (O3_U64): Ditto.
> > (O4_U64): Ditto.
> > (O_M12): Change bit position.
> > (O_S2): Ditto.
> > (O_S3): Ditto.
> > (O_S4): Ditto.
> > (O_S5): Ditto.
> > (O_S8): Ditto.
> > (O_S12): Ditto.
> > (O_S16): Ditto.
> > (O_S32): Ditto.
> > (O_ELEM): Ditto.
> > (O_LIT): Ditto.
> > (OB_DEF_VAR): Add operand constraints.
> > (B_DEF): Ditto.
> > * config/s390/s390.cc (s390_const_operand_ok): Honour 64 bit
> > operands.
> > ---
> >  gcc/config/s390/s390-builtins.def | 60 ++-
> >  gcc/config/s390/s390.cc   |  6 ++--
> >  2 files changed, 39 insertions(+), 27 deletions(-)
> > 
> > diff --git a/gcc/config/s390/s390-builtins.def 
> > b/gcc/config/s390/s390-builtins.def
> > index a16983b18bd..c829f445a11 100644
> > --- a/gcc/config/s390/s390-builtins.def
> > +++ b/gcc/config/s390/s390-builtins.def
> > @@ -28,6 +28,7 @@
> >  #undef O_U12
> >  #undef O_U16
> >  #undef O_U32
> > +#undef O_U64
> >  
> >  #undef O_M12
> >  
> > @@ -88,6 +89,11 @@
> >  #undef O3_U32
> >  #undef O4_U32
> >  
> > +#undef O1_U64
> > +#undef O2_U64
> > +#undef O3_U64
> > +#undef O4_U64
> > +
> >  #undef O1_M12
> >  #undef O2_M12
> >  #undef O3_M12
> > @@ -157,20 +163,21 @@
> >  #define O_U12   7  /* unsigned 12 bit literal */
> >  #define O_U16   8  /* unsigned 16 bit literal */
> >  #define O_U32   9  /* unsigned 32 bit literal */
> > +#define O_U64   10 /* unsigned 64 bit literal */
> >  
> > -#define O_M12   10 /* matches bitmask of 12 */
> > +#define O_M12   11 /* matches bitmask of 12 */
> >  
> > -#define O_S2    11 /* signed  2 bit literal */
> > -#define O_S3    12 /* signed  3 bit literal */
> > -#define O_S4    13 /* signed  4 bit literal */
> > -#define O_S5    14 /* signed  5 bit literal */
> > -#define O_S8    15 /* signed  8 bit literal */
> > -#define O_S12   16 /* signed 12 bit literal */
> > -#define O_S16   17 /* signed 16 bit literal */
> > -#define O_S32   18 /* signed 32 bit literal */
> > +#define O_S2    12 /* signed  2 bit literal */
> > +#define O_S3    13 /* signed  3 bit literal */
> > +#define O_S4    14 /* signed  4 bit literal */
> > +#define O_S5    15 /* signed  5 bit literal */
> > +#define O_S8    16 /* signed  8 bit literal */
> > +#define O_S12   17 /* signed 12 bit literal */
> > +#define O_S16   18 /* signed 16 bit literal */
> > +#define O_S32   19 /* signed 32 bit literal */
> >  
> > -#define O_ELEM  19 /* Element selector requiring modulo arithmetic. */
> > -#define O_LIT   20 /* Operand must be a literal fitting the target type.  */
> > +#define O_ELEM  20 /* Element selector requiring modulo arithmetic. */
> > +#define O_LIT   21 /* Operand must be a literal fitting the target type.  */
> >  
> >  #define O_SHIFT 5
> >  
> > @@ -223,6 +230,11 @@
> >  #define O3_U32 (O_U32 << (2 * O_SHIFT))
> >  #define O4_U32 (O_U32 << (3 * O_SHIFT))
> >  
> > +#define O1_U64 O_U64
> > +#define O2_U64 (O_U64 << O_SHIFT)
> > +#define O3_U64 (O_U64 << (2 * O_SHIFT))
> > +#define O4_U64 (O_U64 << (3 * O_SHIFT))
> > +
> >  #define O1_M12 O_M12
> >  #define O2_M12 (O_M12 << O_SHIFT)
> >  #define O3_M12 (O_M12 << (2 * O_SHIFT))
> > @@ -1989,19 +2001,19 @@ B_DEF  (s390_verllvf,   vrotlv4si3, 0,
> >  B_DEF  (s390_verllvg,   vrotlv2di3, 0,  B_VX,   0,  BT_FN_UV2DI_UV2DI_UV2DI)
> >  
> >  OB_DEF (s390_vec_rli,   s390_vec_rli_u8,    s390_vec_rli_s64,   B_VX,   BT_FN_OV4SI_OV4SI_ULONG)
> > -OB_DEF_VAR (s390_vec_rli_u8,    s390_verllb,    0,  0,  BT_OV_UV16QI_UV16QI_ULONG)
> > -OB_DEF_VAR (s390_vec_rli_s8,    s390_verllb,    0,  0,  BT_OV_V16QI_V16QI_ULONG)
> > -OB_DEF_VAR (s390_vec_rli_u16,   s390_verllh,    0,  0,  BT_OV_UV8HI_UV8HI_ULONG)
> > -OB_DEF_VAR 

Re: PING^3: [PATCH] rtl-optimization/110939 Really fix narrow comparison of memory and constant

2023-09-03 Thread Stefan Schulze Frielinghaus via Gcc-patches
Ping.

On Thu, Aug 24, 2023 at 11:31:32AM +0800, Xi Ruoyao wrote:
> Ping again.
> 
> On Fri, 2023-08-18 at 13:04 +0200, Stefan Schulze Frielinghaus via 
> Gcc-patches wrote:
> > Ping.  Since this fixes bootstrap problem PR110939 for LoongArch I'm
> > pinging this one earlier.
> > 
> > On Thu, Aug 10, 2023 at 03:04:03PM +0200, Stefan Schulze Frielinghaus wrote:
> > > In the former fix in commit 41ef5a34161356817807be3a2e51fbdbe575ae85 I
> > > completely missed the fact that the normal form of a generated constant
> > > for a mode with fewer bits than in HOST_WIDE_INT is a sign extended
> > > version of the actual constant.  This even holds true for unsigned
> > > constants.
> > > 
> > > Fixed by masking out the upper bits for the incoming constant and sign
> > > extending the resulting unsigned constant.
> > > 
> > > Bootstrapped and regtested on x64 and s390x.  Ok for mainline?
> > > 
> > > While reading existing optimizations in combine I stumbled across two
> > > optimizations where either my intuition about the representation of
> > > unsigned integers via a const_int rtx is wrong, which then in turn would
> > > probably also mean that this patch is wrong, or that the optimizations
> > > are missed sometimes.  In other words in the following I would assume
> > > that the upper bits are masked out:
> > > 
> > > diff --git a/gcc/combine.cc b/gcc/combine.cc
> > > index 468b7fde911..80c4ff0fbaf 100644
> > > --- a/gcc/combine.cc
> > > +++ b/gcc/combine.cc
> > > @@ -11923,7 +11923,7 @@ simplify_compare_const (enum rtx_code code, 
> > > machine_mode mode,
> > >    /* (unsigned) < 0x80000000 is equivalent to >= 0.  */
> > >    else if (is_a <scalar_int_mode> (mode, &int_mode)
> > >    && GET_MODE_PRECISION (int_mode) - 1 < HOST_BITS_PER_WIDE_INT
> > > -  && ((unsigned HOST_WIDE_INT) const_op
> > > +  && (((unsigned HOST_WIDE_INT) const_op & GET_MODE_MASK (int_mode))
> > >    == HOST_WIDE_INT_1U << (GET_MODE_PRECISION (int_mode) - 1)))
> > >     {
> > >   const_op = 0;
> > > @@ -11962,7 +11962,7 @@ simplify_compare_const (enum rtx_code code, machine_mode mode,
> > >    /* (unsigned) >= 0x80000000 is equivalent to < 0.  */
> > >    else if (is_a <scalar_int_mode> (mode, &int_mode)
> > >    && GET_MODE_PRECISION (int_mode) - 1 < HOST_BITS_PER_WIDE_INT
> > > -  && ((unsigned HOST_WIDE_INT) const_op
> > > +  && (((unsigned HOST_WIDE_INT) const_op & GET_MODE_MASK (int_mode))
> > >    == HOST_WIDE_INT_1U << (GET_MODE_PRECISION (int_mode) - 1)))
> > >     {
> > >   const_op = 0;
> > > 
> > > For example, while bootstrapping on x64 the optimization is missed since
> > > a LTU comparison in QImode is done and the constant equals
> > > 0xffffffffffffff80.
> > > 
> > > Sorry for inlining another patch, but I would really like to make sure
> > > that my understanding is correct, now, before I come up with another
> > > patch.  Thus it would be great if someone could shed some light on this.
> > > 
> > > gcc/ChangeLog:
> > > 
> > > * combine.cc (simplify_compare_const): Properly handle unsigned
> > > constants while narrowing comparison of memory and constants.
> > > ---
> > >  gcc/combine.cc | 19 ++-
> > >  1 file changed, 10 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/gcc/combine.cc b/gcc/combine.cc
> > > index e46d202d0a7..468b7fde911 100644
> > > --- a/gcc/combine.cc
> > > +++ b/gcc/combine.cc
> > > @@ -12003,14 +12003,15 @@ simplify_compare_const (enum rtx_code code, machine_mode mode,
> > >    && !MEM_VOLATILE_P (op0)
> > >    /* The optimization makes only sense for constants which are big enough
> > >  so that we have a chance to chop off something at all.  */
> > > -  && (unsigned HOST_WIDE_INT) const_op > 0xff
> > > -  /* Bail out, if the constant does not fit into INT_MODE.  */
> > > -  && (unsigned HOST_WIDE_INT) const_op
> > > -    < ((HOST_WIDE_INT_1U << (GET_MODE_PRECISION (int_mode) - 1) << 
>

[PATCH] s390: Fix builtins vec_rli and verll

2023-08-21 Thread Stefan Schulze Frielinghaus via Gcc-patches
The second argument of these builtins is an unsigned immediate.  For
vec_rli the API allows immediates up to 64 bits whereas the instruction
verll only allows immediates up to 32 bits.  Since the shift count
equals the immediate modulo vector element size, truncating those
immediates is fine.
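
To illustrate why truncation is safe (standalone sketch, not part of the
patch): the element size W is a power of two dividing 2^32, so reducing
the immediate modulo 2^32 leaves its value modulo W unchanged.

  #include <assert.h>
  #include <stdint.h>

  /* Rotate left on a 32-bit element; N is already reduced modulo 32,
     just as the hardware reduces the shift count.  */
  static uint32_t
  rotl32 (uint32_t x, unsigned n)
  {
    return n ? (x << n) | (x >> (32 - n)) : x;
  }

  int
  main (void)
  {
    uint64_t imm = 0x100000003ULL;  /* 64-bit immediate from the API */
    /* Chopping the immediate down to 32 bits preserves imm % 32 ...  */
    assert (imm % 32 == (uint32_t) imm % 32);
    /* ... and hence the rotate result.  */
    assert (rotl32 (0xdeadbeef, imm % 32)
            == rotl32 (0xdeadbeef, (uint32_t) imm % 32));
    return 0;
  }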

Bootstrapped and regtested on s390.  Ok for mainline?

gcc/ChangeLog:

* config/s390/s390-builtins.def (O_U64): New.
(O1_U64): Ditto.
(O2_U64): Ditto.
(O3_U64): Ditto.
(O4_U64): Ditto.
(O_M12): Change bit position.
(O_S2): Ditto.
(O_S3): Ditto.
(O_S4): Ditto.
(O_S5): Ditto.
(O_S8): Ditto.
(O_S12): Ditto.
(O_S16): Ditto.
(O_S32): Ditto.
(O_ELEM): Ditto.
(O_LIT): Ditto.
(OB_DEF_VAR): Add operand constraints.
(B_DEF): Ditto.
* config/s390/s390.cc (s390_const_operand_ok): Honour 64 bit
operands.
---
 gcc/config/s390/s390-builtins.def | 60 ++-
 gcc/config/s390/s390.cc   |  6 ++--
 2 files changed, 39 insertions(+), 27 deletions(-)

diff --git a/gcc/config/s390/s390-builtins.def 
b/gcc/config/s390/s390-builtins.def
index a16983b18bd..c829f445a11 100644
--- a/gcc/config/s390/s390-builtins.def
+++ b/gcc/config/s390/s390-builtins.def
@@ -28,6 +28,7 @@
 #undef O_U12
 #undef O_U16
 #undef O_U32
+#undef O_U64
 
 #undef O_M12
 
@@ -88,6 +89,11 @@
 #undef O3_U32
 #undef O4_U32
 
+#undef O1_U64
+#undef O2_U64
+#undef O3_U64
+#undef O4_U64
+
 #undef O1_M12
 #undef O2_M12
 #undef O3_M12
@@ -157,20 +163,21 @@
 #define O_U12   7  /* unsigned 12 bit literal */
 #define O_U16   8  /* unsigned 16 bit literal */
 #define O_U32   9  /* unsigned 32 bit literal */
+#define O_U64   10 /* unsigned 64 bit literal */
 
-#define O_M12   10 /* matches bitmask of 12 */
+#define O_M12   11 /* matches bitmask of 12 */
 
-#define O_S2    11 /* signed  2 bit literal */
-#define O_S3    12 /* signed  3 bit literal */
-#define O_S4    13 /* signed  4 bit literal */
-#define O_S5    14 /* signed  5 bit literal */
-#define O_S8    15 /* signed  8 bit literal */
-#define O_S12   16 /* signed 12 bit literal */
-#define O_S16   17 /* signed 16 bit literal */
-#define O_S32   18 /* signed 32 bit literal */
+#define O_S2    12 /* signed  2 bit literal */
+#define O_S3    13 /* signed  3 bit literal */
+#define O_S4    14 /* signed  4 bit literal */
+#define O_S5    15 /* signed  5 bit literal */
+#define O_S8    16 /* signed  8 bit literal */
+#define O_S12   17 /* signed 12 bit literal */
+#define O_S16   18 /* signed 16 bit literal */
+#define O_S32   19 /* signed 32 bit literal */
 
-#define O_ELEM  19 /* Element selector requiring modulo arithmetic. */
-#define O_LIT   20 /* Operand must be a literal fitting the target type.  */
+#define O_ELEM  20 /* Element selector requiring modulo arithmetic. */
+#define O_LIT   21 /* Operand must be a literal fitting the target type.  */
 
 #define O_SHIFT 5
 
@@ -223,6 +230,11 @@
 #define O3_U32 (O_U32 << (2 * O_SHIFT))
 #define O4_U32 (O_U32 << (3 * O_SHIFT))
 
+#define O1_U64 O_U64
+#define O2_U64 (O_U64 << O_SHIFT)
+#define O3_U64 (O_U64 << (2 * O_SHIFT))
+#define O4_U64 (O_U64 << (3 * O_SHIFT))
+
 #define O1_M12 O_M12
 #define O2_M12 (O_M12 << O_SHIFT)
 #define O3_M12 (O_M12 << (2 * O_SHIFT))
@@ -1989,19 +2001,19 @@ B_DEF  (s390_verllvf,   vrotlv4si3, 0,
 B_DEF  (s390_verllvg,   vrotlv2di3, 0,  B_VX,   0,  BT_FN_UV2DI_UV2DI_UV2DI)
 
 OB_DEF (s390_vec_rli,   s390_vec_rli_u8,    s390_vec_rli_s64,   B_VX,   BT_FN_OV4SI_OV4SI_ULONG)
-OB_DEF_VAR (s390_vec_rli_u8,    s390_verllb,    0,  0,  BT_OV_UV16QI_UV16QI_ULONG)
-OB_DEF_VAR (s390_vec_rli_s8,    s390_verllb,    0,  0,  BT_OV_V16QI_V16QI_ULONG)
-OB_DEF_VAR (s390_vec_rli_u16,   s390_verllh,    0,  0,  BT_OV_UV8HI_UV8HI_ULONG)
-OB_DEF_VAR (s390_vec_rli_s16,   s390_verllh,    0,  0,  BT_OV_V8HI_V8HI_ULONG)
-OB_DEF_VAR (s390_vec_rli_u32,   s390_verllf,    0,  0,  BT_OV_UV4SI_UV4SI_ULONG)
-OB_DEF_VAR (s390_vec_rli_s32,   s390_verllf,    0,  0,  BT_OV_V4SI_V4SI_ULONG)
-OB_DEF_VAR (s390_vec_rli_u64,   s390_verllg,    0,  0,  BT_OV_UV2DI_UV2DI_ULONG)
-OB_DEF_VAR (s390_vec_rli_s64,   s390_verllg,    0,  0,  BT_OV_V2DI_V2DI_ULONG)
-
-B_DEF  (s390_verllb,    rotlv16qi3, 0,  B_VX,   0,  BT_FN_UV16QI_UV16QI_UINT)
-B_DEF  (s390_verllh,    rotlv8hi3,  0,  B_VX,   0,  

[PATCH] s390: Fix some builtin definitions

2023-08-21 Thread Stefan Schulze Frielinghaus via Gcc-patches
Bootstrapped and regtested on s390.  Ok for mainline?

gcc/ChangeLog:

* config/s390/s390-builtins.def (s390_vec_signed_flt): Fix
builtin flag.
(s390_vec_unsigned_flt): Ditto.
(s390_vec_revb_flt): Ditto.
(s390_vec_reve_flt): Ditto.
(s390_vclfnhs): Fix operand flags.
(s390_vclfnls): Ditto.
(s390_vcrnfs): Ditto.
(s390_vcfn): Ditto.
(s390_vcnf): Ditto.
---
 gcc/config/s390/s390-builtins.def | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/gcc/config/s390/s390-builtins.def 
b/gcc/config/s390/s390-builtins.def
index c829f445a11..964d86c74a0 100644
--- a/gcc/config/s390/s390-builtins.def
+++ b/gcc/config/s390/s390-builtins.def
@@ -2846,12 +2846,12 @@ B_DEF  (s390_vcelfb,    floatunsv4siv4sf2,  0,
 B_DEF  (s390_vcdlgb,    floatunsv2div2df2,  0,  B_VX,   O2_U4 | O3_U3,  BT_FN_V2DF_UV2DI)
 
 OB_DEF (s390_vec_signed,    s390_vec_signed_flt,    s390_vec_signed_dbl,    B_VX,   BT_FN_OV4SI_OV4SI)
-OB_DEF_VAR (s390_vec_signed_flt,    s390_vcfeb, 0,  B_VXE2, BT_OV_V4SI_V4SF)
+OB_DEF_VAR (s390_vec_signed_flt,    s390_vcfeb, B_VXE2, 0,  BT_OV_V4SI_V4SF)
 OB_DEF_VAR (s390_vec_signed_dbl,    s390_vcgdb, 0,  0,  BT_OV_V2DI_V2DF)
 
 OB_DEF (s390_vec_unsigned,  s390_vec_unsigned_flt,  s390_vec_unsigned_dbl,  B_VX,   BT_FN_OV4SI_OV4SI)
-OB_DEF_VAR (s390_vec_unsigned_flt,  s390_vclfeb,    0,  B_VXE2, BT_OV_UV4SI_V4SF)
-OB_DEF_VAR (s390_vec_unsigned_dbl,  s390_vclgdb,    0,  0,  BT_OV_UV2DI_V2DF)
+OB_DEF_VAR (s390_vec_unsigned_flt,  s390_vclfeb,    B_VXE2, 0,  BT_OV_UV4SI_V4SF)
+OB_DEF_VAR (s390_vec_unsigned_dbl,  s390_vclgdb,    0,  0,  BT_OV_UV2DI_V2DF)
 
 B_DEF  (s390_vcfeb, fix_truncv4sfv4si2, 0,  B_VXE2, O2_U4 | O3_U3,  BT_FN_V4SI_V4SF)
 B_DEF  (s390_vcgdb, fix_truncv2dfv2di2, 0,  B_VX,   O2_U4 | O3_U3,  BT_FN_V2DI_V2DF)
@@ -2929,7 +2929,7 @@ OB_DEF_VAR (s390_vec_revb_s32,  s390_vlbrf, 0,
 OB_DEF_VAR (s390_vec_revb_u32,  s390_vlbrf, 0,  0,  BT_OV_UV4SI_UV4SI)
 OB_DEF_VAR (s390_vec_revb_s64,  s390_vlbrg, 0,  0,  BT_OV_V2DI_V2DI)
 OB_DEF_VAR (s390_vec_revb_u64,  s390_vlbrg, 0,  0,  BT_OV_UV2DI_UV2DI)
-OB_DEF_VAR (s390_vec_revb_flt,  s390_vlbrf_flt, 0,  B_VXE,  BT_OV_V4SF_V4SF)
+OB_DEF_VAR (s390_vec_revb_flt,  s390_vlbrf_flt, B_VXE,  0,  BT_OV_V4SF_V4SF)
 OB_DEF_VAR (s390_vec_revb_dbl,  s390_vlbrg_dbl, 0,  0,  BT_OV_V2DF_V2DF)
 
 B_DEF  (s390_vlbrh, bswapv8hi,  0,  B_VX,   0,   BT_FN_V8HI_V8HI)
@@ -2960,7 +2960,7 @@ OB_DEF_VAR (s390_vec_reve_u32,  s390_vlerf, 0,
 OB_DEF_VAR (s390_vec_reve_b64,  s390_vlerg, 0,  0,  BT_OV_BV2DI_BV2DI)
 OB_DEF_VAR (s390_vec_reve_s64,  s390_vlerg, 0,  0,  BT_OV_V2DI_V2DI)
 OB_DEF_VAR (s390_vec_reve_u64,  s390_vlerg, 0,  0,  BT_OV_UV2DI_UV2DI)
-OB_DEF_VAR (s390_vec_reve_flt,  s390_vlerf_flt, 0,  B_VXE,  BT_OV_V4SF_V4SF)
+OB_DEF_VAR (s390_vec_reve_flt,  s390_vlerf_flt, B_VXE,  0,  BT_OV_V4SF_V4SF)
 OB_DEF_VAR (s390_vec_reve_dbl,  s390_vlerg_dbl, 0,  0,  BT_OV_V2DF_V2DF)
 
 B_DEF  (s390_vlerb, eltswapv16qi,   0,  B_VX,   0,   BT_FN_V16QI_V16QI)
@@ -3037,10 +3037,10 @@ B_DEF  (s390_vstrszf,   vstrszv4si, 0,
 
 /* arch 14 builtins */
 
-B_DEF  (s390_vclfnhs,   vclfnhs_v8hi,   0,  B_NNPA, O3_U4,  BT_FN_V4SF_V8HI_UINT)
-B_DEF  (s390_vclfnls,   vclfnls_v8hi,   0,  B_NNPA, O3_U4,  BT_FN_V4SF_V8HI_UINT)
+B_DEF  (s390_vclfnhs,   vclfnhs_v8hi,   0,  B_NNPA, O2_U4,  BT_FN_V4SF_V8HI_UINT)
+B_DEF  (s390_vclfnls,   vclfnls_v8hi,   0,  B_NNPA, O2_U4,  BT_FN_V4SF_V8HI_UINT)
 
-B_DEF  (s390_vcrnfs,    vcrnfs_v8hi,    0,  

Re: [PATCH] rtl-optimization/110939 Really fix narrow comparison of memory and constant

2023-08-18 Thread Stefan Schulze Frielinghaus via Gcc-patches
Ping.  Since this fixes bootstrap problem PR110939 for LoongArch I'm
pinging this one earlier.

On Thu, Aug 10, 2023 at 03:04:03PM +0200, Stefan Schulze Frielinghaus wrote:
> In the former fix in commit 41ef5a34161356817807be3a2e51fbdbe575ae85 I
> completely missed the fact that the normal form of a generated constant for a
> mode with fewer bits than in HOST_WIDE_INT is a sign extended version of the
> actual constant.  This even holds true for unsigned constants.
> 
> Fixed by masking out the upper bits for the incoming constant and sign
> extending the resulting unsigned constant.
> 
> Bootstrapped and regtested on x64 and s390x.  Ok for mainline?
> 
> While reading existing optimizations in combine I stumbled across two
> optimizations where either my intuition about the representation of
> unsigned integers via a const_int rtx is wrong, which then in turn would
> probably also mean that this patch is wrong, or that the optimizations
> are missed sometimes.  In other words in the following I would assume
> that the upper bits are masked out:
> 
> diff --git a/gcc/combine.cc b/gcc/combine.cc
> index 468b7fde911..80c4ff0fbaf 100644
> --- a/gcc/combine.cc
> +++ b/gcc/combine.cc
> @@ -11923,7 +11923,7 @@ simplify_compare_const (enum rtx_code code, machine_mode mode,
>    /* (unsigned) < 0x80000000 is equivalent to >= 0.  */
>    else if (is_a <scalar_int_mode> (mode, &int_mode)
>    && GET_MODE_PRECISION (int_mode) - 1 < HOST_BITS_PER_WIDE_INT
> -  && ((unsigned HOST_WIDE_INT) const_op
> +  && (((unsigned HOST_WIDE_INT) const_op & GET_MODE_MASK (int_mode))
>    == HOST_WIDE_INT_1U << (GET_MODE_PRECISION (int_mode) - 1)))
> {
>   const_op = 0;
> @@ -11962,7 +11962,7 @@ simplify_compare_const (enum rtx_code code, machine_mode mode,
>    /* (unsigned) >= 0x80000000 is equivalent to < 0.  */
>    else if (is_a <scalar_int_mode> (mode, &int_mode)
>    && GET_MODE_PRECISION (int_mode) - 1 < HOST_BITS_PER_WIDE_INT
> -  && ((unsigned HOST_WIDE_INT) const_op
> +  && (((unsigned HOST_WIDE_INT) const_op & GET_MODE_MASK (int_mode))
>    == HOST_WIDE_INT_1U << (GET_MODE_PRECISION (int_mode) - 1)))
> {
>   const_op = 0;
> 
> For example, while bootstrapping on x64 the optimization is missed since
> a LTU comparison in QImode is done and the constant equals
> 0xffffffffffffff80.
> 
> Sorry for inlining another patch, but I would really like to make sure
> that my understanding is correct, now, before I come up with another
> patch.  Thus it would be great if someone could shed some light on this.
> 
> gcc/ChangeLog:
> 
>   * combine.cc (simplify_compare_const): Properly handle unsigned
>   constants while narrowing comparison of memory and constants.
> ---
>  gcc/combine.cc | 19 ++-
>  1 file changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/gcc/combine.cc b/gcc/combine.cc
> index e46d202d0a7..468b7fde911 100644
> --- a/gcc/combine.cc
> +++ b/gcc/combine.cc
> @@ -12003,14 +12003,15 @@ simplify_compare_const (enum rtx_code code, machine_mode mode,
>    && !MEM_VOLATILE_P (op0)
>    /* The optimization makes only sense for constants which are big enough
>    so that we have a chance to chop off something at all.  */
> -  && (unsigned HOST_WIDE_INT) const_op > 0xff
> -  /* Bail out, if the constant does not fit into INT_MODE.  */
> -  && (unsigned HOST_WIDE_INT) const_op
> -  < ((HOST_WIDE_INT_1U << (GET_MODE_PRECISION (int_mode) - 1) << 1) - 1)
> +  && ((unsigned HOST_WIDE_INT) const_op & GET_MODE_MASK (int_mode)) > 0xff
>    /* Ensure that we do not overflow during normalization.  */
> -  && (code != GTU || (unsigned HOST_WIDE_INT) const_op < HOST_WIDE_INT_M1U))
> +  && (code != GTU
> +   || ((unsigned HOST_WIDE_INT) const_op & GET_MODE_MASK (int_mode))
> +  < HOST_WIDE_INT_M1U)
> +  && trunc_int_for_mode (const_op, int_mode) == const_op)
>  {
> -  unsigned HOST_WIDE_INT n = (unsigned HOST_WIDE_INT) const_op;
> +  unsigned HOST_WIDE_INT n
> + = (unsigned HOST_WIDE_INT) const_op & GET_MODE_MASK (int_mode);
>enum rtx_code adjusted_code;
>  
>/* Normalize code to either LEU or GEU.  */
> @@ -12051,15 +12052,15 @@ simplify_compare_const (enum rtx_code code, machine_mode mode,
>   HOST_WIDE_INT_PRINT_HEX ") to (MEM %s "
>   HOST_WIDE_INT_PRINT_HEX ").\n", GET_MODE_NAME (int_mode),
>   GET_MODE_NAME (narrow_mode_iter), GET_RTX_NAME (code),
> - (unsigned HOST_WIDE_INT)const_op, GET_RTX_NAME (adjusted_code),
> - n);
> + (unsigned HOST_WIDE_INT) const_op & GET_MODE_MASK (int_mode),
> + GET_RTX_NAME (adjusted_code), n);
>   }
> poly_int64 offset = (BYTES_BIG_ENDIAN
>  ? 0
>  : (GET_MODE_SIZE (int_mode)
>   

Re: [PATCH V4] VECT: Support loop len control on EXTRACT_LAST vectorization

2023-08-14 Thread Stefan Schulze Frielinghaus via Gcc-patches
Hi everyone,

I have bootstrapped and regtested the patch below on s390.  For the
64-bit target I do not see any changes regarding the testsuite.  For the
31-bit target I see the following failures:

FAIL: gcc.dg/vect/no-scevccp-outer-14.c (internal compiler error: in require, 
at machmode.h:313)
FAIL: gcc.dg/vect/no-scevccp-outer-14.c (test for excess errors)
FAIL: gcc.dg/vect/pr50451.c (internal compiler error: in require, at 
machmode.h:313)
FAIL: gcc.dg/vect/pr50451.c (test for excess errors)
FAIL: gcc.dg/vect/pr50451.c -flto -ffat-lto-objects (internal compiler error: 
in require, at machmode.h:313)
FAIL: gcc.dg/vect/pr50451.c -flto -ffat-lto-objects (test for excess errors)
FAIL: gcc.dg/vect/pr53773.c (internal compiler error: in require, at 
machmode.h:313)
FAIL: gcc.dg/vect/pr53773.c (test for excess errors)
FAIL: gcc.dg/vect/pr53773.c -flto -ffat-lto-objects (internal compiler error: 
in require, at machmode.h:313)
FAIL: gcc.dg/vect/pr53773.c -flto -ffat-lto-objects (test for excess errors)
FAIL: gcc.dg/vect/pr71407.c (internal compiler error: in require, at 
machmode.h:313)
FAIL: gcc.dg/vect/pr71407.c (test for excess errors)
FAIL: gcc.dg/vect/pr71407.c -flto -ffat-lto-objects (internal compiler error: 
in require, at machmode.h:313)
FAIL: gcc.dg/vect/pr71407.c -flto -ffat-lto-objects (test for excess errors)
FAIL: gcc.dg/vect/pr71416-1.c (internal compiler error: in require, at 
machmode.h:313)
FAIL: gcc.dg/vect/pr71416-1.c (test for excess errors)
FAIL: gcc.dg/vect/pr71416-1.c -flto -ffat-lto-objects (internal compiler error: 
in require, at machmode.h:313)
FAIL: gcc.dg/vect/pr71416-1.c -flto -ffat-lto-objects (test for excess errors)
FAIL: gcc.dg/vect/pr94443.c (internal compiler error: in require, at 
machmode.h:313)
FAIL: gcc.dg/vect/pr94443.c (test for excess errors)
FAIL: gcc.dg/vect/pr94443.c -flto -ffat-lto-objects (internal compiler error: 
in require, at machmode.h:313)
FAIL: gcc.dg/vect/pr94443.c -flto -ffat-lto-objects (test for excess errors)
FAIL: gcc.dg/vect/pr97558.c (internal compiler error: in require, at 
machmode.h:313)
FAIL: gcc.dg/vect/pr97558.c (test for excess errors)
FAIL: gcc.dg/vect/pr97558.c -flto -ffat-lto-objects (internal compiler error: 
in require, at machmode.h:313)
FAIL: gcc.dg/vect/pr97558.c -flto -ffat-lto-objects (test for excess errors)
FAIL: gcc.dg/vect/vect-reduc-pattern-3.c -flto -ffat-lto-objects (internal 
compiler error: in require, at machmode.h:313)
FAIL: gcc.dg/vect/vect-reduc-pattern-3.c -flto -ffat-lto-objects (test for 
excess errors)
UNRESOLVED: gcc.dg/vect/no-scevccp-outer-14.c compilation failed to produce 
executable
UNRESOLVED: gcc.dg/vect/pr53773.c -flto -ffat-lto-objects  scan-tree-dump-times 
optimized "\\* 10" 2
UNRESOLVED: gcc.dg/vect/pr53773.c scan-tree-dump-times optimized "\\* 10" 2
UNRESOLVED: gcc.dg/vect/pr71416-1.c -flto -ffat-lto-objects compilation failed 
to produce executable
UNRESOLVED: gcc.dg/vect/pr71416-1.c compilation failed to produce executable
UNRESOLVED: gcc.dg/vect/vect-reduc-pattern-3.c -flto -ffat-lto-objects 
compilation failed to produce executable

I've randomly picked pr50451.c and ran gcc against it, which results in:

during GIMPLE pass: vect
dump file: pr50451.c.174t.vect
/gcc-verify-workdir/patched/src/gcc/testsuite/gcc.dg/vect/pr50451.c: In 
function ‘foo’:
/gcc-verify-workdir/patched/src/gcc/testsuite/gcc.dg/vect/pr50451.c:5:1: 
internal compiler error: in require, at machmode.h:313
0x1265d21 opt_mode::require() const
/gcc-verify-workdir/patched/src/gcc/machmode.h:313
0x1d7e4e9 opt_mode::require() const
/gcc-verify-workdir/patched/src/gcc/vec.h:955
0x1d7e4e9 vect_verify_loop_lens
/gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:1471
0x1da29ab vect_analyze_loop_2
/gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:2929
0x1da40c7 vect_analyze_loop_1
/gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:3330
0x1da499d vect_analyze_loop(loop*, vec_info_shared*)
/gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:3484
0x1deed27 try_vectorize_loop_1
/gcc-verify-workdir/patched/src/gcc/tree-vectorizer.cc:1064
0x1deed27 try_vectorize_loop
/gcc-verify-workdir/patched/src/gcc/tree-vectorizer.cc:1180
0x1def5c1 execute
/gcc-verify-workdir/patched/src/gcc/tree-vectorizer.cc:1296
Please submit a full bug report, with preprocessed source (by using 
-freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

I will come back to this tomorrow.

Cheers,
Stefan

On Mon, Aug 14, 2023 at 08:45:21PM +0800, Kewen.Lin via Gcc-patches wrote:
> Hi Juzhe,
> 
> on 2023/8/14 20:08, juzhe.zh...@rivai.ai wrote:
> > Hi, Kewin.
> > 
> > Could you test 'can_vec_extract_var_idx_p' and send V5 patch when you pass 
> > the testing?
> 
> The below diff was bootstrapped and regress-tested on Power10 LE.  Comparing 
> to the
> previous v4, the only changes should be the proposed 
> 

Re: [PATCH] rtl-optimization/110939 Really fix narrow comparison of memory and constant

2023-08-14 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Sat, Aug 12, 2023 at 09:04:19AM +0800, Xi Ruoyao wrote:
> On Thu, 2023-08-10 at 15:04 +0200, Stefan Schulze Frielinghaus via Gcc-
> patches wrote:
> > In the former fix in commit 41ef5a34161356817807be3a2e51fbdbe575ae85 I
> > completely missed the fact that the normal form of a generated constant
> > for a mode with fewer bits than in HOST_WIDE_INT is a sign extended
> > version of the actual constant.  This even holds true for unsigned
> > constants.
> > 
> > Fixed by masking out the upper bits for the incoming constant and sign
> > extending the resulting unsigned constant.
> > 
> > Bootstrapped and regtested on x64 and s390x.  Ok for mainline?
> 
> The patch fails to apply:
> 
> patching file gcc/combine.cc
> Hunk #1 FAILED at 11923.
> Hunk #2 FAILED at 11962.
> 
> It looks like some indents are tabs in the source file, but white spaces
> in the patch.

The patch itself applies cleanly.  The failures come from the diff I
inlined in order to raise some discussion, i.e., just remove the
following from the email and the patch applies:

> > diff --git a/gcc/combine.cc b/gcc/combine.cc
> > index 468b7fde911..80c4ff0fbaf 100644
> > --- a/gcc/combine.cc
> > +++ b/gcc/combine.cc
> > @@ -11923,7 +11923,7 @@ simplify_compare_const (enum rtx_code code, machine_mode mode,
> >    /* (unsigned) < 0x80000000 is equivalent to >= 0.  */
> >    else if (is_a <scalar_int_mode> (mode, &int_mode)
> >    && GET_MODE_PRECISION (int_mode) - 1 < HOST_BITS_PER_WIDE_INT
> > -  && ((unsigned HOST_WIDE_INT) const_op
> > +  && (((unsigned HOST_WIDE_INT) const_op & GET_MODE_MASK (int_mode))
> >    == HOST_WIDE_INT_1U << (GET_MODE_PRECISION (int_mode) - 1)))
> >     {
> >   const_op = 0;
> > @@ -11962,7 +11962,7 @@ simplify_compare_const (enum rtx_code code, machine_mode mode,
> >    /* (unsigned) >= 0x80000000 is equivalent to < 0.  */
> >    else if (is_a <scalar_int_mode> (mode, &int_mode)
> >    && GET_MODE_PRECISION (int_mode) - 1 < HOST_BITS_PER_WIDE_INT
> > -  && ((unsigned HOST_WIDE_INT) const_op
> > +  && (((unsigned HOST_WIDE_INT) const_op & GET_MODE_MASK (int_mode))
> >    == HOST_WIDE_INT_1U << (GET_MODE_PRECISION (int_mode) - 1)))
> >     {
> >   const_op = 0;

Looks like git am/apply is confused by that.

Cheers,
Stefan

> > 
> > For example, while bootstrapping on x64 the optimization is missed since
> > a LTU comparison in QImode is done and the constant equals
> > 0xffffffffffffff80.
> > 
> > Sorry for inlining another patch, but I would really like to make sure
> > that my understanding is correct, now, before I come up with another
> > patch.  Thus it would be great if someone could shed some light on this.
> > 
> > gcc/ChangeLog:
> > 
> > * combine.cc (simplify_compare_const): Properly handle unsigned
> > constants while narrowing comparison of memory and constants.
> > ---
> >  gcc/combine.cc | 19 ++-
> >  1 file changed, 10 insertions(+), 9 deletions(-)
> > 
> > diff --git a/gcc/combine.cc b/gcc/combine.cc
> > index e46d202d0a7..468b7fde911 100644
> > --- a/gcc/combine.cc
> > +++ b/gcc/combine.cc
> > @@ -12003,14 +12003,15 @@ simplify_compare_const (enum rtx_code code, machine_mode mode,
> >    && !MEM_VOLATILE_P (op0)
> >    /* The optimization makes only sense for constants which are big enough
> >  so that we have a chance to chop off something at all.  */
> > -  && (unsigned HOST_WIDE_INT) const_op > 0xff
> > -  /* Bail out, if the constant does not fit into INT_MODE.  */
> > -  && (unsigned HOST_WIDE_INT) const_op
> > -    < ((HOST_WIDE_INT_1U << (GET_MODE_PRECISION (int_mode) - 1) << 1) - 1)
> > +  && ((unsigned HOST_WIDE_INT) const_op & GET_MODE_MASK (int_mode)) > 0xff
> >    /* Ensure that we do not overflow during normalization.  */
> > -  && (code != GTU || (unsigned HOST_WIDE_INT) const_op < HOST_WIDE_INT_M1U))
> > +  && (code != GTU
> > + || ((unsigned HOST_WIDE_INT) const_op & GET_MODE_MASK (int_mode))
> > +    < HOST_WIDE_INT_M1U)
> > +  && trunc_int_for_mode (const_op, int_mode) == const_op)
> >  {
> > -  unsigned HOST_WIDE_INT n = (unsigned HOST_WIDE_INT) cons

[PATCH] rtl-optimization/110939 Really fix narrow comparison of memory and constant

2023-08-10 Thread Stefan Schulze Frielinghaus via Gcc-patches
In the former fix in commit 41ef5a34161356817807be3a2e51fbdbe575ae85 I
completely missed the fact that the normal form of a generated constant for a
mode with fewer bits than in HOST_WIDE_INT is a sign extended version of the
actual constant.  This even holds true for unsigned constants.

Fixed by masking out the upper bits for the incoming constant and sign
extending the resulting unsigned constant.
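
As a standalone illustration of that normal form (this mimics GCC's
trunc_int_for_mode for an 8-bit mode; it is not GCC source):

  #include <assert.h>
  #include <stdint.h>

  /* Canonicalize C for an 8-bit mode: mask to the mode, then sign extend.  */
  static int64_t
  trunc_for_qimode (int64_t c)
  {
    return (int64_t) (int8_t) (c & 0xff);
  }

  int
  main (void)
  {
    /* The unsigned QImode constant 0x80 is stored as -128, i.e. as the
       HOST_WIDE_INT bit pattern 0xffffffffffffff80.  */
    assert ((uint64_t) trunc_for_qimode (0x80) == 0xffffffffffffff80ULL);
    return 0;
  }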

Bootstrapped and regtested on x64 and s390x.  Ok for mainline?

While reading existing optimizations in combine I stumbled across two
optimizations where either my intuition about the representation of
unsigned integers via a const_int rtx is wrong, which then in turn would
probably also mean that this patch is wrong, or that the optimizations
are missed sometimes.  In other words in the following I would assume
that the upper bits are masked out:

diff --git a/gcc/combine.cc b/gcc/combine.cc
index 468b7fde911..80c4ff0fbaf 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -11923,7 +11923,7 @@ simplify_compare_const (enum rtx_code code, machine_mode mode,
   /* (unsigned) < 0x80000000 is equivalent to >= 0.  */
   else if (is_a <scalar_int_mode> (mode, &int_mode)
   && GET_MODE_PRECISION (int_mode) - 1 < HOST_BITS_PER_WIDE_INT
-  && ((unsigned HOST_WIDE_INT) const_op
+  && (((unsigned HOST_WIDE_INT) const_op & GET_MODE_MASK (int_mode))
   == HOST_WIDE_INT_1U << (GET_MODE_PRECISION (int_mode) - 1)))
{
  const_op = 0;
@@ -11962,7 +11962,7 @@ simplify_compare_const (enum rtx_code code, machine_mode mode,
   /* (unsigned) >= 0x80000000 is equivalent to < 0.  */
   else if (is_a <scalar_int_mode> (mode, &int_mode)
   && GET_MODE_PRECISION (int_mode) - 1 < HOST_BITS_PER_WIDE_INT
-  && ((unsigned HOST_WIDE_INT) const_op
+  && (((unsigned HOST_WIDE_INT) const_op & GET_MODE_MASK (int_mode))
   == HOST_WIDE_INT_1U << (GET_MODE_PRECISION (int_mode) - 1)))
{
  const_op = 0;

For example, while bootstrapping on x64 the optimization is missed since
a LTU comparison in QImode is done and the constant equals
0xffffffffffffff80.

Sorry for inlining another patch, but I would really like to make sure
that my understanding is correct, now, before I come up with another
patch.  Thus it would be great if someone could shed some light on this.
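
To spell out the example (standalone sketch, not part of the patch): the
canonical QImode form of 0x80 is the sign-extended value
0xffffffffffffff80, so the unmasked equality test against the sign-bit
constant 0x80 never fires, whereas the masked one does.

  #include <assert.h>
  #include <stdint.h>

  int
  main (void)
  {
    uint64_t const_op = 0xffffffffffffff80ULL;   /* canonical QImode 0x80 */
    uint64_t sign_bit = UINT64_C (1) << (8 - 1); /* 0x80 */
    uint64_t mode_mask = 0xff;                   /* GET_MODE_MASK (QImode) */
    assert (const_op != sign_bit);               /* unmasked check: missed */
    assert ((const_op & mode_mask) == sign_bit); /* masked check: matches */
    return 0;
  }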

gcc/ChangeLog:

* combine.cc (simplify_compare_const): Properly handle unsigned
constants while narrowing comparison of memory and constants.
---
 gcc/combine.cc | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/gcc/combine.cc b/gcc/combine.cc
index e46d202d0a7..468b7fde911 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -12003,14 +12003,15 @@ simplify_compare_const (enum rtx_code code, machine_mode mode,
   && !MEM_VOLATILE_P (op0)
   /* The optimization makes only sense for constants which are big enough
 so that we have a chance to chop off something at all.  */
-  && (unsigned HOST_WIDE_INT) const_op > 0xff
-  /* Bail out, if the constant does not fit into INT_MODE.  */
-  && (unsigned HOST_WIDE_INT) const_op
-< ((HOST_WIDE_INT_1U << (GET_MODE_PRECISION (int_mode) - 1) << 1) - 1)
+  && ((unsigned HOST_WIDE_INT) const_op & GET_MODE_MASK (int_mode)) > 0xff
   /* Ensure that we do not overflow during normalization.  */
-  && (code != GTU || (unsigned HOST_WIDE_INT) const_op < 
HOST_WIDE_INT_M1U))
+  && (code != GTU
+ || ((unsigned HOST_WIDE_INT) const_op & GET_MODE_MASK (int_mode))
+< HOST_WIDE_INT_M1U)
+  && trunc_int_for_mode (const_op, int_mode) == const_op)
 {
-  unsigned HOST_WIDE_INT n = (unsigned HOST_WIDE_INT) const_op;
+  unsigned HOST_WIDE_INT n
+   = (unsigned HOST_WIDE_INT) const_op & GET_MODE_MASK (int_mode);
   enum rtx_code adjusted_code;
 
   /* Normalize code to either LEU or GEU.  */
@@ -12051,15 +12052,15 @@ simplify_compare_const (enum rtx_code code, machine_mode mode,
HOST_WIDE_INT_PRINT_HEX ") to (MEM %s "
HOST_WIDE_INT_PRINT_HEX ").\n", GET_MODE_NAME (int_mode),
GET_MODE_NAME (narrow_mode_iter), GET_RTX_NAME (code),
-   (unsigned HOST_WIDE_INT)const_op, GET_RTX_NAME (adjusted_code),
-   n);
+   (unsigned HOST_WIDE_INT) const_op & GET_MODE_MASK (int_mode),
+   GET_RTX_NAME (adjusted_code), n);
}
  poly_int64 offset = (BYTES_BIG_ENDIAN
   ? 0
   : (GET_MODE_SIZE (int_mode)
  - GET_MODE_SIZE (narrow_mode_iter)));
  *pop0 = adjust_address_nv (op0, narrow_mode_iter, offset);
- *pop1 = GEN_INT (n);
+ *pop1 = gen_int_mode (n, narrow_mode_iter);
  return adjusted_code;
}
 }
-- 
2.41.0



[PATCH] rtl-optimization/110869 Fix tests cmp-mem-const-*.c for sparc

2023-08-07 Thread Stefan Schulze Frielinghaus via Gcc-patches
This fixes the rather new tests cmp-mem-const-{1,2,3,4,5,6}.c for sparc.
For -1 and -2 we need at least optimization level 2 on sparc.  For the
sake of homogeneity, change all test cases to -O2.  For -3 and -4 we do
not end up with a comparison of memory and a constant, and finally for
-5 and -6 the constants are reduced by a prior optimization, which means
there is nothing left to do.  Thus, sparc is excluded from those tests.

Ok for mainline?

gcc/testsuite/ChangeLog:

PR rtl-optimization/110869
* gcc.dg/cmp-mem-const-1.c: Use optimization level 2.
* gcc.dg/cmp-mem-const-2.c: Dito.
* gcc.dg/cmp-mem-const-3.c: Exclude sparc from this test.
* gcc.dg/cmp-mem-const-4.c: Dito.
* gcc.dg/cmp-mem-const-5.c: Dito.
* gcc.dg/cmp-mem-const-6.c: Dito.
---
 gcc/testsuite/gcc.dg/cmp-mem-const-1.c | 2 +-
 gcc/testsuite/gcc.dg/cmp-mem-const-2.c | 2 +-
 gcc/testsuite/gcc.dg/cmp-mem-const-3.c | 6 --
 gcc/testsuite/gcc.dg/cmp-mem-const-4.c | 6 --
 gcc/testsuite/gcc.dg/cmp-mem-const-5.c | 6 --
 gcc/testsuite/gcc.dg/cmp-mem-const-6.c | 6 --
 6 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/cmp-mem-const-1.c 
b/gcc/testsuite/gcc.dg/cmp-mem-const-1.c
index 4f21a1ade4a..0b0e7331354 100644
--- a/gcc/testsuite/gcc.dg/cmp-mem-const-1.c
+++ b/gcc/testsuite/gcc.dg/cmp-mem-const-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile { target { lp64 } } } */
-/* { dg-options "-O1 -fdump-rtl-combine-details" } */
+/* { dg-options "-O2 -fdump-rtl-combine-details" } */
 /* { dg-final { scan-rtl-dump "narrow comparison from mode .I to QI" "combine" } } */
 
 typedef __UINT64_TYPE__ uint64_t;
diff --git a/gcc/testsuite/gcc.dg/cmp-mem-const-2.c 
b/gcc/testsuite/gcc.dg/cmp-mem-const-2.c
index 7b722951594..8022137a8ec 100644
--- a/gcc/testsuite/gcc.dg/cmp-mem-const-2.c
+++ b/gcc/testsuite/gcc.dg/cmp-mem-const-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile { target { lp64 } } } */
-/* { dg-options "-O1 -fdump-rtl-combine-details" } */
+/* { dg-options "-O2 -fdump-rtl-combine-details" } */
 /* { dg-final { scan-rtl-dump "narrow comparison from mode .I to QI" "combine" } } */
 
 typedef __UINT64_TYPE__ uint64_t;
diff --git a/gcc/testsuite/gcc.dg/cmp-mem-const-3.c 
b/gcc/testsuite/gcc.dg/cmp-mem-const-3.c
index ed5059d3807..c60ecdb4026 100644
--- a/gcc/testsuite/gcc.dg/cmp-mem-const-3.c
+++ b/gcc/testsuite/gcc.dg/cmp-mem-const-3.c
@@ -1,5 +1,7 @@
-/* { dg-do compile { target { lp64 } } } */
-/* { dg-options "-O1 -fdump-rtl-combine-details" } */
+/* { dg-do compile { target { lp64 && { ! sparc*-*-* } } } } */
+/* Excluding sparc since there we do not end up with a comparison of memory and
+   a constant which means that the optimization is not applicable.  */
+/* { dg-options "-O2 -fdump-rtl-combine-details" } */
 /* { dg-final { scan-rtl-dump "narrow comparison from mode .I to HI" "combine" } } */
 
 typedef __UINT64_TYPE__ uint64_t;
diff --git a/gcc/testsuite/gcc.dg/cmp-mem-const-4.c 
b/gcc/testsuite/gcc.dg/cmp-mem-const-4.c
index 23e83372bee..7aa403d76d9 100644
--- a/gcc/testsuite/gcc.dg/cmp-mem-const-4.c
+++ b/gcc/testsuite/gcc.dg/cmp-mem-const-4.c
@@ -1,5 +1,7 @@
-/* { dg-do compile { target { lp64 } } } */
-/* { dg-options "-O1 -fdump-rtl-combine-details" } */
+/* { dg-do compile { target { lp64 && { ! sparc*-*-* } } } } */
+/* Excluding sparc since there we do not end up with a comparison of memory and
+   a constant which means that the optimization is not applicable.  */
+/* { dg-options "-O2 -fdump-rtl-combine-details" } */
 /* { dg-final { scan-rtl-dump "narrow comparison from mode .I to HI" "combine" } } */
 
 typedef __UINT64_TYPE__ uint64_t;
diff --git a/gcc/testsuite/gcc.dg/cmp-mem-const-5.c 
b/gcc/testsuite/gcc.dg/cmp-mem-const-5.c
index d266896a25e..4316dcb5605 100644
--- a/gcc/testsuite/gcc.dg/cmp-mem-const-5.c
+++ b/gcc/testsuite/gcc.dg/cmp-mem-const-5.c
@@ -1,5 +1,7 @@
-/* { dg-do compile { target { lp64 } && ! target { sparc*-*-* } } } */
-/* { dg-options "-O1 -fdump-rtl-combine-details" } */
+/* { dg-do compile { target { lp64 && { ! sparc*-*-* } } } } */
+/* Excluding sparc since there a prior optimization already reduced the
+   constant, i.e., nothing left for us.  */
+/* { dg-options "-O2 -fdump-rtl-combine-details" } */
 /* { dg-final { scan-rtl-dump "narrow comparison from mode .I to SI" "combine" } } */
 
 typedef __UINT64_TYPE__ uint64_t;
diff --git a/gcc/testsuite/gcc.dg/cmp-mem-const-6.c 
b/gcc/testsuite/gcc.dg/cmp-mem-const-6.c
index 68d7a9d0265..d9046af79eb 100644
--- a/gcc/testsuite/gcc.dg/cmp-mem-const-6.c
+++ b/gcc/testsuite/gcc.dg/cmp-mem-const-6.c
@@ -1,5 +1,7 @@
-/* { dg-do compile { target { lp64 } && ! target { sparc*-*-* } } } */
-/* { dg-options "-O1 -fdump-rtl-combine-details" } */
+/* { dg-do compile { target { lp64 && { ! sparc*-*-* } } } } */
+/* Excluding sparc since there a prior optimization already reduced the
+   constant, i.e., nothing left for us.  */
+/* { dg-options "-O2 -fdump-rtl-combine-details" } */
 

[PATCH] s390: Try to emit vlbr/vstbr instead of vperm et al.

2023-08-03 Thread Stefan Schulze Frielinghaus via Gcc-patches
Bootstrapped and regtested on s390x.  Ok for mainline?

gcc/ChangeLog:

* config/s390/s390.cc (expand_perm_as_a_vlbr_vstbr_candidate):
New function which handles bswap patterns for vec_perm_const.
(vectorize_vec_perm_const_1): Call new function.
* config/s390/vector.md (*bswap<mode>): Fix operands in output
template.
(*vstbr<mode>): New insn.

gcc/testsuite/ChangeLog:

* gcc.target/s390/s390.exp: Add subdirectory vxe2.
* gcc.target/s390/vxe2/vlbr-1.c: New test.
* gcc.target/s390/vxe2/vstbr-1.c: New test.
* gcc.target/s390/vxe2/vstbr-2.c: New test.
---
 gcc/config/s390/s390.cc  | 55 
 gcc/config/s390/vector.md| 16 --
 gcc/testsuite/gcc.target/s390/s390.exp   |  3 ++
 gcc/testsuite/gcc.target/s390/vxe2/vlbr-1.c  | 29 +++
 gcc/testsuite/gcc.target/s390/vxe2/vstbr-1.c | 29 +++
 gcc/testsuite/gcc.target/s390/vxe2/vstbr-2.c | 42 +++
 6 files changed, 170 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/s390/vxe2/vlbr-1.c
 create mode 100644 gcc/testsuite/gcc.target/s390/vxe2/vstbr-1.c
 create mode 100644 gcc/testsuite/gcc.target/s390/vxe2/vstbr-2.c

diff --git a/gcc/config/s390/s390.cc b/gcc/config/s390/s390.cc
index d9f10542473..91eb9232b10 100644
--- a/gcc/config/s390/s390.cc
+++ b/gcc/config/s390/s390.cc
@@ -17698,6 +17698,58 @@ expand_perm_with_vstbrq (const struct expand_vec_perm_d &d)
   return false;
 }
 
+/* Try to emit vlbr/vstbr.  Note, this is only a candidate insn since
+   TARGET_VECTORIZE_VEC_PERM_CONST operates on vector registers only.  Thus,
+   either fwprop, combine et al. "fixes" one of the input/output operands into
+   a memory operand or a splitter has to reverse this into a general vperm
+   operation.  */
+
+static bool
+expand_perm_as_a_vlbr_vstbr_candidate (const struct expand_vec_perm_d &d)
+{
+  static const char perm[4][MAX_VECT_LEN]
+= { { 1,  0,  3,  2,  5,  4,  7, 6, 9,  8,  11, 10, 13, 12, 15, 14 },
+   { 3,  2,  1,  0,  7,  6,  5, 4, 11, 10, 9,  8,  15, 14, 13, 12 },
+   { 7,  6,  5,  4,  3,  2,  1, 0, 15, 14, 13, 12, 11, 10, 9,  8  },
+   { 15, 14, 13, 12, 11, 10, 9, 8, 7,  6,  5,  4,  3,  2,  1,  0  } };
+
+  if (!TARGET_VXE2 || d.vmode != V16QImode || d.op0 != d.op1)
+return false;
+
+  if (memcmp (d.perm, perm[0], MAX_VECT_LEN) == 0)
+{
+  rtx target = gen_rtx_SUBREG (V8HImode, d.target, 0);
+  rtx op0 = gen_rtx_SUBREG (V8HImode, d.op0, 0);
+  emit_insn (gen_bswapv8hi (target, op0));
+  return true;
+}
+
+  if (memcmp (d.perm, perm[1], MAX_VECT_LEN) == 0)
+{
+  rtx target = gen_rtx_SUBREG (V4SImode, d.target, 0);
+  rtx op0 = gen_rtx_SUBREG (V4SImode, d.op0, 0);
+  emit_insn (gen_bswapv4si (target, op0));
+  return true;
+}
+
+  if (memcmp (d.perm, perm[2], MAX_VECT_LEN) == 0)
+{
+  rtx target = gen_rtx_SUBREG (V2DImode, d.target, 0);
+  rtx op0 = gen_rtx_SUBREG (V2DImode, d.op0, 0);
+  emit_insn (gen_bswapv2di (target, op0));
+  return true;
+}
+
+  if (memcmp (d.perm, perm[3], MAX_VECT_LEN) == 0)
+{
+  rtx target = gen_rtx_SUBREG (V1TImode, d.target, 0);
+  rtx op0 = gen_rtx_SUBREG (V1TImode, d.op0, 0);
+  emit_insn (gen_bswapv1ti (target, op0));
+  return true;
+}
+
+  return false;
+}
 
 /* Try to find the best sequence for the vector permute operation
described by D.  Return true if the operation could be
@@ -17720,6 +17772,9 @@ vectorize_vec_perm_const_1 (const struct expand_vec_perm_d &d)
   if (expand_perm_with_rot (d))
 return true;
 
+  if (expand_perm_as_a_vlbr_vstbr_candidate (d))
+return true;
+
   return false;
 }
 
diff --git a/gcc/config/s390/vector.md b/gcc/config/s390/vector.md
index 21bec729efa..f0e9ed3d263 100644
--- a/gcc/config/s390/vector.md
+++ b/gcc/config/s390/vector.md
@@ -47,6 +47,7 @@
 (define_mode_iterator VI_HW [V16QI V8HI V4SI V2DI])
 (define_mode_iterator VI_HW_QHS [V16QI V8HI V4SI])
 (define_mode_iterator VI_HW_HSD [V8HI  V4SI V2DI])
+(define_mode_iterator VI_HW_HSDT [V8HI V4SI V2DI V1TI TI])
 (define_mode_iterator VI_HW_HS  [V8HI  V4SI])
 (define_mode_iterator VI_HW_QH  [V16QI V8HI])
 
@@ -2876,12 +2877,12 @@
  (use (match_dup 2))])]
   "TARGET_VX"
 {
-  static char p[4][16] =
+  static const char p[4][16] =
 { { 1,  0,  3,  2,  5,  4,  7, 6, 9,  8,  11, 10, 13, 12, 15, 14 },   /* H */
   { 3,  2,  1,  0,  7,  6,  5, 4, 11, 10, 9,  8,  15, 14, 13, 12 },   /* S */
   { 7,  6,  5,  4,  3,  2,  1, 0, 15, 14, 13, 12, 11, 10, 9,  8  },   /* D */
   { 15, 14, 13, 12, 11, 10, 9, 8, 7,  6,  5,  4,  3,  2,  1,  0  } }; /* T */
-  char *perm;
+  const char *perm;
   rtx perm_rtx[16];
 
   switch (GET_MODE_SIZE (GET_MODE_INNER (mode)))
@@ -2933,8 +2934,8 @@
   "TARGET_VXE2"
   "@
#
-   vlbr<bhfgq>\t%v0,%v1
-   vstbr<bhfgq>\t%v1,%v0"
+   vlbr<bhfgq>\t%v0,%1
+   vstbr<bhfgq>\t%v1,%0"
   "&& reload_completed
&& !memory_operand 

[PATCH] s390: Enable vect_bswap test cases

2023-08-03 Thread Stefan Schulze Frielinghaus via Gcc-patches
This enables the following tests, which rely on the vperm instruction
that has been available since z13 with the initial vector support.

testsuite/gcc.dg/vect/vect-bswap16.c
42:/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_bswap || sse4_runtime } } } } */

testsuite/gcc.dg/vect/vect-bswap32.c
42:/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_bswap || sse4_runtime } } } } */

testsuite/gcc.dg/vect/vect-bswap64.c
42:/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_bswap || sse4_runtime } } } } */

Ok for mainline?

gcc/testsuite/ChangeLog:

* lib/target-supports.exp (check_effective_target_vect_bswap):
Add s390.
---
 gcc/testsuite/lib/target-supports.exp | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/gcc/testsuite/lib/target-supports.exp 
b/gcc/testsuite/lib/target-supports.exp
index 4d04df2a709..2ccc0291442 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -7087,9 +7087,11 @@ proc check_effective_target_whole_vector_shift { } {
 
 proc check_effective_target_vect_bswap { } {
 return [check_cached_effective_target_indexed vect_bswap {
-  expr { [istarget aarch64*-*-*]
-|| [is-effective-target arm_neon]
-|| [istarget amdgcn-*-*] }}]
+  expr { ([istarget aarch64*-*-*]
+ || [is-effective-target arm_neon]
+ || [istarget amdgcn-*-*])
+|| ([istarget s390*-*-*]
+&& [check_effective_target_s390_vx]) }}]
 }
 
 # Return 1 if the target supports comparison of bool vectors for at
-- 
2.41.0



[PATCH] PR combine/110867 Fix narrow comparison of memory and constant

2023-08-02 Thread Stefan Schulze Frielinghaus via Gcc-patches
In certain cases a constant may not fit into the mode used to perform a
comparison.  This may be the case for sign-extended constants which are
used during an unsigned comparison as e.g. in

(set (reg:CC 100 cc)
(compare:CC (mem:SI (reg/v/f:SI 115 [ a ]) [1 *a_4(D)+0 S4 A64])
(const_int -2147483648 [0xffffffff80000000])))

Fixed by ensuring that the constant fits into comparison mode.
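
As a standalone sketch of the new "fits" test for SImode (hypothetical
helper, not GCC source): the sign-extended constant -2147483648 appears
as the HOST_WIDE_INT pattern 0xffffffff80000000 and is rejected.

  #include <assert.h>
  #include <stdint.h>

  static int
  fits_simode_p (int64_t const_op)
  {
    /* ((1 << 31) << 1) - 1 == 0xffffffff, computed without a shift by 32.  */
    uint64_t max = (UINT64_C (1) << (32 - 1) << 1) - 1;
    return (uint64_t) const_op < max;
  }

  int
  main (void)
  {
    assert (!fits_simode_p (-2147483648LL)); /* 0xffffffff80000000: bail */
    assert (fits_simode_p (0x7fffffff));
    return 0;
  }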

Furthermore, on some targets, e.g. sparc, the constant used in a
comparison is chopped off before combine, which leads to failing test
cases (see PR 110869).  Fixed by not requiring that the source mode be
DImode, and by excluding sparc from the last two test cases entirely,
since there the constant cannot be further reduced.

According to PR 110867 and 110869 this patch resolves bootstrap problems
on armv8l and sparc.  While writing this, bootstrap+regtest are still
running on x64 and s390x.  Assuming they pass, ok for mainline?

gcc/ChangeLog:

PR combine/110867
* combine.cc (simplify_compare_const): Try the optimization only
in case the constant fits into the comparison mode.

gcc/testsuite/ChangeLog:

PR combine/110869
* gcc.dg/cmp-mem-const-1.c: Relax mode for constant.
* gcc.dg/cmp-mem-const-2.c: Relax mode for constant.
* gcc.dg/cmp-mem-const-3.c: Relax mode for constant.
* gcc.dg/cmp-mem-const-4.c: Relax mode for constant.
* gcc.dg/cmp-mem-const-5.c: Exclude sparc since here the
constant is already reduced.
* gcc.dg/cmp-mem-const-6.c: Exclude sparc since here the
constant is already reduced.
---
 gcc/combine.cc | 4 
 gcc/testsuite/gcc.dg/cmp-mem-const-1.c | 2 +-
 gcc/testsuite/gcc.dg/cmp-mem-const-2.c | 2 +-
 gcc/testsuite/gcc.dg/cmp-mem-const-3.c | 2 +-
 gcc/testsuite/gcc.dg/cmp-mem-const-4.c | 2 +-
 gcc/testsuite/gcc.dg/cmp-mem-const-5.c | 4 ++--
 gcc/testsuite/gcc.dg/cmp-mem-const-6.c | 4 ++--
 7 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/gcc/combine.cc b/gcc/combine.cc
index 0d99fa541c5..e46d202d0a7 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -11998,11 +11998,15 @@ simplify_compare_const (enum rtx_code code, machine_mode mode,
  x0 >= 0x40.  */
   if ((code == LEU || code == LTU || code == GEU || code == GTU)
   && is_a <scalar_int_mode> (GET_MODE (op0), &int_mode)
+  && HWI_COMPUTABLE_MODE_P (int_mode)
   && MEM_P (op0)
   && !MEM_VOLATILE_P (op0)
   /* The optimization makes only sense for constants which are big enough
  so that we have a chance to chop off something at all.  */
   && (unsigned HOST_WIDE_INT) const_op > 0xff
+  /* Bail out, if the constant does not fit into INT_MODE.  */
+  && (unsigned HOST_WIDE_INT) const_op
+< ((HOST_WIDE_INT_1U << (GET_MODE_PRECISION (int_mode) - 1) << 1) - 1)
   /* Ensure that we do not overflow during normalization.  */
   && (code != GTU || (unsigned HOST_WIDE_INT) const_op < HOST_WIDE_INT_M1U))
 {
diff --git a/gcc/testsuite/gcc.dg/cmp-mem-const-1.c 
b/gcc/testsuite/gcc.dg/cmp-mem-const-1.c
index 263ad98af79..4f21a1ade4a 100644
--- a/gcc/testsuite/gcc.dg/cmp-mem-const-1.c
+++ b/gcc/testsuite/gcc.dg/cmp-mem-const-1.c
@@ -1,6 +1,6 @@
 /* { dg-do compile { target { lp64 } } } */
 /* { dg-options "-O1 -fdump-rtl-combine-details" } */
-/* { dg-final { scan-rtl-dump "narrow comparison from mode DI to QI" "combine" } } */
+/* { dg-final { scan-rtl-dump "narrow comparison from mode .I to QI" "combine" } } */
 
 typedef __UINT64_TYPE__ uint64_t;
 
diff --git a/gcc/testsuite/gcc.dg/cmp-mem-const-2.c 
b/gcc/testsuite/gcc.dg/cmp-mem-const-2.c
index a7cc5348295..7b722951594 100644
--- a/gcc/testsuite/gcc.dg/cmp-mem-const-2.c
+++ b/gcc/testsuite/gcc.dg/cmp-mem-const-2.c
@@ -1,6 +1,6 @@
 /* { dg-do compile { target { lp64 } } } */
 /* { dg-options "-O1 -fdump-rtl-combine-details" } */
-/* { dg-final { scan-rtl-dump "narrow comparison from mode DI to QI" "combine" } } */
+/* { dg-final { scan-rtl-dump "narrow comparison from mode .I to QI" "combine" } } */
 
 typedef __UINT64_TYPE__ uint64_t;
 
diff --git a/gcc/testsuite/gcc.dg/cmp-mem-const-3.c 
b/gcc/testsuite/gcc.dg/cmp-mem-const-3.c
index 06f80bf72d8..ed5059d3807 100644
--- a/gcc/testsuite/gcc.dg/cmp-mem-const-3.c
+++ b/gcc/testsuite/gcc.dg/cmp-mem-const-3.c
@@ -1,6 +1,6 @@
 /* { dg-do compile { target { lp64 } } } */
 /* { dg-options "-O1 -fdump-rtl-combine-details" } */
-/* { dg-final { scan-rtl-dump "narrow comparison from mode DI to HI" "combine" } } */
+/* { dg-final { scan-rtl-dump "narrow comparison from mode .I to HI" "combine" } } */
 
 typedef __UINT64_TYPE__ uint64_t;
 
diff --git a/gcc/testsuite/gcc.dg/cmp-mem-const-4.c 
b/gcc/testsuite/gcc.dg/cmp-mem-const-4.c
index 407999abf7e..23e83372bee 100644
--- a/gcc/testsuite/gcc.dg/cmp-mem-const-4.c
+++ b/gcc/testsuite/gcc.dg/cmp-mem-const-4.c
@@ -1,6 +1,6 @@
 /* { dg-do compile { target { lp64 } } } */
 /* { dg-options "-O1 -fdump-rtl-combine-details" } */
-/* { 

Re: [PATCH v2] combine: Narrow comparison of memory and constant

2023-08-01 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Tue, Aug 01, 2023 at 01:52:16PM +0530, Prathamesh Kulkarni wrote:
> On Tue, 1 Aug 2023 at 05:20, Jeff Law  wrote:
> >
> >
> >
> > On 7/31/23 15:43, Prathamesh Kulkarni via Gcc-patches wrote:
> > > On Mon, 19 Jun 2023 at 19:59, Stefan Schulze Frielinghaus via
> > > Gcc-patches  wrote:
> > >>
> > >> Comparisons between memory and constants might be done in a smaller mode
> > >> resulting in smaller constants which might finally end up as immediates
> > >> instead of in the literal pool.
> > >>
> > >> For example, on s390x a non-symmetric comparison like
> > >>x <= 0x3fffffffffffffff
> > >> results in the constant being spilled to the literal pool and an 8 byte
> > >> memory comparison is emitted.  Ideally, an equivalent comparison
> > >>x0 <= 0x3f
> > >> where x0 is the most significant byte of x, is emitted where the
> > >> constant is smaller and more likely to materialize as an immediate.
> > >>
> > >> Similarly, comparisons of the form
> > >>x >= 0x4000000000000000
> > >> can be shortened into x0 >= 0x40.
> > >>
> > >> Bootstrapped and regtested on s390x, x64, aarch64, and powerpc64le.
> > >> Note, the new tests show that for the mentioned little-endian targets
> > >> the optimization does not materialize since either the costs of the new
> > >> instructions are higher or they do not match.  Still ok for mainline?
> > > Hi Stefan,
> > > Unfortunately this patch (committed in 
> > > 7cdd0860949c6c3232e6cff1d7ca37bb5234074c)
> > > caused the following ICE on armv8l-unknown-linux-gnu:
> > > during RTL pass: combine
> > > ../../../gcc/libgcc/fixed-bit.c: In function ‘__gnu_saturate1sq’:
> > > ../../../gcc/libgcc/fixed-bit.c:210:1: internal compiler error: in
> > > decompose, at rtl.h:2297
> > >210 | }
> > >| ^
> > > 0xaa23e3 wi::int_traits<std::pair<rtx_def*, machine_mode> >::decompose(long long*, unsigned int, std::pair<rtx_def*, machine_mode> const&)
> > >  ../../gcc/gcc/rtl.h:2297
> > [ ... ]
> > Yea, we're seeing something very similar on nios2-linux-gnu building the
> > kernel.
> >
> > Prathamesh, can you extract the .i file for fixed-bit on armv8 and open
> > a bug for this issue, attaching the .i file as well as the right command
> > line options necessary to reproduce the failure.  That way Stefan can
> > tackle it with a cross compiler.
> Hi Jeff,
> Filed the issue in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110867

Hi Prathamesh,

Sorry for the inconvenience.  I will have a look at this; thanks for
the small reproducer.  I have already started setting up a cross
compiler.

Thanks,
Stefan

> 
> Thanks,
> Prathamesh
> >
> > Thanks,
> > jeff


Re: [PATCH v2] combine: Narrow comparison of memory and constant

2023-07-31 Thread Stefan Schulze Frielinghaus via Gcc-patches
ping

On Mon, Jun 19, 2023 at 04:23:57PM +0200, Stefan Schulze Frielinghaus wrote:
> Comparisons between memory and constants might be done in a smaller mode
> resulting in smaller constants which might finally end up as immediates
> instead of in the literal pool.
> 
> For example, on s390x a non-symmetric comparison like
>   x <= 0x3fffffffffffffff
> results in the constant being spilled to the literal pool and an 8 byte
> memory comparison is emitted.  Ideally, an equivalent comparison
>   x0 <= 0x3f
> where x0 is the most significant byte of x, is emitted where the
> constant is smaller and more likely to materialize as an immediate.
> 
> Similarly, comparisons of the form
>   x >= 0x4000000000000000
> can be shortened into x0 >= 0x40.
> 
> Bootstrapped and regtested on s390x, x64, aarch64, and powerpc64le.
> Note, the new tests show that for the mentioned little-endian targets
> the optimization does not materialize since either the costs of the new
> instructions are higher or they do not match.  Still ok for mainline?
> 
> gcc/ChangeLog:
> 
>   * combine.cc (simplify_compare_const): Narrow comparison of
>   memory and constant.
>   (try_combine): Adapt new function signature.
>   (simplify_comparison): Adapt new function signature.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/cmp-mem-const-1.c: New test.
>   * gcc.dg/cmp-mem-const-2.c: New test.
>   * gcc.dg/cmp-mem-const-3.c: New test.
>   * gcc.dg/cmp-mem-const-4.c: New test.
>   * gcc.dg/cmp-mem-const-5.c: New test.
>   * gcc.dg/cmp-mem-const-6.c: New test.
>   * gcc.target/s390/cmp-mem-const-1.c: New test.
> ---
>  gcc/combine.cc| 79 +--
>  gcc/testsuite/gcc.dg/cmp-mem-const-1.c| 17 
>  gcc/testsuite/gcc.dg/cmp-mem-const-2.c| 17 
>  gcc/testsuite/gcc.dg/cmp-mem-const-3.c| 17 
>  gcc/testsuite/gcc.dg/cmp-mem-const-4.c| 17 
>  gcc/testsuite/gcc.dg/cmp-mem-const-5.c| 17 
>  gcc/testsuite/gcc.dg/cmp-mem-const-6.c| 17 
>  .../gcc.target/s390/cmp-mem-const-1.c | 24 ++
>  8 files changed, 200 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/cmp-mem-const-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/cmp-mem-const-2.c
>  create mode 100644 gcc/testsuite/gcc.dg/cmp-mem-const-3.c
>  create mode 100644 gcc/testsuite/gcc.dg/cmp-mem-const-4.c
>  create mode 100644 gcc/testsuite/gcc.dg/cmp-mem-const-5.c
>  create mode 100644 gcc/testsuite/gcc.dg/cmp-mem-const-6.c
>  create mode 100644 gcc/testsuite/gcc.target/s390/cmp-mem-const-1.c
> 
> diff --git a/gcc/combine.cc b/gcc/combine.cc
> index 5aa0ec5c45a..56e15a93409 100644
> --- a/gcc/combine.cc
> +++ b/gcc/combine.cc
> @@ -460,7 +460,7 @@ static rtx simplify_shift_const (rtx, enum rtx_code, machine_mode, rtx,
>  static int recog_for_combine (rtx *, rtx_insn *, rtx *);
>  static rtx gen_lowpart_for_combine (machine_mode, rtx);
>  static enum rtx_code simplify_compare_const (enum rtx_code, machine_mode,
> -  rtx, rtx *);
> +  rtx *, rtx *);
>  static enum rtx_code simplify_comparison (enum rtx_code, rtx *, rtx *);
>  static void update_table_tick (rtx);
>  static void record_value_for_reg (rtx, rtx_insn *, rtx);
> @@ -3185,7 +3185,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, 
> rtx_insn *i0,
> compare_code = orig_compare_code = GET_CODE (*cc_use_loc);
> if (is_a <scalar_int_mode> (GET_MODE (i2dest), &mode))
>   compare_code = simplify_compare_const (compare_code, mode,
> -op0, &op1);
> +&op0, &op1);
> target_canonicalize_comparison (&compare_code, &op0, &op1, 1);
>   }
>  
> @@ -11796,13 +11796,14 @@ gen_lowpart_for_combine (machine_mode omode, rtx x)
> (CODE OP0 const0_rtx) form.
>  
> The result is a possibly different comparison code to use.
> -   *POP1 may be updated.  */
> +   *POP0 and *POP1 may be updated.  */
>  
>  static enum rtx_code
>  simplify_compare_const (enum rtx_code code, machine_mode mode,
> - rtx op0, rtx *pop1)
> + rtx *pop0, rtx *pop1)
>  {
>scalar_int_mode int_mode;
> +  rtx op0 = *pop0;
>HOST_WIDE_INT const_op = INTVAL (*pop1);
>  
>/* Get the constant we are comparing against and turn off all bits
> @@ -11987,6 +11988,74 @@ simplify_compare_const (enum rtx_code code, 
> machine_mode mode,
>break;
>  }
>  
> +  /* Narrow non-symmetric comparison of memory and constant as e.g.
> + x0...x7 <= 0x3fffffffffffffff into x0 <= 0x3f where x0 is the most
> + significant byte.  Likewise, transform x0...x7 >= 0x4000000000000000
> + into x0 >= 0x40.  */
> +  if ((code == LEU || code == LTU || code == GEU || code == GTU)
> +  && is_a <scalar_int_mode> (GET_MODE (op0), &int_mode)
> +  && MEM_P (op0)
> +  && !MEM_VOLATILE_P (op0)
> +  /* The 

[PATCH v2] combine: Narrow comparison of memory and constant

2023-06-19 Thread Stefan Schulze Frielinghaus via Gcc-patches
Comparisons between memory and constants might be done in a smaller mode
resulting in smaller constants which might finally end up as immediates
instead of in the literal pool.

For example, on s390x a non-symmetric comparison like
  x <= 0x3fffffffffffffff
results in the constant being spilled to the literal pool and an 8 byte
memory comparison is emitted.  Ideally, an equivalent comparison
  x0 <= 0x3f
where x0 is the most significant byte of x, is emitted where the
constant is smaller and more likely to materialize as an immediate.

Similarly, comparisons of the form
  x >= 0x4000000000000000
can be shortened into x0 >= 0x40.

Bootstrapped and regtested on s390x, x64, aarch64, and powerpc64le.
Note, the new tests show that for the mentioned little-endian targets
the optimization does not materialize since either the costs of the new
instructions are higher or they do not match.  Still ok for mainline?

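As a quick sanity check of the equivalences the transformation relies
on, here is a small self-contained C sketch (not part of the patch;
scaled down to 16-bit values so it can be brute-forced) verifying that
an all-ones respectively all-zeros tail makes the byte comparison exact:

#include <assert.h>

int
main (void)
{
  /* x <= 0x3dff is equivalent to x0 <= 0x3d, and x >= 0x4000 to
     x0 >= 0x40, where x0 = x >> 8 is the most significant byte.  */
  for (unsigned x = 0; x <= 0xffff; ++x)
    {
      unsigned x0 = x >> 8;
      assert ((x <= 0x3dffu) == (x0 <= 0x3du));
      assert ((x >= 0x4000u) == (x0 >= 0x40u));
    }
  return 0;
}
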
gcc/ChangeLog:

* combine.cc (simplify_compare_const): Narrow comparison of
memory and constant.
(try_combine): Adapt new function signature.
(simplify_comparison): Adapt new function signature.

gcc/testsuite/ChangeLog:

* gcc.dg/cmp-mem-const-1.c: New test.
* gcc.dg/cmp-mem-const-2.c: New test.
* gcc.dg/cmp-mem-const-3.c: New test.
* gcc.dg/cmp-mem-const-4.c: New test.
* gcc.dg/cmp-mem-const-5.c: New test.
* gcc.dg/cmp-mem-const-6.c: New test.
* gcc.target/s390/cmp-mem-const-1.c: New test.
---
 gcc/combine.cc| 79 +--
 gcc/testsuite/gcc.dg/cmp-mem-const-1.c| 17 
 gcc/testsuite/gcc.dg/cmp-mem-const-2.c| 17 
 gcc/testsuite/gcc.dg/cmp-mem-const-3.c| 17 
 gcc/testsuite/gcc.dg/cmp-mem-const-4.c| 17 
 gcc/testsuite/gcc.dg/cmp-mem-const-5.c| 17 
 gcc/testsuite/gcc.dg/cmp-mem-const-6.c| 17 
 .../gcc.target/s390/cmp-mem-const-1.c | 24 ++
 8 files changed, 200 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/cmp-mem-const-1.c
 create mode 100644 gcc/testsuite/gcc.dg/cmp-mem-const-2.c
 create mode 100644 gcc/testsuite/gcc.dg/cmp-mem-const-3.c
 create mode 100644 gcc/testsuite/gcc.dg/cmp-mem-const-4.c
 create mode 100644 gcc/testsuite/gcc.dg/cmp-mem-const-5.c
 create mode 100644 gcc/testsuite/gcc.dg/cmp-mem-const-6.c
 create mode 100644 gcc/testsuite/gcc.target/s390/cmp-mem-const-1.c

diff --git a/gcc/combine.cc b/gcc/combine.cc
index 5aa0ec5c45a..56e15a93409 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -460,7 +460,7 @@ static rtx simplify_shift_const (rtx, enum rtx_code, 
machine_mode, rtx,
 static int recog_for_combine (rtx *, rtx_insn *, rtx *);
 static rtx gen_lowpart_for_combine (machine_mode, rtx);
 static enum rtx_code simplify_compare_const (enum rtx_code, machine_mode,
-rtx, rtx *);
+rtx *, rtx *);
 static enum rtx_code simplify_comparison (enum rtx_code, rtx *, rtx *);
 static void update_table_tick (rtx);
 static void record_value_for_reg (rtx, rtx_insn *, rtx);
@@ -3185,7 +3185,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, 
rtx_insn *i0,
  compare_code = orig_compare_code = GET_CODE (*cc_use_loc);
  if (is_a <scalar_int_mode> (GET_MODE (i2dest), &mode))
compare_code = simplify_compare_const (compare_code, mode,
-  op0, &op1);
+  &op0, &op1);
  target_canonicalize_comparison (&compare_code, &op0, &op1, 1);
}
 
@@ -11796,13 +11796,14 @@ gen_lowpart_for_combine (machine_mode omode, rtx x)
(CODE OP0 const0_rtx) form.
 
The result is a possibly different comparison code to use.
-   *POP1 may be updated.  */
+   *POP0 and *POP1 may be updated.  */
 
 static enum rtx_code
 simplify_compare_const (enum rtx_code code, machine_mode mode,
-   rtx op0, rtx *pop1)
+   rtx *pop0, rtx *pop1)
 {
   scalar_int_mode int_mode;
+  rtx op0 = *pop0;
   HOST_WIDE_INT const_op = INTVAL (*pop1);
 
   /* Get the constant we are comparing against and turn off all bits
@@ -11987,6 +11988,74 @@ simplify_compare_const (enum rtx_code code, 
machine_mode mode,
   break;
 }
 
+  /* Narrow non-symmetric comparison of memory and constant as e.g.
+ x0...x7 <= 0x3fffffffffffffff into x0 <= 0x3f where x0 is the most
+ significant byte.  Likewise, transform x0...x7 >= 0x4000000000000000 into
+ x0 >= 0x40.  */
+  if ((code == LEU || code == LTU || code == GEU || code == GTU)
+  && is_a <scalar_int_mode> (GET_MODE (op0), &int_mode)
+  && MEM_P (op0)
+  && !MEM_VOLATILE_P (op0)
+  /* The optimization only makes sense for constants which are big enough
+so that we have a chance to chop off something at all.  */
+  && (unsigned HOST_WIDE_INT) const_op > 0xff
+  /* Ensure that we do not overflow during normalization.  */
+  && (code 

Re: [PATCH] combine: Narrow comparison of memory and constant

2023-06-19 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Mon, Jun 12, 2023 at 03:29:00PM -0600, Jeff Law wrote:
> 
> 
> On 6/12/23 01:57, Stefan Schulze Frielinghaus via Gcc-patches wrote:
> > Comparisons between memory and constants might be done in a smaller mode
> > resulting in smaller constants which might finally end up as immediates
> > instead of in the literal pool.
> > 
> > For example, on s390x a non-symmetric comparison like
> >x <= 0x3fffffffffffffff
> > results in the constant being spilled to the literal pool and an 8 byte
> > memory comparison is emitted.  Ideally, an equivalent comparison
> >x0 <= 0x3f
> > where x0 is the most significant byte of x, is emitted where the
> > constant is smaller and more likely to materialize as an immediate.
> > 
> > Similarly, comparisons of the form
> >x >= 0x4000000000000000
> > can be shortened into x0 >= 0x40.
> > 
> > I'm not entirely sure whether combine is the right place to implement
> > something like this.  In my first try I implemented it in
> > TARGET_CANONICALIZE_COMPARISON but then thought other targets might
> > profit from it, too.  simplify_context::simplify_relational_operation_1
> > seems to be the wrong place since code/mode may change.  Any opinions?
> > 
> > gcc/ChangeLog:
> > 
> > * combine.cc (simplify_compare_const): Narrow comparison of
> > memory and constant.
> > (try_combine): Adapt new function signature.
> > (simplify_comparison): Adapt new function signature.
> > 
> > gcc/testsuite/ChangeLog:
> > 
> > * gcc.target/s390/cmp-mem-const-1.c: New test.
> > * gcc.target/s390/cmp-mem-const-2.c: New test.
> This does seem more general than we'd want to do in the canonicalization
> hook.  So thanks for going the extra mile and doing a generic
> implementation.
> 
> 
> 
> 
> > @@ -11987,6 +11988,79 @@ simplify_compare_const (enum rtx_code code, 
> > machine_mode mode,
> > break;
> >   }
> > +  /* Narrow non-symmetric comparison of memory and constant as e.g.
> > + x0...x7 <= 0x3fffffffffffffff into x0 <= 0x3f where x0 is the most
> > + significant byte.  Likewise, transform x0...x7 >= 0x4000000000000000
> > + into x0 >= 0x40.  */
> > +  if ((code == LEU || code == LTU || code == GEU || code == GTU)
> > +  && is_a <scalar_int_mode> (GET_MODE (op0), &int_mode)
> > +  && MEM_P (op0)
> > +  && !MEM_VOLATILE_P (op0)
> > +  && (unsigned HOST_WIDE_INT)const_op > 0xff)
> > +{
> > +  unsigned HOST_WIDE_INT n = (unsigned HOST_WIDE_INT)const_op;
> > +  enum rtx_code adjusted_code = code;
> > +
> > +  /* If the least significant bit is already zero, then adjust the
> > +comparison in the hope that we hit cases like
> > +  op0  <= 0x3dfe
> > +where the adjusted comparison
> > +  op0  <  0x3dff
> > +can be shortened into
> > +  op0' <  0x3d.  */
> > +  if (code == LEU && (n & 1) == 0)
> > +   {
> > + ++n;
> > + adjusted_code = LTU;
> > +   }
> > +  /* or e.g. op0 < 0x4020  */
> > +  else if (code == LTU && (n & 1) == 0)
> > +   {
> > + --n;
> > + adjusted_code = LEU;
> > +   }
> > +  /* or op0 >= 0x4001  */
> > +  else if (code == GEU && (n & 1) == 1)
> > +   {
> > + --n;
> > + adjusted_code = GTU;
> > +   }
> > +  /* or op0 > 0x3fff.  */
> > +  else if (code == GTU && (n & 1) == 1)
> > +   {
> > + ++n;
> > + adjusted_code = GEU;
> > +   }
> > +
> > +  scalar_int_mode narrow_mode_iter;
> > +  bool lower_p = code == LEU || code == LTU;
> > +  bool greater_p = !lower_p;
> > +  FOR_EACH_MODE_UNTIL (narrow_mode_iter, int_mode)
> > +   {
> > + unsigned nbits = GET_MODE_PRECISION (int_mode)
> > + - GET_MODE_PRECISION (narrow_mode_iter);
> > + unsigned HOST_WIDE_INT mask = (HOST_WIDE_INT_1U << nbits) - 1;
> > + unsigned HOST_WIDE_INT lower_bits = n & mask;
> > + if ((lower_p && lower_bits == mask)
> > + || (greater_p && lower_bits == 0))
> > +   {
> > + n >>= nbits;
> > + break;
> > +   }
> > +   }
> > +
> > +  if (narrow_mode_iter < int_mode)
> > +   {
> > + poly_int64 offset = BYTES_BIG_ENDIAN
> > +

[PATCH] combine: Narrow comparison of memory and constant

2023-06-12 Thread Stefan Schulze Frielinghaus via Gcc-patches
Comparisons between memory and constants might be done in a smaller mode
resulting in smaller constants which might finally end up as immediates
instead of in the literal pool.

For example, on s390x a non-symmetric comparison like
  x <= 0x3fffffffffffffff
results in the constant being spilled to the literal pool and an 8 byte
memory comparison is emitted.  Ideally, an equivalent comparison
  x0 <= 0x3f
where x0 is the most significant byte of x, is emitted where the
constant is smaller and more likely to materialize as an immediate.

Similarly, comparisons of the form
  x >= 0x4000000000000000
can be shortened into x0 >= 0x40.

I'm not entirely sure whether combine is the right place to implement
something like this.  In my first try I implemented it in
TARGET_CANONICALIZE_COMPARISON but then thought other targets might
profit from it, too.  simplify_context::simplify_relational_operation_1
seems to be the wrong place since code/mode may change.  Any opinions?

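For reference, a minimal function that runs into the case described
above (a hypothetical example; the exact instruction sequence depends
on target costs):

/* Without narrowing, the 64-bit constant is loaded from the literal
   pool and an 8 byte memory comparison is emitted; with narrowing,
   only the most significant byte of *x is compared against the
   immediate 0x3f, e.g. via CLI on s390x.  */
int
le (unsigned long long *x)
{
  return *x <= 0x3fffffffffffffffULL;
}
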
gcc/ChangeLog:

* combine.cc (simplify_compare_const): Narrow comparison of
memory and constant.
(try_combine): Adapt new function signature.
(simplify_comparison): Adapt new function signature.

gcc/testsuite/ChangeLog:

* gcc.target/s390/cmp-mem-const-1.c: New test.
* gcc.target/s390/cmp-mem-const-2.c: New test.
---
 gcc/combine.cc| 82 ++-
 .../gcc.target/s390/cmp-mem-const-1.c | 99 +++
 .../gcc.target/s390/cmp-mem-const-2.c | 23 +
 3 files changed, 200 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/s390/cmp-mem-const-1.c
 create mode 100644 gcc/testsuite/gcc.target/s390/cmp-mem-const-2.c

diff --git a/gcc/combine.cc b/gcc/combine.cc
index 5aa0ec5c45a..6ad1600dc1b 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -460,7 +460,7 @@ static rtx simplify_shift_const (rtx, enum rtx_code, 
machine_mode, rtx,
 static int recog_for_combine (rtx *, rtx_insn *, rtx *);
 static rtx gen_lowpart_for_combine (machine_mode, rtx);
 static enum rtx_code simplify_compare_const (enum rtx_code, machine_mode,
-rtx, rtx *);
+rtx *, rtx *);
 static enum rtx_code simplify_comparison (enum rtx_code, rtx *, rtx *);
 static void update_table_tick (rtx);
 static void record_value_for_reg (rtx, rtx_insn *, rtx);
@@ -3185,7 +3185,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, 
rtx_insn *i0,
  compare_code = orig_compare_code = GET_CODE (*cc_use_loc);
  if (is_a <scalar_int_mode> (GET_MODE (i2dest), &mode))
compare_code = simplify_compare_const (compare_code, mode,
-  op0, &op1);
+  &op0, &op1);
  target_canonicalize_comparison (&compare_code, &op0, &op1, 1);
}
 
@@ -11800,9 +11800,10 @@ gen_lowpart_for_combine (machine_mode omode, rtx x)
 
 static enum rtx_code
 simplify_compare_const (enum rtx_code code, machine_mode mode,
-   rtx op0, rtx *pop1)
+   rtx *pop0, rtx *pop1)
 {
   scalar_int_mode int_mode;
+  rtx op0 = *pop0;
   HOST_WIDE_INT const_op = INTVAL (*pop1);
 
   /* Get the constant we are comparing against and turn off all bits
@@ -11987,6 +11988,79 @@ simplify_compare_const (enum rtx_code code, 
machine_mode mode,
   break;
 }
 
+  /* Narrow non-symmetric comparison of memory and constant as e.g.
+ x0...x7 <= 0x3fffffffffffffff into x0 <= 0x3f where x0 is the most
+ significant byte.  Likewise, transform x0...x7 >= 0x4000000000000000 into
+ x0 >= 0x40.  */
+  if ((code == LEU || code == LTU || code == GEU || code == GTU)
+  && is_a <scalar_int_mode> (GET_MODE (op0), &int_mode)
+  && MEM_P (op0)
+  && !MEM_VOLATILE_P (op0)
+  && (unsigned HOST_WIDE_INT)const_op > 0xff)
+{
+  unsigned HOST_WIDE_INT n = (unsigned HOST_WIDE_INT)const_op;
+  enum rtx_code adjusted_code = code;
+
+  /* If the least significant bit is already zero, then adjust the
+comparison in the hope that we hit cases like
+  op0  <= 0x3dfe
+where the adjusted comparison
+  op0  <  0x3dff
+can be shortened into
+  op0' <  0x3d.  */
+  if (code == LEU && (n & 1) == 0)
+   {
+ ++n;
+ adjusted_code = LTU;
+   }
+  /* or e.g. op0 < 0x4020  */
+  else if (code == LTU && (n & 1) == 0)
+   {
+ --n;
+ adjusted_code = LEU;
+   }
+  /* or op0 >= 0x4001  */
+  else if (code == GEU && (n & 1) == 1)
+   {
+ --n;
+ adjusted_code = GTU;
+   }
+  /* or op0 > 0x3fff.  */
+  else if (code == GTU && (n & 1) == 1)
+   {
+ ++n;
+ adjusted_code = GEU;
+   }
+
+  scalar_int_mode narrow_mode_iter;
+  bool lower_p = code == LEU || code == LTU;
+  bool greater_p = 

[PATCH] s390: Implement TARGET_ATOMIC_ALIGN_FOR_MODE

2023-05-16 Thread Stefan Schulze Frielinghaus via Gcc-patches
So far atomic objects are aligned according to their default alignment.
For 128 bit scalar types like __int128 or long double this results in an
8 byte alignment which is wrong and must be 16 bytes.

libstdc++ already computes a correct alignment, though; still, a test
case is added in order to make sure that both implementations are
compatible.

Bootstrapped and regtested.  Ok for mainline?  Since this is an ABI
break, is a backport to GCC 13 reasonable?

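For illustration, the user-visible effect of the hook (a sketch, not
one of the new tests):

/* With TARGET_ATOMIC_ALIGN_FOR_MODE returning the mode's bit size, a
   16 byte atomic object is 16 byte aligned, although the default
   alignment of __int128 on s390x is only 8 bytes.  */
#include <stdio.h>

int
main (void)
{
  printf ("%zu\n", _Alignof (_Atomic __int128)); /* 16 with this patch */
  printf ("%zu\n", _Alignof (__int128));         /* 8 */
  return 0;
}
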
gcc/ChangeLog:

* config/s390/s390.cc (TARGET_ATOMIC_ALIGN_FOR_MODE):
New.
(s390_atomic_align_for_mode): New.

gcc/testsuite/ChangeLog:

* g++.target/s390/atomic-align-1.C: New test.
* gcc.target/s390/atomic-align-1.c: New test.
* gcc.target/s390/atomic-align-2.c: New test.
---
 gcc/config/s390/s390.cc   |  8 ++
 .../g++.target/s390/atomic-align-1.C  | 25 +++
 .../gcc.target/s390/atomic-align-1.c  | 23 +
 .../gcc.target/s390/atomic-align-2.c  | 18 +
 4 files changed, 74 insertions(+)
 create mode 100644 gcc/testsuite/g++.target/s390/atomic-align-1.C
 create mode 100644 gcc/testsuite/gcc.target/s390/atomic-align-1.c
 create mode 100644 gcc/testsuite/gcc.target/s390/atomic-align-2.c

diff --git a/gcc/config/s390/s390.cc b/gcc/config/s390/s390.cc
index 505de995da8..4813bf91dc4 100644
--- a/gcc/config/s390/s390.cc
+++ b/gcc/config/s390/s390.cc
@@ -450,6 +450,14 @@ s390_preserve_fpr_arg_p (int regno)
  && regno >= FPR0_REGNUM);
 }
 
+#undef TARGET_ATOMIC_ALIGN_FOR_MODE
+#define TARGET_ATOMIC_ALIGN_FOR_MODE s390_atomic_align_for_mode
+static unsigned int
+s390_atomic_align_for_mode (machine_mode mode)
+{
+  return GET_MODE_BITSIZE (mode);
+}
+
 /* A couple of shortcuts.  */
 #define CONST_OK_FOR_J(x) \
CONST_OK_FOR_CONSTRAINT_P((x), 'J', "J")
diff --git a/gcc/testsuite/g++.target/s390/atomic-align-1.C 
b/gcc/testsuite/g++.target/s390/atomic-align-1.C
new file mode 100644
index 000..43aa0bc39ed
--- /dev/null
+++ b/gcc/testsuite/g++.target/s390/atomic-align-1.C
@@ -0,0 +1,25 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-std=c++11" } */
+/* { dg-final { scan-assembler-times {\.align\t2} 2 } } */
+/* { dg-final { scan-assembler-times {\.align\t4} 2 } } */
+/* { dg-final { scan-assembler-times {\.align\t8} 3 } } */
+/* { dg-final { scan-assembler-times {\.align\t16} 2 } } */
+
+#include <atomic>
+
+// 2
+std::atomic var_char;
+std::atomic var_short;
+// 4
+std::atomic var_int;
+// 8
+std::atomic var_long;
+std::atomic var_long_long;
+// 16
+std::atomic<__int128> var_int128;
+// 4
+std::atomic var_float;
+// 8
+std::atomic var_double;
+// 16
+std::atomic var_long_double;
diff --git a/gcc/testsuite/gcc.target/s390/atomic-align-1.c 
b/gcc/testsuite/gcc.target/s390/atomic-align-1.c
new file mode 100644
index 000..b2e1233e3ee
--- /dev/null
+++ b/gcc/testsuite/gcc.target/s390/atomic-align-1.c
@@ -0,0 +1,23 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-std=c11" } */
+/* { dg-final { scan-assembler-times {\.align\t2} 2 } } */
+/* { dg-final { scan-assembler-times {\.align\t4} 2 } } */
+/* { dg-final { scan-assembler-times {\.align\t8} 3 } } */
+/* { dg-final { scan-assembler-times {\.align\t16} 2 } } */
+
+// 2
+_Atomic char var_char;
+_Atomic short var_short;
+// 4
+_Atomic int var_int;
+// 8
+_Atomic long var_long;
+_Atomic long long var_long_long;
+// 16
+_Atomic __int128 var_int128;
+// 4
+_Atomic float var_float;
+// 8
+_Atomic double var_double;
+// 16
+_Atomic long double var_long_double;
diff --git a/gcc/testsuite/gcc.target/s390/atomic-align-2.c 
b/gcc/testsuite/gcc.target/s390/atomic-align-2.c
new file mode 100644
index 000..0bf17341bf8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/s390/atomic-align-2.c
@@ -0,0 +1,18 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O -std=c11" } */
+/* { dg-final { scan-assembler-not {abort} } } */
+
+/* The stack is 8 byte aligned which means GCC has to manually align a 16 byte
+   aligned object.  This is done by allocating not 16 but rather 24 bytes for
+   variable X and then manually aligning a pointer inside the memory block.
+   Validate this by ensuring that the if-statement is optimized out.  */
+
+void bar (_Atomic unsigned __int128 *ptr);
+
+void foo (void) {
+  _Atomic unsigned __int128 x;
+  unsigned long n = (unsigned long)&x;
+  if (n % 16 != 0)
+__builtin_abort ();
+  bar (&x);
+}
-- 
2.39.2



[PATCH 3/3] s390: Refactor block operation setmem

2023-05-15 Thread Stefan Schulze Frielinghaus via Gcc-patches
Vectorize memset with a constant length of less than or equal to 64
bytes.

Do not emit a call to the libc memset in case the size is not a
compile-time constant but is bounded and the upper bound is less than
or equal to 256 bytes.

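To make the two cases concrete, a hypothetical example of code this
affects (assuming vector support, i.e. z13 or newer, and that the
range of N is visible to the expander):

#include <string.h>

/* Constant length <= 64: expanded into vector stores; for 37 bytes
   three vst at offsets 0, 16, and 21, where the last two overlap.  */
void
set37 (char *p)
{
  memset (p, 0xaa, 37);
}

/* Non-constant but bounded length <= 256: expanded inline, no call to
   memset is emitted.  */
void
setn (char *p, unsigned long n)
{
  if (n > 256)
    return;
  memset (p, 0xaa, n);
}
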
gcc/ChangeLog:

* config/s390/s390-protos.h (s390_expand_setmem): Change
function signature.
* config/s390/s390.cc (s390_expand_setmem): For memsets of at most
256 bytes do not perform a libc call.
* config/s390/s390.md: Change expander into a version which
takes 8 operands.

gcc/testsuite/ChangeLog:

* gcc.target/s390/memset-1.c: Test case memset1 makes use of
vst, now.
---
 gcc/config/s390/s390-protos.h|   2 +-
 gcc/config/s390/s390.cc  | 129 +--
 gcc/config/s390/s390.md  |  14 ++-
 gcc/testsuite/gcc.target/s390/memset-1.c |   7 +-
 4 files changed, 132 insertions(+), 20 deletions(-)

diff --git a/gcc/config/s390/s390-protos.h b/gcc/config/s390/s390-protos.h
index 65e4f97b41e..4a5263fccec 100644
--- a/gcc/config/s390/s390-protos.h
+++ b/gcc/config/s390/s390-protos.h
@@ -109,7 +109,7 @@ extern void emit_symbolic_move (rtx *);
 extern void s390_load_address (rtx, rtx);
 extern bool s390_expand_cpymem (rtx, rtx, rtx, rtx, rtx);
 extern bool s390_expand_movmem (rtx, rtx, rtx, rtx, rtx);
-extern void s390_expand_setmem (rtx, rtx, rtx);
+extern void s390_expand_setmem (rtx, rtx, rtx, rtx, rtx);
 extern bool s390_expand_cmpmem (rtx, rtx, rtx, rtx);
 extern void s390_expand_vec_strlen (rtx, rtx, rtx);
 extern void s390_expand_vec_movstr (rtx, rtx, rtx);
diff --git a/gcc/config/s390/s390.cc b/gcc/config/s390/s390.cc
index 553273f23ff..b1cb54612b8 100644
--- a/gcc/config/s390/s390.cc
+++ b/gcc/config/s390/s390.cc
@@ -5910,20 +5910,62 @@ s390_expand_movmem (rtx dst, rtx src, rtx len, rtx 
min_len_rtx, rtx max_len_rtx)
Make use of clrmem if VAL is zero.  */
 
 void
-s390_expand_setmem (rtx dst, rtx len, rtx val)
+s390_expand_setmem (rtx dst, rtx len, rtx val, rtx min_len_rtx, rtx 
max_len_rtx)
 {
-  if (GET_CODE (len) == CONST_INT && INTVAL (len) <= 0)
+  /* Exit early in case nothing has to be done.  */
+  if (CONST_INT_P (len) && UINTVAL (len) == 0)
 return;
 
   gcc_assert (GET_CODE (val) == CONST_INT || GET_MODE (val) == QImode);
 
+  unsigned HOST_WIDE_INT min_len = UINTVAL (min_len_rtx);
+  unsigned HOST_WIDE_INT max_len
+= max_len_rtx ? UINTVAL (max_len_rtx) : HOST_WIDE_INT_M1U;
+
+  /* Vectorize memset with a constant length
+   - if  0 <  LEN <  16, then emit a vstl based solution;
+   - if 16 <= LEN <= 64, then emit a vst based solution
+ where the last two vector stores may overlap in case LEN%16!=0.  Paying
+ the price for an overlap is negligible compared to an extra GPR which is
+ required for vstl.  */
+  if (CONST_INT_P (len) && UINTVAL (len) <= 64 && val != const0_rtx
+  && TARGET_VX)
+{
+  rtx val_vec = gen_reg_rtx (V16QImode);
+  emit_move_insn (val_vec, gen_rtx_VEC_DUPLICATE (V16QImode, val));
+
+  if (UINTVAL (len) < 16)
+   {
+ rtx len_reg = gen_reg_rtx (SImode);
+ emit_move_insn (len_reg, GEN_INT (UINTVAL (len) - 1));
+ emit_insn (gen_vstlv16qi (val_vec, len_reg, dst));
+   }
+  else
+   {
+ unsigned HOST_WIDE_INT l = UINTVAL (len) / 16;
+ unsigned HOST_WIDE_INT r = UINTVAL (len) % 16;
+ unsigned HOST_WIDE_INT o = 0;
+ for (unsigned HOST_WIDE_INT i = 0; i < l; ++i)
+   {
+ rtx newdst = adjust_address (dst, V16QImode, o);
+ emit_move_insn (newdst, val_vec);
+ o += 16;
+   }
+ if (r != 0)
+   {
+ rtx newdst = adjust_address (dst, V16QImode, (o - 16) + r);
+ emit_move_insn (newdst, val_vec);
+   }
+   }
+}
+
   /* Expand setmem/clrmem for a constant length operand without a
  loop if it will be shorter that way.
  clrmem loop (with PFD)is 30 bytes -> 5 * xc
  clrmem loop (without PFD) is 24 bytes -> 4 * xc
  setmem loop (with PFD)is 38 bytes -> ~4 * (mvi/stc + mvc)
  setmem loop (without PFD) is 32 bytes -> ~4 * (mvi/stc + mvc) */
-  if (GET_CODE (len) == CONST_INT
+  else if (GET_CODE (len) == CONST_INT
   && ((val == const0_rtx
   && (INTVAL (len) <= 256 * 4
   || (INTVAL (len) <= 256 * 5 && TARGET_SETMEM_PFD(val,len
@@ -5968,6 +6010,70 @@ s390_expand_setmem (rtx dst, rtx len, rtx val)
   val));
 }
 
+  /* Non-constant length and no loop required.  */
+  else if (!CONST_INT_P (len) && max_len <= 256)
+{
+  rtx_code_label *end_label;
+
+  if (min_len == 0)
+   {
+ end_label = gen_label_rtx ();
+ emit_cmp_and_jump_insns (len, const0_rtx, EQ, NULL_RTX,
+  GET_MODE (len), 1, end_label,
+  

[PATCH 1/3] s390: Refactor block operation cpymem

2023-05-15 Thread Stefan Schulze Frielinghaus via Gcc-patches
Do not emit a call to the libc memcpy in case the size is not a
compile-time constant but is bounded and the upper bound is less than
or equal to 256 bytes.

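A hypothetical caller this covers (the bound must be visible to the
expander, e.g. through value-range information):

#include <string.h>

/* The length is not a compile-time constant but is bounded by 16, so
   with vector support the copy is expanded inline as a vll/vstl pair
   instead of calling memcpy.  */
void
copy_small (char *dst, const char *src, unsigned long n)
{
  if (n > 16)
    return;
  memcpy (dst, src, n);
}
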
gcc/ChangeLog:

* config/s390/s390-protos.h (s390_expand_cpymem): Change
function signature.
* config/s390/s390.cc (s390_expand_cpymem): For memcpys of at most
256 bytes do not perform a libc call.
(s390_expand_insv): Adapt new function signature of
s390_expand_cpymem.
* config/s390/s390.md: Change expander into a version which
takes 8 operands.
---
 gcc/config/s390/s390-protos.h |  2 +-
 gcc/config/s390/s390.cc   | 84 +++
 gcc/config/s390/s390.md   | 10 +++--
 3 files changed, 74 insertions(+), 22 deletions(-)

diff --git a/gcc/config/s390/s390-protos.h b/gcc/config/s390/s390-protos.h
index 67fe09e732d..2c7495ca247 100644
--- a/gcc/config/s390/s390-protos.h
+++ b/gcc/config/s390/s390-protos.h
@@ -107,7 +107,7 @@ extern void s390_reload_symref_address (rtx , rtx , rtx , 
bool);
 extern void s390_expand_plus_operand (rtx, rtx, rtx);
 extern void emit_symbolic_move (rtx *);
 extern void s390_load_address (rtx, rtx);
-extern bool s390_expand_cpymem (rtx, rtx, rtx);
+extern bool s390_expand_cpymem (rtx, rtx, rtx, rtx, rtx);
 extern void s390_expand_setmem (rtx, rtx, rtx);
 extern bool s390_expand_cmpmem (rtx, rtx, rtx, rtx);
 extern void s390_expand_vec_strlen (rtx, rtx, rtx);
diff --git a/gcc/config/s390/s390.cc b/gcc/config/s390/s390.cc
index 505de995da8..95ea5e8d009 100644
--- a/gcc/config/s390/s390.cc
+++ b/gcc/config/s390/s390.cc
@@ -5650,27 +5650,27 @@ legitimize_reload_address (rtx ad, machine_mode mode 
ATTRIBUTE_UNUSED,
   return NULL_RTX;
 }
 
-/* Emit code to move LEN bytes from DST to SRC.  */
+/* Emit code to move LEN bytes from SRC to DST.  */
 
 bool
-s390_expand_cpymem (rtx dst, rtx src, rtx len)
+s390_expand_cpymem (rtx dst, rtx src, rtx len, rtx min_len_rtx, rtx 
max_len_rtx)
 {
-  /* When tuning for z10 or higher we rely on the Glibc functions to
- do the right thing. Only for constant lengths below 64k we will
- generate inline code.  */
-  if (s390_tune >= PROCESSOR_2097_Z10
-  && (GET_CODE (len) != CONST_INT || INTVAL (len) > (1<<16)))
-return false;
+  /* Exit early in case nothing has to be done.  */
+  if (CONST_INT_P (len) && UINTVAL (len) == 0)
+return true;
+
+  unsigned HOST_WIDE_INT min_len = UINTVAL (min_len_rtx);
+  unsigned HOST_WIDE_INT max_len
+= max_len_rtx ? UINTVAL (max_len_rtx) : HOST_WIDE_INT_M1U;
 
   /* Expand memcpy for constant length operands without a loop if it
  is shorter that way.
 
  With a constant length argument a
  memcpy loop (without pfd) is 36 bytes -> 6 * mvc  */
-  if (GET_CODE (len) == CONST_INT
-  && INTVAL (len) >= 0
-  && INTVAL (len) <= 256 * 6
-  && (!TARGET_MVCLE || INTVAL (len) <= 256))
+  if (CONST_INT_P (len)
+  && UINTVAL (len) <= 6 * 256
+  && (!TARGET_MVCLE || UINTVAL (len) <= 256))
 {
   HOST_WIDE_INT o, l;
 
@@ -5681,14 +5681,57 @@ s390_expand_cpymem (rtx dst, rtx src, rtx len)
  emit_insn (gen_cpymem_short (newdst, newsrc,
   GEN_INT (l > 256 ? 255 : l - 1)));
}
+
+  return true;
 }
 
-  else if (TARGET_MVCLE)
+  else if (TARGET_MVCLE
+  && (s390_tune < PROCESSOR_2097_Z10
+  || (CONST_INT_P (len) && UINTVAL (len) <= (1 << 16
 {
   emit_insn (gen_cpymem_long (dst, src, convert_to_mode (Pmode, len, 1)));
+  return true;
 }
 
-  else
+  /* Non-constant length and no loop required.  */
+  else if (!CONST_INT_P (len) && max_len <= 256)
+{
+  rtx_code_label *end_label;
+
+  if (min_len == 0)
+   {
+ end_label = gen_label_rtx ();
+ emit_cmp_and_jump_insns (len, const0_rtx, EQ, NULL_RTX,
+  GET_MODE (len), 1, end_label,
+  profile_probability::very_unlikely ());
+   }
+
+  rtx lenm1 = expand_binop (GET_MODE (len), add_optab, len, constm1_rtx,
+   NULL_RTX, 1, OPTAB_DIRECT);
+
+  /* Prefer a vectorized implementation over one which makes use of an
+execute instruction since it is faster (although it increases register
+pressure).  */
+  if (max_len <= 16 && TARGET_VX)
+   {
+ rtx tmp = gen_reg_rtx (V16QImode);
+ lenm1 = convert_to_mode (SImode, lenm1, 1);
+ emit_insn (gen_vllv16qi (tmp, lenm1, src));
+ emit_insn (gen_vstlv16qi (tmp, lenm1, dst));
+   }
+  else if (TARGET_Z15)
+   emit_insn (gen_mvcrl (dst, src, convert_to_mode (SImode, lenm1, 1)));
+  else
+   emit_insn (
+ gen_cpymem_short (dst, src, convert_to_mode (Pmode, lenm1, 1)));
+
+  if (min_len == 0)
+   emit_label (end_label);
+
+  return true;
+}
+
+  else if (s390_tune < PROCESSOR_2097_Z10 || (CONST_INT_P (len) && 

[PATCH 2/3] s390: Add block operation movmem

2023-05-15 Thread Stefan Schulze Frielinghaus via Gcc-patches
gcc/ChangeLog:

* config/s390/s390-protos.h (s390_expand_movmem): New.
* config/s390/s390.cc (s390_expand_movmem): New.
* config/s390/s390.md (movmem): New.
(*mvcrl): New.
(mvcrl): New.
---
 gcc/config/s390/s390-protos.h |  1 +
 gcc/config/s390/s390.cc   | 88 +++
 gcc/config/s390/s390.md   | 35 ++
 3 files changed, 124 insertions(+)

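Since movmem implements memmove semantics, here is a hypothetical
caller exercising the overlap handling added below:

#include <string.h>

/* Overlapping copy with a small known bound: for max_len <= 16 a
   vll/vstl pair is emitted, which is safe for overlap because the
   whole load precedes the store; for max_len <= 256 on z15 the
   expander compares the addresses at run time and picks MVC or MVCRL
   accordingly.  */
void
shift_down (char *p, unsigned long n)
{
  if (n > 16)
    return;
  memmove (p, p + 1, n);
}
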
diff --git a/gcc/config/s390/s390-protos.h b/gcc/config/s390/s390-protos.h
index 2c7495ca247..65e4f97b41e 100644
--- a/gcc/config/s390/s390-protos.h
+++ b/gcc/config/s390/s390-protos.h
@@ -108,6 +108,7 @@ extern void s390_expand_plus_operand (rtx, rtx, rtx);
 extern void emit_symbolic_move (rtx *);
 extern void s390_load_address (rtx, rtx);
 extern bool s390_expand_cpymem (rtx, rtx, rtx, rtx, rtx);
+extern bool s390_expand_movmem (rtx, rtx, rtx, rtx, rtx);
 extern void s390_expand_setmem (rtx, rtx, rtx);
 extern bool s390_expand_cmpmem (rtx, rtx, rtx, rtx);
 extern void s390_expand_vec_strlen (rtx, rtx, rtx);
diff --git a/gcc/config/s390/s390.cc b/gcc/config/s390/s390.cc
index 95ea5e8d009..553273f23ff 100644
--- a/gcc/config/s390/s390.cc
+++ b/gcc/config/s390/s390.cc
@@ -5818,6 +5818,94 @@ s390_expand_cpymem (rtx dst, rtx src, rtx len, rtx 
min_len_rtx, rtx max_len_rtx)
   return false;
 }
 
+bool
+s390_expand_movmem (rtx dst, rtx src, rtx len, rtx min_len_rtx, rtx 
max_len_rtx)
+{
+  /* Exit early in case nothing has to be done.  */
+  if (CONST_INT_P (len) && UINTVAL (len) == 0)
+return true;
+  /* Exit early in case length is not upper bounded.  */
+  else if (max_len_rtx == NULL)
+return false;
+
+  unsigned HOST_WIDE_INT min_len = UINTVAL (min_len_rtx);
+  unsigned HOST_WIDE_INT max_len = UINTVAL (max_len_rtx);
+
+  /* At most 16 bytes.  */
+  if (max_len <= 16 && TARGET_VX)
+{
+  rtx_code_label *end_label;
+
+  if (min_len == 0)
+   {
+ end_label = gen_label_rtx ();
+ emit_cmp_and_jump_insns (len, const0_rtx, EQ, NULL_RTX,
+  GET_MODE (len), 1, end_label,
+  profile_probability::very_unlikely ());
+   }
+
+  rtx lenm1;
+  if (CONST_INT_P (len))
+   {
+ lenm1 = gen_reg_rtx (SImode);
+ emit_move_insn (lenm1, GEN_INT (UINTVAL (len) - 1));
+   }
+  else
+   lenm1
+ = expand_binop (SImode, add_optab, convert_to_mode (SImode, len, 1),
+ constm1_rtx, NULL_RTX, 1, OPTAB_DIRECT);
+
+  rtx tmp = gen_reg_rtx (V16QImode);
+  emit_insn (gen_vllv16qi (tmp, lenm1, src));
+  emit_insn (gen_vstlv16qi (tmp, lenm1, dst));
+
+  if (min_len == 0)
+   emit_label (end_label);
+
+  return true;
+}
+
+  /* At most 256 bytes.  */
+  else if (max_len <= 256 && TARGET_Z15)
+{
+  rtx_code_label *end_label = gen_label_rtx ();
+
+  if (min_len == 0)
+   emit_cmp_and_jump_insns (len, const0_rtx, EQ, NULL_RTX, GET_MODE (len),
+1, end_label,
+profile_probability::very_unlikely ());
+
+  rtx dst_addr = gen_reg_rtx (Pmode);
+  rtx src_addr = gen_reg_rtx (Pmode);
+  emit_move_insn (dst_addr, force_operand (XEXP (dst, 0), NULL_RTX));
+  emit_move_insn (src_addr, force_operand (XEXP (src, 0), NULL_RTX));
+
+  rtx lenm1 = CONST_INT_P (len)
+   ? GEN_INT (UINTVAL (len) - 1)
+   : expand_binop (GET_MODE (len), add_optab, len, constm1_rtx,
+   NULL_RTX, 1, OPTAB_DIRECT);
+
+  rtx_code_label *right_to_left_label = gen_label_rtx ();
+  emit_cmp_and_jump_insns (src_addr, dst_addr, LT, NULL_RTX, GET_MODE 
(len),
+  1, right_to_left_label);
+
+  // MVC
+  emit_insn (
+   gen_cpymem_short (dst, src, convert_to_mode (Pmode, lenm1, 1)));
+  emit_jump (end_label);
+
+  // MVCRL
+  emit_label (right_to_left_label);
+  emit_insn (gen_mvcrl (dst, src, convert_to_mode (SImode, lenm1, 1)));
+
+  emit_label (end_label);
+
+  return true;
+}
+
+  return false;
+}
+
 /* Emit code to set LEN bytes at DST to VAL.
Make use of clrmem if VAL is zero.  */
 
diff --git a/gcc/config/s390/s390.md b/gcc/config/s390/s390.md
index d9ce287ab85..abe3bbc5cd9 100644
--- a/gcc/config/s390/s390.md
+++ b/gcc/config/s390/s390.md
@@ -61,6 +61,7 @@
UNSPEC_ROUND
UNSPEC_ICM
UNSPEC_TIE
+   UNSPEC_MVCRL
 
; Convert CC into a str comparison result and copy it into an
; integer register
@@ -3496,6 +3497,40 @@
   [(set_attr "length" "8")
(set_attr "type" "vs")])
 
+(define_expand "movmem<mode>"
+  [(set (match_operand:BLK 0 "memory_operand")   ; destination
+(match_operand:BLK 1 "memory_operand"))  ; source
+   (use (match_operand:GPR 2 "general_operand")) ; size
+   (match_operand 3 "")  ; align
+   (match_operand 4 "")  ; expected align
+   (match_operand 5 "")  ; expected size
+   (match_operand 

[PATCH 0/3] Refactor memory block operations

2023-05-15 Thread Stefan Schulze Frielinghaus via Gcc-patches
Bootstrapped and regtested.  Ok for mainline?

Stefan Schulze Frielinghaus (3):
  s390: Refactor block operation cpymem
  s390: Add block operation movmem
  s390: Refactor block operation setmem

 gcc/config/s390/s390-protos.h|   5 +-
 gcc/config/s390/s390.cc  | 301 ---
 gcc/config/s390/s390.md  |  61 -
 gcc/testsuite/gcc.target/s390/memset-1.c |   7 +-
 4 files changed, 331 insertions(+), 43 deletions(-)

-- 
2.39.2



[PATCH] s390: libatomic: Fix 16 byte atomic {cas,load,store}

2023-03-02 Thread Stefan Schulze Frielinghaus via Gcc-patches
This is a follow-up to commit a4c6bd0821099f6b8c0f64a96ffd9d01a025c413
introducing a runtime check for alignment for 16 byte atomic
compare-exchange, load, and store.

Bootstrapped and regtested on s390.
Ok for mainline and gcc-{12,11,10}?

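For context, a sketch of a call that may end up in these new routines
(hypothetical example):

#include <stdatomic.h>
#include <stdbool.h>

/* A 16 byte compare-exchange may be routed to libatomic.  With this
   patch the library checks the alignment at run time: a 16 byte
   aligned object uses the native instruction, anything else falls
   back to the protected load/store sequence.  */
bool
cas16 (_Atomic __int128 *p, __int128 *expected, __int128 desired)
{
  return atomic_compare_exchange_strong (p, expected, desired);
}
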
libatomic/ChangeLog:

* config/s390/cas_n.c: New file.
* config/s390/load_n.c: New file.
* config/s390/store_n.c: New file.
---
 libatomic/config/s390/cas_n.c   | 65 +
 libatomic/config/s390/load_n.c  | 57 +
 libatomic/config/s390/store_n.c | 54 +++
 3 files changed, 176 insertions(+)
 create mode 100644 libatomic/config/s390/cas_n.c
 create mode 100644 libatomic/config/s390/load_n.c
 create mode 100644 libatomic/config/s390/store_n.c

diff --git a/libatomic/config/s390/cas_n.c b/libatomic/config/s390/cas_n.c
new file mode 100644
index 000..44b7152ca5d
--- /dev/null
+++ b/libatomic/config/s390/cas_n.c
@@ -0,0 +1,65 @@
+/* Copyright (C) 2018-2023 Free Software Foundation, Inc.
+
+   This file is part of the GNU Atomic Library (libatomic).
+
+   Libatomic is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   Libatomic is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <libatomic_i.h>
+
+
+/* Analog to config/s390/exch_n.c.  */
+
+#if !DONE && N == 16
+bool
+SIZE(libat_compare_exchange) (UTYPE *mptr, UTYPE *eptr, UTYPE newval,
+ int smodel, int fmodel UNUSED)
+{
+  if (!((uintptr_t)mptr & 0xf))
+{
+  return __atomic_compare_exchange_n (
+   (UTYPE *)__builtin_assume_aligned (mptr, 16), eptr, newval, false,
+   __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);
+}
+  else
+{
+  UTYPE oldval;
+  UWORD magic;
+  bool ret;
+
+  pre_seq_barrier (smodel);
+  magic = protect_start (mptr);
+
+  oldval = *mptr;
+  ret = (oldval == *eptr);
+  if (ret)
+   *mptr = newval;
+  else
+   *eptr = oldval;
+
+  protect_end (mptr, magic);
+  post_seq_barrier (smodel);
+
+  return ret;
+}
+}
+#define DONE 1
+#endif /* N == 16 */
+
+#include "../../cas_n.c"
diff --git a/libatomic/config/s390/load_n.c b/libatomic/config/s390/load_n.c
new file mode 100644
index 000..335d2f8b2c3
--- /dev/null
+++ b/libatomic/config/s390/load_n.c
@@ -0,0 +1,57 @@
+/* Copyright (C) 2018-2023 Free Software Foundation, Inc.
+
+   This file is part of the GNU Atomic Library (libatomic).
+
+   Libatomic is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   Libatomic is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <libatomic_i.h>
+
+
+/* Analog to config/s390/exch_n.c.  */
+
+#if !DONE && N == 16
+UTYPE
+SIZE(libat_load) (UTYPE *mptr, int smodel)
+{
+  if (!((uintptr_t)mptr & 0xf))
+{
+  return __atomic_load_n ((UTYPE *)__builtin_assume_aligned (mptr, 16),
+ __ATOMIC_SEQ_CST);
+}
+  else
+{
+  UTYPE ret;
+  UWORD magic;
+
+  pre_seq_barrier (smodel);
+  magic = protect_start (mptr);
+
+  ret = *mptr;
+
+  protect_end (mptr, magic);
+  post_seq_barrier (smodel);
+
+  return ret;
+}
+}
+#define DONE 1
+#endif /* N == 16 */
+
+#include "../../load_n.c"
diff --git a/libatomic/config/s390/store_n.c b/libatomic/config/s390/store_n.c
new file mode 100644
index 000..9e5b2b8213d
--- /dev/null
+++ b/libatomic/config/s390/store_n.c

[PATCH] IBM zSystems: Fix predicate execute_operation

2023-02-11 Thread Stefan Schulze Frielinghaus via Gcc-patches
Use constrain_operands in order to check whether there exists a valid
alternative instead of extract_constrain_insn which ICEs in case no
alternative is found.

Bootstrapped and regtested on IBM zSystems.  Ok for mainline?

gcc/ChangeLog:

* config/s390/predicates.md (execute_operation): Use
constrain_operands instead of extract_constrain_insn in order to
determine whether there exists a valid alternative.
---
 gcc/config/s390/predicates.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/config/s390/predicates.md b/gcc/config/s390/predicates.md
index 404e8d87b63..d5d5a7cc0d3 100644
--- a/gcc/config/s390/predicates.md
+++ b/gcc/config/s390/predicates.md
@@ -479,9 +479,9 @@
   if (icode < 0)
 return false;
 
-  extract_constrain_insn (insn);
+  extract_insn (insn);
 
-  return which_alternative >= 0;
+  return constrain_operands (reload_completed, get_enabled_alternatives 
(insn)) == 1;
 })
 
 ;; Return true if OP is a store multiple operation.  It is known to be a
-- 
2.39.1



[PATCH] IBM zSystems: Do not propagate scheduler state across basic blocks [PR108102]

2023-02-11 Thread Stefan Schulze Frielinghaus via Gcc-patches
So far we propagate scheduler state across basic blocks within EBBs and
reset the state otherwise.  In certain circumstances the entry block of
an EBB might be empty, i.e., no_real_insns_p is true.  In those cases
scheduler state is not reset and subsequently wrong state is propagated
to following blocks of the same EBB.

Since the performance benefit of tracking state across basic blocks is
questionable on modern hardware, simply reset the state for each basic
block.

Also fix resetting f{p,x}d_longrunning.

Bootstrapped and regtested on IBM zSystems.  Ok for mainline?

gcc/ChangeLog:

* config/s390/s390.cc (s390_bb_fallthru_entry_likely): Remove.
(struct s390_sched_state): Initialise to zero.
(s390_sched_variable_issue): For better debuggability also emit
the current side.
(s390_sched_init): Unconditionally reset scheduler state.
---
 gcc/config/s390/s390.cc | 42 +++--
 1 file changed, 7 insertions(+), 35 deletions(-)

diff --git a/gcc/config/s390/s390.cc b/gcc/config/s390/s390.cc
index a9bb610385b..9317f33e9c9 100644
--- a/gcc/config/s390/s390.cc
+++ b/gcc/config/s390/s390.cc
@@ -14872,29 +14872,6 @@ s390_z10_prevent_earlyload_conflicts (rtx_insn 
**ready, int *nready_p)
   ready[0] = tmp;
 }
 
-/* Returns TRUE if BB is entered via a fallthru edge and all other
-   incoming edges are less than likely.  */
-static bool
-s390_bb_fallthru_entry_likely (basic_block bb)
-{
-  edge e, fallthru_edge;
-  edge_iterator ei;
-
-  if (!bb)
-return false;
-
-  fallthru_edge = find_fallthru_edge (bb->preds);
-  if (!fallthru_edge)
-return false;
-
-  FOR_EACH_EDGE (e, ei, bb->preds)
-if (e != fallthru_edge
-   && e->probability >= profile_probability::likely ())
-  return false;
-
-  return true;
-}
-
 struct s390_sched_state
 {
   /* Number of insns in the group.  */
@@ -14905,7 +14882,7 @@ struct s390_sched_state
   bool group_of_two;
 } s390_sched_state;
 
-static struct s390_sched_state sched_state = {0, 1, false};
+static struct s390_sched_state sched_state;
 
 #define S390_SCHED_ATTR_MASK_CRACKED0x1
 #define S390_SCHED_ATTR_MASK_EXPANDED   0x2
@@ -15405,7 +15382,7 @@ s390_sched_variable_issue (FILE *file, int verbose, 
rtx_insn *insn, int more)
 
 s390_get_unit_mask (insn, &units);
 
- fprintf (file, ";;\t\tBACKEND: units on this side unused for: ");
+ fprintf (file, ";;\t\tBACKEND: units on this side (%d) unused 
for: ", sched_state.side);
  for (j = 0; j < units; j++)
fprintf (file, "%d:%d ", j,
last_scheduled_unit_distance[j][sched_state.side]);
@@ -15443,17 +15420,12 @@ s390_sched_init (FILE *file ATTRIBUTE_UNUSED,
  current_sched_info->prev_head is the insn before the first insn of the
  block of insns to be scheduled.
  */
-  rtx_insn *insn = current_sched_info->prev_head
-? NEXT_INSN (current_sched_info->prev_head) : NULL;
-  basic_block bb = insn ? BLOCK_FOR_INSN (insn) : NULL;
-  if (s390_tune < PROCESSOR_2964_Z13 || !s390_bb_fallthru_entry_likely (bb))
-{
-  last_scheduled_insn = NULL;
-  memset (last_scheduled_unit_distance, 0,
+  last_scheduled_insn = NULL;
+  memset (last_scheduled_unit_distance, 0,
  MAX_SCHED_UNITS * NUM_SIDES * sizeof (int));
-  sched_state.group_state = 0;
-  sched_state.group_of_two = false;
-}
+  memset (fpd_longrunning, 0, NUM_SIDES * sizeof (int));
+  memset (fxd_longrunning, 0, NUM_SIDES * sizeof (int));
+  sched_state = {};
 }
 
 /* This target hook implementation for TARGET_LOOP_UNROLL_ADJUST calculates
-- 
2.39.1



[PATCH v2] IBM zSystems: Fix TARGET_D_CPU_VERSIONS

2023-01-24 Thread Stefan Schulze Frielinghaus via Gcc-patches
In the context of D the interpretation of S390, S390X, and SystemZ is a
bit fuzzy.  The wording S390X was wrongly deprecated in favour of
SystemZ by commit
https://github.com/dlang/dlang.org/commit/3b50a4c3faf01c32234d0ef8be5f82915a61c23f
Thus, SystemZ is used for 64-bit targets, now, and S390 for 31-bit
targets.  However, in TARGET_D_CPU_VERSIONS depending on TARGET_ZARCH we
set the CPU version to SystemZ.  This is also the case if compiled for
31-bit targets leading to the following error:

libphobos/libdruntime/core/sys/posix/sys/stat.d:967:13: error: static assert:  
'96u == 144u' is false
  967 | static assert(stat_t.sizeof == 144);
  | ^

Thus in order to keep this patch simple I went for keeping SystemZ for
64-bit targets and S390, as usual, for 31-bit targets and dropped the
distinction between ESA and z/Architecture.

Bootstrapped and regtested on IBM zSystems.  Ok for mainline?

gcc/ChangeLog:

* config/s390/s390-d.cc (s390_d_target_versions): Fix detection
of CPU version.
---
 gcc/config/s390/s390-d.cc | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/gcc/config/s390/s390-d.cc b/gcc/config/s390/s390-d.cc
index d10b45f7de4..6e9c80f7283 100644
--- a/gcc/config/s390/s390-d.cc
+++ b/gcc/config/s390/s390-d.cc
@@ -30,10 +30,11 @@ along with GCC; see the file COPYING3.  If not see
 void
 s390_d_target_versions (void)
 {
-  if (TARGET_ZARCH)
-d_add_builtin_version ("SystemZ");
-  else if (TARGET_64BIT)
-d_add_builtin_version ("S390X");
+  if (TARGET_64BIT)
+{
+  d_add_builtin_version ("S390X");
+  d_add_builtin_version ("SystemZ");
+}
   else
 d_add_builtin_version ("S390");
 
-- 
2.39.0



Re: [PATCH] IBM zSystems: Fix TARGET_D_CPU_VERSIONS

2023-01-23 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Mon, Jan 23, 2023 at 02:21:46PM +0100, Iain Buclaw wrote:
> Excerpts from Stefan Schulze Frielinghaus via Gcc-patches's message of January 
> 13, 2023 6:54 pm:
> > In the context of D the interpretation of S390, S390X, and SystemZ is a
> > bit fuzzy.  The wording S390X was wrongly deprecated in favour of
> > SystemZ by commit
> > https://github.com/dlang/dlang.org/commit/3b50a4c3faf01c32234d0ef8be5f82915a61c23f
> > Thus, SystemZ is used for 64-bit targets, now, and S390 for 31-bit
> > targets.  However, in TARGET_D_CPU_VERSIONS depending on TARGET_ZARCH we
> > set the CPU version to SystemZ.  This is also the case if compiled for
> > 31-bit targets leading to the following error:
> > 
> > libphobos/libdruntime/core/sys/posix/sys/stat.d:967:13: error: static 
> > assert:  '96u == 144u' is false
> >   967 | static assert(stat_t.sizeof == 144);
> >   | ^
> > 
> 
> So that I follow, there are three possible combinations?
> 
> ESA 31-bit (S390)
> ESA 64-bit (what was S390X)
> z/Arch 64-bit (SystemZ)

There are three combinations:

- s390:  32-bit ABI and ESA mode
- s390:  32-bit ABI and z/Architecture mode
- s390x: 64-bit ABI and z/Architecture mode

Note, depending on the CPU mode z/Architecture is supported by the
32- and 64-bit ABI whereas ESA is only supported by the 32-bit ABI.

Thus, s390 always refers to the 32-bit ABI but does not fix the
instructions set architecture (ESA or z/Architecture).  Whereas s390x
refers to the 64-bit ABI for which only z/Architecture exists.

While nitpicking, typically the target is written in lower case letters,
i.e., not S390X but s390x and likewise s390 instead of S390.

Hope this clarifies the set of possible combinations.  Let me know if
anything else is unclear.

> 
> > Thus in order to keep this patch simple I went for keeping SystemZ for
> > 64-bit targets and S390, as usual, for 31-bit targets and dropped the
> > distinction between ESA and z/Architecture.
> > 
> > Bootstrapped and regtested on IBM zSystems.  Ok for mainline?
> > 
> 
> OK by me.  Maybe keep both S390X and SystemZ for TARGET_64BIT? There's
> only ever been a binary distinction as far as I'm aware.

Sounds good to me.  I will come up with an updated patch.

Cheers,
Stefan


[PATCH] IBM zSystems: Fix TARGET_D_CPU_VERSIONS

2023-01-13 Thread Stefan Schulze Frielinghaus via Gcc-patches
In the context of D the interpretation of S390, S390X, and SystemZ is a
bit fuzzy.  The wording S390X was wrongly deprecated in favour of
SystemZ by commit
https://github.com/dlang/dlang.org/commit/3b50a4c3faf01c32234d0ef8be5f82915a61c23f
Thus, SystemZ is used for 64-bit targets, now, and S390 for 31-bit
targets.  However, in TARGET_D_CPU_VERSIONS depending on TARGET_ZARCH we
set the CPU version to SystemZ.  This is also the case if compiled for
31-bit targets leading to the following error:

libphobos/libdruntime/core/sys/posix/sys/stat.d:967:13: error: static assert:  
'96u == 144u' is false
  967 | static assert(stat_t.sizeof == 144);
  | ^

Thus in order to keep this patch simple I went for keeping SystemZ for
64-bit targets and S390, as usual, for 31-bit targets and dropped the
distinction between ESA and z/Architecture.

Bootstrapped and regtested on IBM zSystems.  Ok for mainline?

gcc/ChangeLog:

* config/s390/s390-d.cc (s390_d_target_versions): Fix detection
of CPU version.
---
 gcc/config/s390/s390-d.cc | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/gcc/config/s390/s390-d.cc b/gcc/config/s390/s390-d.cc
index d10b45f7de4..ced7f49a988 100644
--- a/gcc/config/s390/s390-d.cc
+++ b/gcc/config/s390/s390-d.cc
@@ -30,10 +30,8 @@ along with GCC; see the file COPYING3.  If not see
 void
 s390_d_target_versions (void)
 {
-  if (TARGET_ZARCH)
+  if (TARGET_64BIT)
 d_add_builtin_version ("SystemZ");
-  else if (TARGET_64BIT)
-d_add_builtin_version ("S390X");
   else
 d_add_builtin_version ("S390");
 
-- 
2.39.0



Re: [PATCH] cselib: Skip BImode while keeping track of subvalue relations [PR107088]

2022-10-06 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Wed, Oct 05, 2022 at 08:48:13PM -0600, Jeff Law via Gcc-patches wrote:
> 
> On 10/4/22 05:28, Stefan Schulze Frielinghaus via Gcc-patches wrote:
> > For BImode get_narrowest_mode evaluates to QImode but BImode < QImode.
> > Thus FOR_EACH_MODE_UNTIL never reaches BImode and iterates until OImode
> > for which no wider mode exists so we end up with VOIDmode and fail.
> > Fixed by adding a size guard so we effectively skip BImode.
> > 
> > Bootstrap and regtest are currently running on x64.  Assuming they pass
> > ok for mainline?
> > 
> > gcc/ChangeLog:
> > 
> > PR rtl-optimization/107088
> > * cselib.cc (new_cselib_val): Skip BImode while keeping track of
> > subvalue relations.
> 
> OK.  And FWIW, this fixes the various failures I saw in my tester due to the
> cselib patches.

Thanks for testing, too!  Out of curiosity which target is your tester?
I gave it a try on x64 and AArch64 for which bootstrap went fine and
regtest showed no difference, and of course, for s390x regtest went for
the better.


[PATCH] cselib: Skip BImode while keeping track of subvalue relations [PR107088]

2022-10-04 Thread Stefan Schulze Frielinghaus via Gcc-patches
For BImode get_narrowest_mode evaluates to QImode but BImode < QImode.
Thus FOR_EACH_MODE_UNTIL never reaches BImode and iterates until OImode
for which no wider mode exists so we end up with VOIDmode and fail.
Fixed by adding a size guard so we effectively skip BImode.

Bootstrap and regtest are currently running on x64.  Assuming they pass
ok for mainline?

gcc/ChangeLog:

PR rtl-optimization/107088
* cselib.cc (new_cselib_val): Skip BImode while keeping track of
subvalue relations.
---
 gcc/cselib.cc | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/cselib.cc b/gcc/cselib.cc
index 9b582e5d3d6..2abc763a3f8 100644
--- a/gcc/cselib.cc
+++ b/gcc/cselib.cc
@@ -1571,6 +1571,7 @@ new_cselib_val (unsigned int hash, machine_mode mode, rtx 
x)
 
   scalar_int_mode int_mode;
  if (REG_P (x) && is_int_mode (mode, &int_mode)
+  && GET_MODE_SIZE (int_mode) > 1
   && REG_VALUES (REGNO (x)) != NULL
   && (!cselib_current_insn || !DEBUG_INSN_P (cselib_current_insn)))
 {
-- 
2.37.3



Re: [PATCH 2/2] var-tracking: Add entry values up to max register mode

2022-09-26 Thread Stefan Schulze Frielinghaus via Gcc-patches
Ping.

On Wed, Sep 07, 2022 at 04:20:26PM +0200, Stefan Schulze Frielinghaus wrote:
> For parameter of type integer which do not consume a whole register
> (modulo sign/zero extension) this patch adds entry values up to maximal
> register mode.
> 
> gcc/ChangeLog:
> 
>   * var-tracking.cc (vt_add_function_parameter): Add entry values
>   up to maximal register mode.
> ---
>  gcc/var-tracking.cc | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/gcc/var-tracking.cc b/gcc/var-tracking.cc
> index 235981d100f..9c40ec4fb8b 100644
> --- a/gcc/var-tracking.cc
> +++ b/gcc/var-tracking.cc
> @@ -9906,6 +9906,23 @@ vt_add_function_parameter (tree parm)
>VAR_INIT_STATUS_INITIALIZED, NULL, INSERT);
>   }
>   }
> +
> +   if (GET_MODE_CLASS (mode) == MODE_INT)
> + {
> +   machine_mode wider_mode_iter;
> +   FOR_EACH_WIDER_MODE (wider_mode_iter, mode)
> + {
> +   if (!HWI_COMPUTABLE_MODE_P (wider_mode_iter))
> + break;
> +   rtx wider_reg
> + = gen_rtx_REG (wider_mode_iter, REGNO (incoming));
> +   cselib_val *wider_val
> + = cselib_lookup_from_insn (wider_reg, wider_mode_iter, 1,
> +VOIDmode, get_insns ());
> +   preserve_value (wider_val);
> +   record_entry_value (wider_val, wider_reg);
> + }
> + }
>   }
>  }
>else if (GET_CODE (incoming) == PARALLEL && !dv_onepart_p (dv))
> -- 
> 2.37.2
> 


Re: [PATCH 1/2] cselib: Keep track of further subvalue relations

2022-09-26 Thread Stefan Schulze Frielinghaus via Gcc-patches
Ping.

On Wed, Sep 07, 2022 at 04:20:25PM +0200, Stefan Schulze Frielinghaus wrote:
> Whenever a new cselib value is created check whether a smaller value
> exists which is contained in the bigger one.  If so add a subreg
> relation to locs of the smaller one.
> 
> gcc/ChangeLog:
> 
>   * cselib.cc (new_cselib_val): Keep track of further subvalue
>   relations.
> ---
>  gcc/cselib.cc | 20 
>  1 file changed, 20 insertions(+)
> 
> diff --git a/gcc/cselib.cc b/gcc/cselib.cc
> index 6a5609786fa..9b582e5d3d6 100644
> --- a/gcc/cselib.cc
> +++ b/gcc/cselib.cc
> @@ -1569,6 +1569,26 @@ new_cselib_val (unsigned int hash, machine_mode mode, 
> rtx x)
>e->locs = 0;
>e->next_containing_mem = 0;
>  
> +  scalar_int_mode int_mode;
> +  if (REG_P (x) && is_int_mode (mode, &int_mode)
> +  && REG_VALUES (REGNO (x)) != NULL
> +  && (!cselib_current_insn || !DEBUG_INSN_P (cselib_current_insn)))
> +{
> +  rtx copy = shallow_copy_rtx (x);
> +  scalar_int_mode narrow_mode_iter;
> +  FOR_EACH_MODE_UNTIL (narrow_mode_iter, int_mode)
> + {
> +   PUT_MODE_RAW (copy, narrow_mode_iter);
> +   cselib_val *v = cselib_lookup (copy, narrow_mode_iter, 0, VOIDmode);
> +   if (v)
> + {
> +   rtx sub = lowpart_subreg (narrow_mode_iter, e->val_rtx, int_mode);
> +   if (sub)
> + new_elt_loc_list (v, sub);
> + }
> + }
> +}
> +
>if (dump_file && (dump_flags & TDF_CSELIB))
>  {
>fprintf (dump_file, "cselib value %u:%u ", e->uid, hash);
> -- 
> 2.37.2
> 


[PATCH 2/2] var-tracking: Add entry values up to max register mode

2022-09-07 Thread Stefan Schulze Frielinghaus via Gcc-patches
For parameter of type integer which do not consume a whole register
(modulo sign/zero extension) this patch adds entry values up to maximal
register mode.

gcc/ChangeLog:

* var-tracking.cc (vt_add_function_parameter): Add entry values
up to maximal register mode.
---
 gcc/var-tracking.cc | 17 +
 1 file changed, 17 insertions(+)

diff --git a/gcc/var-tracking.cc b/gcc/var-tracking.cc
index 235981d100f..9c40ec4fb8b 100644
--- a/gcc/var-tracking.cc
+++ b/gcc/var-tracking.cc
@@ -9906,6 +9906,23 @@ vt_add_function_parameter (tree parm)
 VAR_INIT_STATUS_INITIALIZED, NULL, INSERT);
}
}
+
+ if (GET_MODE_CLASS (mode) == MODE_INT)
+   {
+ machine_mode wider_mode_iter;
+ FOR_EACH_WIDER_MODE (wider_mode_iter, mode)
+   {
+ if (!HWI_COMPUTABLE_MODE_P (wider_mode_iter))
+   break;
+ rtx wider_reg
+   = gen_rtx_REG (wider_mode_iter, REGNO (incoming));
+ cselib_val *wider_val
+   = cselib_lookup_from_insn (wider_reg, wider_mode_iter, 1,
+  VOIDmode, get_insns ());
+ preserve_value (wider_val);
+ record_entry_value (wider_val, wider_reg);
+   }
+   }
}
 }
   else if (GET_CODE (incoming) == PARALLEL && !dv_onepart_p (dv))
-- 
2.37.2



[PATCH 1/2] cselib: Keep track of further subvalue relations

2022-09-07 Thread Stefan Schulze Frielinghaus via Gcc-patches
Whenever a new cselib value is created check whether a smaller value
exists which is contained in the bigger one.  If so add a subreg
relation to locs of the smaller one.

gcc/ChangeLog:

* cselib.cc (new_cselib_val): Keep track of further subvalue
relations.
---
 gcc/cselib.cc | 20 
 1 file changed, 20 insertions(+)

diff --git a/gcc/cselib.cc b/gcc/cselib.cc
index 6a5609786fa..9b582e5d3d6 100644
--- a/gcc/cselib.cc
+++ b/gcc/cselib.cc
@@ -1569,6 +1569,26 @@ new_cselib_val (unsigned int hash, machine_mode mode, 
rtx x)
   e->locs = 0;
   e->next_containing_mem = 0;
 
+  scalar_int_mode int_mode;
+  if (REG_P (x) && is_int_mode (mode, &int_mode)
+  && REG_VALUES (REGNO (x)) != NULL
+  && (!cselib_current_insn || !DEBUG_INSN_P (cselib_current_insn)))
+{
+  rtx copy = shallow_copy_rtx (x);
+  scalar_int_mode narrow_mode_iter;
+  FOR_EACH_MODE_UNTIL (narrow_mode_iter, int_mode)
+   {
+ PUT_MODE_RAW (copy, narrow_mode_iter);
+ cselib_val *v = cselib_lookup (copy, narrow_mode_iter, 0, VOIDmode);
+ if (v)
+   {
+ rtx sub = lowpart_subreg (narrow_mode_iter, e->val_rtx, int_mode);
+ if (sub)
+   new_elt_loc_list (v, sub);
+   }
+   }
+}
+
   if (dump_file && (dump_flags & TDF_CSELIB))
 {
   fprintf (dump_file, "cselib value %u:%u ", e->uid, hash);
-- 
2.37.2



[PATCH 0/2] Variable tracking and subvalues

2022-09-07 Thread Stefan Schulze Frielinghaus via Gcc-patches
For variable tracking there exist cases where a value is moved in a
wider mode than its native mode.  For example:

int
foo (int x)
{
  bar (x);
  return x;
}

compiles on IBM zSystems into

0x010012b0 <+0>: stmg%r12,%r15,96(%r15)
0x010012b6 <+6>: lgr %r12,%r2
0x010012ba <+10>:lay %r15,-160(%r15)
0x010012c0 <+16>:brasl   %r14,0x10012a0 <bar>
0x010012c6 <+22>:lgr %r2,%r12
0x010012ca <+26>:lmg %r12,%r15,256(%r15)
0x010012d0 <+32>:br  %r14

Initially variable x with SImode is held in register r2 which is moved
to call-saved register r12 with DImode from where it is also restored.
The cselib patch records that the initial value held in r2 is a subvalue
of r12 which enables var-tracking to emit a register location entry:

(gdb) info address x
Symbol "x" is multi-location:
  Base address 0x10012b0  Range 0x10012b0-0x10012c5: a variable in $r2
  Range 0x10012c5-0x10012d0: a variable in $r12
  Range 0x10012d0-0x10012d2: a variable in $r2

However, this only works for straight-line programs and fails e.g. for

__attribute__((noinline, noclone)) void
fn1 (int x)
{
  __asm volatile ("" : "+r" (x) : : "memory");
}

__attribute__((noinline, noclone)) int
fn2 (int x, int y)
{
  if (x)
{
  fn1 (x);  // (*)
  fn1 (x);
}
  return y;
}

__attribute__((noinline, noclone)) int
fn3 (int x, int y)
{
  return fn2 (x, y);
}

int
main ()
{
  fn3 (36, 25);
  return 0;
}

At (*) variable x is moved into a call-saved register.  However, the
cselib patch does not cover this since cselib flushes its tables across
jumps.  In order not to give up entirely, the second patch falls back to
referring to an entry value.

In summary this fixes the following guality tests for IBM zSystems:

Fixed by cselib patch:
FAIL: gcc.dg/guality/pr43051-1.c   -O1  -DPREVENT_OPTIMIZATION  line 35 v == 1
FAIL: gcc.dg/guality/pr43051-1.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  -DPREVENT_OPTIMIZATION line 35 v == 1
FAIL: gcc.dg/guality/pr43051-1.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  -DPREVENT_OPTIMIZATION line 40 v == 1
FAIL: gcc.dg/guality/pr43051-1.c   -O3 -fomit-frame-pointer -funroll-loops 
-fpeel-loops -ftracer -finline-functions  -DPREVENT_OPTIMIZATION  line 35 v == 1
FAIL: gcc.dg/guality/pr43051-1.c   -O2  -DPREVENT_OPTIMIZATION  line 40 v == 1
FAIL: gcc.dg/guality/pr43051-1.c   -O2 -flto -fno-use-linker-plugin 
-flto-partition=none  -DPREVENT_OPTIMIZATION line 40 v == 1
FAIL: gcc.dg/guality/pr43051-1.c   -O3 -g  -DPREVENT_OPTIMIZATION  line 35 v == 
1
FAIL: gcc.dg/guality/pr43051-1.c   -O3 -g  -DPREVENT_OPTIMIZATION  line 40 v == 
1
FAIL: gcc.dg/guality/pr43051-1.c   -O2  -DPREVENT_OPTIMIZATION  line 35 v == 1
FAIL: gcc.dg/guality/pr43051-1.c   -O2 -flto -fno-use-linker-plugin 
-flto-partition=none  -DPREVENT_OPTIMIZATION line 35 v == 1
FAIL: gcc.dg/guality/pr43051-1.c   -O3 -fomit-frame-pointer -funroll-loops 
-fpeel-loops -ftracer -finline-functions  -DPREVENT_OPTIMIZATION  line 40 v == 1
FAIL: gcc.dg/guality/pr43051-1.c  -Og -DPREVENT_OPTIMIZATION  line 35 v == 1
FAIL: gcc.dg/guality/pr43051-1.c  -Og -DPREVENT_OPTIMIZATION  line 40 v == 1
FAIL: gcc.dg/guality/pr43051-1.c   -O1  -DPREVENT_OPTIMIZATION  line 40 v == 1

Fixed by var-tracking patch:
FAIL: gcc.dg/guality/pr54519-1.c   -O2  -DPREVENT_OPTIMIZATION  line 23 x == 98
FAIL: gcc.dg/guality/pr54519-5.c  -Og -DPREVENT_OPTIMIZATION  line 17 y == 25
FAIL: gcc.dg/guality/pr54519-1.c   -O2 -flto -fno-use-linker-plugin 
-flto-partition=none  -DPREVENT_OPTIMIZATION line 20 x == 36
FAIL: gcc.dg/guality/pr54519-1.c  -Og -DPREVENT_OPTIMIZATION  line 20 x == 36
FAIL: gcc.dg/guality/pr54519-1.c   -O2 -flto -fno-use-linker-plugin 
-flto-partition=none  -DPREVENT_OPTIMIZATION line 23 x == 98
FAIL: gcc.dg/guality/pr54551.c   -O2 -flto -fno-use-linker-plugin 
-flto-partition=none  -DPREVENT_OPTIMIZATION line 18 a == 4
FAIL: gcc.dg/guality/pr54551.c   -O3 -g  -DPREVENT_OPTIMIZATION  line 18 a == 4
FAIL: gcc.dg/guality/pr54519-1.c   -O1  -DPREVENT_OPTIMIZATION  line 23 x == 98
FAIL: gcc.dg/guality/pr54551.c   -Os  -DPREVENT_OPTIMIZATION  line 18 a == 4
FAIL: gcc.dg/guality/pr54519-5.c   -O1  -DPREVENT_OPTIMIZATION  line 17 y == 25
FAIL: gcc.dg/guality/pr54519-1.c   -O1  -DPREVENT_OPTIMIZATION  line 20 x == 36
FAIL: gcc.dg/guality/pr54519-4.c   -Os  -DPREVENT_OPTIMIZATION  line 17 y == 25
FAIL: gcc.dg/guality/pr54519-3.c   -O3 -g  -DPREVENT_OPTIMIZATION  line 23 z == 
8
FAIL: gcc.dg/guality/pr54551.c   -O2  -DPREVENT_OPTIMIZATION  line 18 a == 4
FAIL: gcc.dg/guality/pr54551.c   -O1  -DPREVENT_OPTIMIZATION  line 18 a == 4
FAIL: gcc.dg/guality/pr54519-1.c   -O2  -DPREVENT_OPTIMIZATION  line 20 x == 36
FAIL: gcc.dg/guality/pr54693-2.c   -Os  -DPREVENT_OPTIMIZATION  line 21 x == 10 
- i
FAIL: gcc.dg/guality/pr54519-5.c   -O2  -DPREVENT_OPTIMIZATION  line 17 y == 25
FAIL: gcc.dg/guality/pr54519-3.c   -Os  -DPREVENT_OPTIMIZATION  line 20 z == 6
FAIL: 

Re: [PATCH] IBM zSystems: Fix function_ok_for_sibcall [PR106355]

2022-08-24 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Wed, Aug 17, 2022 at 01:50:45PM +0200, Stefan Schulze Frielinghaus wrote:
> For a parameter with BLKmode we cannot use REG_NREGS in order to
> determine the number of consecutive registers.  Streamlined this with
> the implementation of s390_function_arg.
> 
> Fix some indentation whitespace, too.
> 
> Assuming bootstrap and regtest are ok for mainline and gcc-{10,11,12},
> ok to install for all of those?

Meanwhile bootstrap and regtest ran successfully for all branches.


[PATCH] Add further FOR_EACH_ macros

2022-08-17 Thread Stefan Schulze Frielinghaus via Gcc-patches
For my current use case only some FOR_EACH_MODE macros were missing.
Though I thought I'd give it a try, grep'ed through the source code, and
added further ones.  I didn't manually check all of them, but so far
they look good to me.
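
For context, a macro listed under ForEachMacros is formatted by
clang-format like the head of a for statement rather than like a
function call, e.g.:

  FOR_EACH_MODE_UNTIL (mode, SImode)
    do_something (mode);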

Ok for mainline?

contrib/ChangeLog:

* clang-format: Add further FOR_EACH_ macros.
---
 contrib/clang-format | 63 
 1 file changed, 63 insertions(+)

diff --git a/contrib/clang-format b/contrib/clang-format
index ceb5c1d524f..57cec1e6947 100644
--- a/contrib/clang-format
+++ b/contrib/clang-format
@@ -63,17 +63,33 @@ ForEachMacros: [
 'FOR_BB_INSNS_SAFE',
 'FOR_BODY',
 'FOR_COND',
+'FOR_EACH_2XWIDER_MODE',
+'FOR_EACH_ACTUAL_CHILD',
 'FOR_EACH_AGGR_INIT_EXPR_ARG',
 'FOR_EACH_ALIAS',
 'FOR_EACH_ALLOCNO',
+'FOR_EACH_ALLOCNO_CONFLICT',
+'FOR_EACH_ALLOCNO_IN_ALLOCNO_SET',
 'FOR_EACH_ALLOCNO_OBJECT',
 'FOR_EACH_ARTIFICIAL_DEF',
 'FOR_EACH_ARTIFICIAL_USE',
+'FOR_EACH_BB',
 'FOR_EACH_BB_FN',
+'FOR_EACH_BB_IN_BITMAP',
+'FOR_EACH_BB_IN_BITMAP_REV',
+'FOR_EACH_BB_IN_REGION',
+'FOR_EACH_BB_IN_SBITMAP',
+'FOR_EACH_BB_REVERSE',
 'FOR_EACH_BB_REVERSE_FN',
+'FOR_EACH_BB_REVERSE_IN_REGION',
 'FOR_EACH_BIT_IN_MINMAX_SET',
+'FOR_EACH_BSI_IN_REVERSE',
 'FOR_EACH_CALL_EXPR_ARG',
+'FOR_EACH_CHILD',
 'FOR_EACH_CLONE',
+'FOR_EACH_CODE_MAPPING',
+'FOR_EACH_COND_FN_PAIR',
+'FOR_EACH_CONFLICT',
 'FOR_EACH_CONST_CALL_EXPR_ARG',
 'FOR_EACH_CONSTRUCTOR_ELT',
 'FOR_EACH_CONSTRUCTOR_VALUE',
@@ -83,16 +99,27 @@ ForEachMacros: [
 'FOR_EACH_DEFINED_SYMBOL',
 'FOR_EACH_DEFINED_VARIABLE',
 'FOR_EACH_DEP',
+'FOR_EACH_DEP_LINK',
 'FOR_EACH_EDGE',
+'FOR_EACH_ELEMENT',
+'FOR_EACH_ELIM_GRAPH_PRED',
+'FOR_EACH_ELIM_GRAPH_SUCC',
 'FOR_EACH_EXPR',
 'FOR_EACH_EXPR_1',
+'FOR_EACH_EXPR_ID_IN_SET',
+'FOR_EACH_FLOAT_OPERATOR',
+'FOR_EACH_FP_TYPE',
 'FOR_EACH_FUNCTION',
 'FOREACH_FUNCTION_ARGS',
 'FOREACH_FUNCTION_ARGS_PTR',
 'FOR_EACH_FUNCTION_WITH_GIMPLE_BODY',
+'FOR_EACH_GORI_EXPORT_NAME',
+'FOR_EACH_GORI_IMPORT_NAME',
 'FOR_EACH_HASH_TABLE_ELEMENT',
+'FOR_EACH_HTAB_ELEMENT',
 'FOR_EACH_IMM_USE_FAST',
 'FOR_EACH_IMM_USE_ON_STMT',
+'FOR_EACH_IMM_USE_SAFE',
 'FOR_EACH_IMM_USE_STMT',
 'FOR_EACH_INSN',
 'FOR_EACH_INSN_1',
@@ -103,32 +130,68 @@ ForEachMacros: [
 'FOR_EACH_INSN_INFO_MW',
 'FOR_EACH_INSN_INFO_USE',
 'FOR_EACH_INSN_USE',
+'FOR_EACH_INT_OPERATOR',
+'FOR_EACH_INT_TYPE',
+'FOR_EACH_INV',
+'FOR_EACH_LOAD_BROADCAST',
+'FOR_EACH_LOAD_BROADCAST_IMM',
 'FOR_EACH_LOCAL_DECL',
+'FOR_EACH_LOG_LINK',
 'FOR_EACH_LOOP',
+'FOR_EACH_LOOP_BREAK',
 'FOR_EACH_LOOP_FN',
+'FOR_EACH_MODE',
+'FOR_EACH_MODE_FROM',
+'FOR_EACH_MODE_IN_CLASS',
+'FOR_EACH_MODE_UNTIL',
+'FOR_EACH_NEST_INFO',
 'FOR_EACH_OBJECT',
 'FOR_EACH_OBJECT_CONFLICT',
+'FOR_EACH_OP',
+'FOR_EACH_PARTITION_PAIR',
+'FOR_EACH_PHI',
 'FOR_EACH_PHI_ARG',
 'FOR_EACH_PHI_OR_STMT_DEF',
 'FOR_EACH_PHI_OR_STMT_USE',
+'FOR_EACH_POP',
 'FOR_EACH_PREF',
+'FOR_EACH_REF',
+'FOR_EACH_REFERENCED_VAR',
+'FOR_EACH_REFERENCED_VAR_IN_BITMAP',
+'FOR_EACH_REFERENCED_VAR_SAFE',
+'FOR_EACH_REF_REV',
+'FOR_EACH_REGNO',
 'FOR_EACH_SCALAR',
+'FOR_EACH_SIGNED_TYPE',
+'FOR_EACH_SSA',
 'FOR_EACH_SSA_DEF_OPERAND',
+'FOR_EACH_SSA_MAYDEF_OPERAND',
+'FOR_EACH_SSA_MUST_AND_MAY_DEF_OPERAND',
+'FOR_EACH_SSA_MUSTDEF_OPERAND',
+'FOR_EACH_SSA_NAME',
 'FOR_EACH_SSA_TREE_OPERAND',
 'FOR_EACH_SSA_USE_OPERAND',
+'FOR_EACH_SSA_VDEF_OPERAND',
 'FOR_EACH_STATIC_INITIALIZER',
+'FOR_EACH_STATIC_VARIABLE',
+'FOR_EACH_STMT_IN_REVERSE',
+'FOR_EACH_SUBINSN',
 'FOR_EACH_SUBRTX',
 'FOR_EACH_SUBRTX_PTR',
 'FOR_EACH_SUBRTX_VAR',
 'FOR_EACH_SUCC',
 'FOR_EACH_SUCC_1',
 'FOR_EACH_SYMBOL',
+'FOR_EACH_TYPE',
+'FOR_EACH_UNSIGNED_TYPE',
+'FOR_EACH_VALUE_ID_IN_SET',
 'FOR_EACH_VARIABLE',
 'FOR_EACH_VEC_ELT',
 'FOR_EACH_VEC_ELT_FROM',
 'FOR_EACH_VEC_ELT_REVERSE',
 'FOR_EACH_VEC_SAFE_ELT',
 'FOR_EACH_VEC_SAFE_ELT_REVERSE',
+'FOR_EACH_WIDER_MODE',
 'FOR_EXPR',
 'FOR_INIT_STMT',
 'FOR_SCOPE'
-- 
2.35.3



[PATCH] IBM zSystems: Fix function_ok_for_sibcall [PR106355]

2022-08-17 Thread Stefan Schulze Frielinghaus via Gcc-patches
For a parameter with BLKmode we cannot use REG_NREGS in order to
determine the number of consecutive registers.  Streamlined this with
the implementation of s390_function_arg.
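
For context, REG_NREGS derives the register count from the mode of the
rtx, which cannot work for BLKmode.  The patch instead recomputes the
count from the argument size; condensed from the hunk below:

  int size = s390_function_arg_size (arg.mode, arg.type);
  int nregs = (size + UNITS_PER_LONG - 1) / UNITS_PER_LONG;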

Fix some indentation whitespace, too.

Assuming bootstrap and regtest are ok for mainline and gcc-{10,11,12},
ok to install for all of those?

PR target/106355

gcc/ChangeLog:

* config/s390/s390.cc (s390_call_saved_register_used): For a
parameter with BLKmode fix determining number of consecutive
registers.

gcc/testsuite/ChangeLog:

* gcc.target/s390/pr106355.h: Common code for new tests.
* gcc.target/s390/pr106355-1.c: New test.
* gcc.target/s390/pr106355-2.c: New test.
* gcc.target/s390/pr106355-3.c: New test.
---
 gcc/config/s390/s390.cc| 47 +++---
 gcc/testsuite/gcc.target/s390/pr106355-1.c | 42 +++
 gcc/testsuite/gcc.target/s390/pr106355-2.c |  8 
 gcc/testsuite/gcc.target/s390/pr106355-3.c |  8 
 gcc/testsuite/gcc.target/s390/pr106355.h   | 18 +
 5 files changed, 100 insertions(+), 23 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/s390/pr106355-1.c
 create mode 100644 gcc/testsuite/gcc.target/s390/pr106355-2.c
 create mode 100644 gcc/testsuite/gcc.target/s390/pr106355-3.c
 create mode 100644 gcc/testsuite/gcc.target/s390/pr106355.h

diff --git a/gcc/config/s390/s390.cc b/gcc/config/s390/s390.cc
index 5aaf76a9490..85e5b2cb2a2 100644
--- a/gcc/config/s390/s390.cc
+++ b/gcc/config/s390/s390.cc
@@ -13712,36 +13712,37 @@ s390_call_saved_register_used (tree call_expr)
   function_arg_info arg (TREE_TYPE (parameter), /*named=*/true);
   apply_pass_by_reference_rules (&cum_v, arg);
 
-   parm_rtx = s390_function_arg (cum, arg);
+  parm_rtx = s390_function_arg (cum, arg);
 
-   s390_function_arg_advance (cum, arg);
+  s390_function_arg_advance (cum, arg);
 
-   if (!parm_rtx)
-continue;
-
-   if (REG_P (parm_rtx))
-{
-  for (reg = 0; reg < REG_NREGS (parm_rtx); reg++)
-if (!call_used_or_fixed_reg_p (reg + REGNO (parm_rtx)))
-  return true;
-}
+  if (!parm_rtx)
+   continue;
 
-   if (GET_CODE (parm_rtx) == PARALLEL)
-{
-  int i;
+  if (REG_P (parm_rtx))
+   {
+ int size = s390_function_arg_size (arg.mode, arg.type);
+ int nregs = (size + UNITS_PER_LONG - 1) / UNITS_PER_LONG;
 
-  for (i = 0; i < XVECLEN (parm_rtx, 0); i++)
-{
-  rtx r = XEXP (XVECEXP (parm_rtx, 0, i), 0);
+ for (reg = 0; reg < nregs; reg++)
+   if (!call_used_or_fixed_reg_p (reg + REGNO (parm_rtx)))
+ return true;
+   }
+  else if (GET_CODE (parm_rtx) == PARALLEL)
+   {
+ int i;
 
-  gcc_assert (REG_P (r));
+ for (i = 0; i < XVECLEN (parm_rtx, 0); i++)
+   {
+ rtx r = XEXP (XVECEXP (parm_rtx, 0, i), 0);
 
-  for (reg = 0; reg < REG_NREGS (r); reg++)
-if (!call_used_or_fixed_reg_p (reg + REGNO (r)))
-  return true;
-}
-}
+ gcc_assert (REG_P (r));
+ gcc_assert (REG_NREGS (r) == 1);
 
+ if (!call_used_or_fixed_reg_p (REGNO (r)))
+   return true;
+   }
+   }
 }
   return false;
 }
diff --git a/gcc/testsuite/gcc.target/s390/pr106355-1.c 
b/gcc/testsuite/gcc.target/s390/pr106355-1.c
new file mode 100644
index 000..1ec0f6b25ac
--- /dev/null
+++ b/gcc/testsuite/gcc.target/s390/pr106355-1.c
@@ -0,0 +1,42 @@
+/* { dg-do compile } */
+/* { dg-options "-foptimize-sibling-calls" } */
+/* { dg-final { scan-assembler {brasl\t%r\d+,bar4} } } */
+/* { dg-final { scan-assembler {brasl\t%r\d+,bar8} } } */
+
+/* Parameter E is passed in GPR 6 which is call-saved which prohibits
+   sibling call optimization.  This must hold true also if the mode of the
+   parameter is BLKmode.  */
+
+/* 4 byte */
+
+typedef struct
+{
+  char x;
+  char y[3];
+} t4;
+
+extern t4 e4;
+
+extern void bar4 (int a, int b, int c, int d, t4 e4);
+
+void foo4 (int a, int b, int c, int d)
+{
+  bar4 (a, b, c, d, e4);
+}
+
+/* 8 byte */
+
+typedef struct
+{
+  short x;
+  char y[6];
+} t8;
+
+extern t8 e8;
+
+extern void bar8 (int a, int b, int c, int d, t8 e8);
+
+void foo8 (int a, int b, int c, int d)
+{
+  bar8 (a, b, c, d, e8);
+}
diff --git a/gcc/testsuite/gcc.target/s390/pr106355-2.c 
b/gcc/testsuite/gcc.target/s390/pr106355-2.c
new file mode 100644
index 000..ddbdba5d278
--- /dev/null
+++ b/gcc/testsuite/gcc.target/s390/pr106355-2.c
@@ -0,0 +1,8 @@
+/* { dg-do compile { target { s390-*-* } } } */
+/* { dg-options "-foptimize-sibling-calls -mzarch" } */
+/* { dg-final { scan-assembler {brasl\t%r\d+,bar} } } */
+
+/* This tests function s390_call_saved_register_used where
+   GET_CODE (parm_rtx) == PARALLEL holds.  */
+
+#include "pr106355.h"
diff --git 

Re: *PING* [PATCH 0/4] Use pointer arithmetic for array references [PR102043]

2022-05-02 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Tue, Apr 26, 2022 at 04:40:33PM +0200, Hans-Peter Nilsson wrote:
> > From: Thomas Koenig via Gcc-patches 
> > Date: Fri, 22 Apr 2022 15:59:45 +0200
> 
> > Hi Mikael,
> > 
> > > Ping for the four patches starting at 
> > > https://gcc.gnu.org/pipermail/fortran/2022-April/057759.html :
> > > https://gcc.gnu.org/pipermail/fortran/2022-April/057757.html
> > > https://gcc.gnu.org/pipermail/fortran/2022-April/057760.html
> > > https://gcc.gnu.org/pipermail/fortran/2022-April/057758.html
> > > https://gcc.gnu.org/pipermail/fortran/2022-April/057761.html
> > > 
> > > Richi accepted the general direction and the middle-end interaction.
> > > I need a fortran frontend ack as well.
> > 
> > Looks good to me.
> > 
> > Thanks a lot for taking this on! This would have been a serious
> > regression if released with gcc 12.
> > 
> > Best regards
> > 
> > Thomas
> 
> These, or specifically r12-8227-g89ca0fffa48b79, "fortran:
> Pre-evaluate string pointers. [PR102043]" have further
> exposed (the issue existed before but now fails for more
> platforms) PR78054 "gfortran.dg/pr70673.f90 FAILs at -O0",
> at least for cris-elf and apparently also
> s390x-ibm-linux-gnu.
> 
> In the PR it is mentioned that running the test through
> valgrind shows invalid accesses also on x86_64-linux-gnu.
> Could it be that the test-case is invalid and has undefined
> behavior?  I don't know fortran so I can't tell.
> 
> That exact commit causing a regression for s390x is somewhat
> an assumption based on posted date and testresults, as the
> s390x results don't include a git version.  (@Stefansf: I'm
> referring to
> https://gcc.gnu.org/pipermail/gcc-testresults/2022-April/760060.html
> https://gcc.gnu.org/pipermail/gcc-testresults/2022-April/760137.html
> Perhaps that tester isn't using the contrib/gcc_update and
> contrib/test_summary scripts, thus no LAST_UPDATED
> included?)

Indeed the reports don't include a git commit id.  We are using both
scripts.  However, since the git repository is set up differently in our
case, we had been using `gcc_update --touch` only.  Thus the files
LAST_UPDATED as well as gcc/REVISION were not created.  I have changed
that so that both are created now.  Thanks for letting me know!

Cheers,
Stefan


[PATCH] IBM Z: Fix load-and-test peephole2 condition

2021-11-19 Thread Stefan Schulze Frielinghaus via Gcc-patches
In a peephole2 condition, the variable insn points to the first matched
insn.  In order to refer to the second matched insn, use
peep2_next_insn (1) instead.
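
Schematically, the two insns matched by the peephole2 are (a condensed
sketch, not the actual pattern):

  (set (reg:GPR x) (mem:GPR ...))                     ;; insn
  (set (reg CC) (compare (reg:GPR x) (const_int 0)))  ;; peep2_next_insn (1)

and the CC mode check belongs to the second insn, which is the one
setting CC.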

Update: Added a test case

Bootstrapped and regtested on IBM Z.  Ok for mainline and gcc-{11,10,9}?

gcc/ChangeLog:

* config/s390/s390.md (define_peephole2): Variable insn points
to the first matched insn.  Use peep2_next_insn(1) to refer to
the second matched insn.

gcc/testsuite/ChangeLog:

* gcc.target/s390/2029.c: New test.
---
 gcc/config/s390/s390.md  |  2 +-
 gcc/testsuite/gcc.target/s390/2029.c | 12 
 2 files changed, 13 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/s390/2029.c

diff --git a/gcc/config/s390/s390.md b/gcc/config/s390/s390.md
index 4debdcd1247..c4f92bde061 100644
--- a/gcc/config/s390/s390.md
+++ b/gcc/config/s390/s390.md
@@ -1003,7 +1003,7 @@
(match_operand:GPR 2 "memory_operand"))
(set (reg CC_REGNUM)
(compare (match_dup 0) (match_operand:GPR 1 "const0_operand")))]
-  "s390_match_ccmode(insn, CCSmode) && TARGET_EXTIMM
+  "s390_match_ccmode (peep2_next_insn (1), CCSmode) && TARGET_EXTIMM
&& GENERAL_REG_P (operands[0])
&& satisfies_constraint_T (operands[2])
&& !contains_constant_pool_address_p (operands[2])"
diff --git a/gcc/testsuite/gcc.target/s390/2029.c 
b/gcc/testsuite/gcc.target/s390/2029.c
new file mode 100644
index 000..1a6df4f4b89
--- /dev/null
+++ b/gcc/testsuite/gcc.target/s390/2029.c
@@ -0,0 +1,12 @@
+/* { dg-do run } */
+/* { dg-options "-Os -march=z10" } */
+signed char a;
+int b = -925974181, c;
+unsigned *d = &b;
+int *e = &c;
+int main() {
+  *e = ((217 ^ a) > 585) < *d;
+  if (c != 1)
+__builtin_abort();
+  return 0;
+}
-- 
2.31.1



[PATCH] IBM Z: Fix load-and-test peephole2 condition

2021-11-18 Thread Stefan Schulze Frielinghaus via Gcc-patches
In a peephole2 condition, the variable insn points to the first matched
insn.  In order to refer to the second matched insn, use
peep2_next_insn (1) instead.

Bootstrapped and regtested on IBM Z.  Ok for mainline and gcc-{11,10,9}?

gcc/ChangeLog:

* config/s390/s390.md (define_peephole2): Variable insn points
to the first matched insn.  Use peep2_next_insn(1) to refer to
the second matched insn.
---
 gcc/config/s390/s390.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/s390/s390.md b/gcc/config/s390/s390.md
index 4debdcd1247..c4f92bde061 100644
--- a/gcc/config/s390/s390.md
+++ b/gcc/config/s390/s390.md
@@ -1003,7 +1003,7 @@
(match_operand:GPR 2 "memory_operand"))
(set (reg CC_REGNUM)
(compare (match_dup 0) (match_operand:GPR 1 "const0_operand")))]
-  "s390_match_ccmode(insn, CCSmode) && TARGET_EXTIMM
+  "s390_match_ccmode (peep2_next_insn (1), CCSmode) && TARGET_EXTIMM
&& GENERAL_REG_P (operands[0])
&& satisfies_constraint_T (operands[2])
&& !contains_constant_pool_address_p (operands[2])"
-- 
2.31.1



Re: [PATCH] IBM Z: ldist-{rawmemchr, strlen} tests require vector extensions

2021-11-05 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Tue, Nov 02, 2021 at 04:20:01PM +0100, Andreas Schwab wrote:
> On Nov 02 2021, Stefan Schulze Frielinghaus via Gcc-patches wrote:
> 
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c 
> > b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
> > index 6abfd278351..bf6335f6360 100644
> > --- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
> > @@ -1,5 +1,6 @@
> >  /* { dg-do run { target s390x-*-* } } */
> >  /* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } 
> > */
> > +/* { dg-additional-options "-march=z13 -mzarch" { target s390x-*-* } } */
> 
> I think that should use an effective_target check.

Thanks for the hint.  I wasn't aware of those checks.  I replaced all
"target s390x-*-*" checks with "target s390_vx".  The latter tests
whether the toolchain and the actual machine are capable of dealing with
vector extensions.

Ok for mainline?
From 86b46ae8cb3c014739f783a88043951c996deb61 Mon Sep 17 00:00:00 2001
From: Stefan Schulze Frielinghaus 
Date: Fri, 5 Nov 2021 09:05:01 +0100
Subject: [PATCH] IBM Z: ldist-{rawmemchr,strlen} tests require vector
 extensions fixup

This is a fixup for 64bf0c835f8918adf7e4140a04ac79c2963204aa.  Using the
effective-target check s390_vx is more robust, e.g. when trying to run
the test on a machine older than z13.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/ldist-rawmemchr-1.c: Replace s390x-*-* by
s390_vx.
* gcc.dg/tree-ssa/ldist-rawmemchr-2.c: Likewise.
* gcc.dg/tree-ssa/ldist-strlen-1.c: Likewise.
* gcc.dg/tree-ssa/ldist-strlen-3.c: Likewise.
---
 gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c | 10 +-
 gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c | 10 +-
 gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c|  6 +++---
 gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-3.c|  4 ++--
 4 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
index bf6335f6360..8e7f1f868fe 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
@@ -1,9 +1,9 @@
-/* { dg-do run { target s390x-*-* } } */
+/* { dg-do run { target s390_vx } } */
 /* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
-/* { dg-additional-options "-march=z13 -mzarch" { target s390x-*-* } } */
-/* { dg-final { scan-tree-dump-times "generated rawmemchrQI" 2 "ldist" { 
target s390x-*-* } } } */
-/* { dg-final { scan-tree-dump-times "generated rawmemchrHI" 2 "ldist" { 
target s390x-*-* } } } */
-/* { dg-final { scan-tree-dump-times "generated rawmemchrSI" 2 "ldist" { 
target s390x-*-* } } } */
+/* { dg-additional-options "-march=z13 -mzarch" { target s390_vx } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrQI" 2 "ldist" { 
target s390_vx } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrHI" 2 "ldist" { 
target s390_vx } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrSI" 2 "ldist" { 
target s390_vx } } } */
 
 /* Rawmemchr pattern: reduction stmt and no store */
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
index 83f5a35a322..0959d4b8f2a 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
@@ -1,9 +1,9 @@
-/* { dg-do run { target s390x-*-* } } */
+/* { dg-do run { target s390_vx } } */
 /* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
-/* { dg-additional-options "-march=z13 -mzarch" { target s390x-*-* } } */
-/* { dg-final { scan-tree-dump-times "generated rawmemchrQI" 2 "ldist" { 
target s390x-*-* } } } */
-/* { dg-final { scan-tree-dump-times "generated rawmemchrHI" 2 "ldist" { 
target s390x-*-* } } } */
-/* { dg-final { scan-tree-dump-times "generated rawmemchrSI" 2 "ldist" { 
target s390x-*-* } } } */
+/* { dg-additional-options "-march=z13 -mzarch" { target s390_vx } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrQI" 2 "ldist" { 
target s390_vx } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrHI" 2 "ldist" { 
target s390_vx } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrSI" 2 "ldist" { 
target s390_vx } } } */
 
 /* Rawmemchr pattern: reduction stmt and store */
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c
index aeb04b91f6b..dff57

[PATCH] IBM Z: Free bbs in s390_loop_unroll_adjust

2021-11-02 Thread Stefan Schulze Frielinghaus via Gcc-patches
Bootstrapped and regtested on IBM Z.  Ok for mainline?

gcc/ChangeLog:

* config/s390/s390.c (s390_loop_unroll_adjust): In case of early
exit free bbs.
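
For context, bbs in s390_loop_unroll_adjust comes from get_loop_body,
which returns a heap-allocated array, so every exit path has to release
it.  A condensed sketch (the predicate name is made up):

  basic_block *bbs = get_loop_body (loop);
  for (unsigned i = 0; i < loop->num_nodes; i++)
    if (unsupported_insn_p (bbs[i]))  /* hypothetical early-exit check */
      {
        free (bbs);                   /* previously leaked on this path */
        return 1;
      }
  free (bbs);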
---
 gcc/config/s390/s390.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/gcc/config/s390/s390.c b/gcc/config/s390/s390.c
index b2f2f6417b3..510e7f58a3b 100644
--- a/gcc/config/s390/s390.c
+++ b/gcc/config/s390/s390.c
@@ -15400,7 +15400,10 @@ s390_loop_unroll_adjust (unsigned nunroll, struct loop 
*loop)
  || (GET_CODE (SET_SRC (set)) == COMPARE
  && GET_MODE (XEXP (SET_SRC (set), 0)) == BLKmode
  && GET_MODE (XEXP (SET_SRC (set), 1)) == BLKmode)))
-   return 1;
+   {
+ free (bbs);
+ return 1;
+   }
 
  FOR_EACH_SUBRTX (iter, array, PATTERN (insn), NONCONST)
if (MEM_P (*iter))
-- 
2.31.1



[PATCH] IBM Z: ldist-{rawmemchr, strlen} tests require vector extensions

2021-11-02 Thread Stefan Schulze Frielinghaus via Gcc-patches
The tests require vector extensions which are only available for z13 and
later while using the z/Architecture.

Bootstrapped and regtested on IBM Z.  Ok for mainline?

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/ldist-rawmemchr-1.c: For IBM Z set arch to z13
and use z/Architecture since the tests require vector extensions.
* gcc.dg/tree-ssa/ldist-rawmemchr-2.c: Likewise.
* gcc.dg/tree-ssa/ldist-strlen-1.c: Likewise.
* gcc.dg/tree-ssa/ldist-strlen-3.c: Likewise.
---
 gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c | 1 +
 gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c | 1 +
 gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c| 1 +
 gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-3.c| 1 +
 4 files changed, 4 insertions(+)

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
index 6abfd278351..bf6335f6360 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
@@ -1,5 +1,6 @@
 /* { dg-do run { target s390x-*-* } } */
 /* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-additional-options "-march=z13 -mzarch" { target s390x-*-* } } */
 /* { dg-final { scan-tree-dump-times "generated rawmemchrQI" 2 "ldist" { 
target s390x-*-* } } } */
 /* { dg-final { scan-tree-dump-times "generated rawmemchrHI" 2 "ldist" { 
target s390x-*-* } } } */
 /* { dg-final { scan-tree-dump-times "generated rawmemchrSI" 2 "ldist" { 
target s390x-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
index 00d6ea0f8e9..83f5a35a322 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
@@ -1,5 +1,6 @@
 /* { dg-do run { target s390x-*-* } } */
 /* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-additional-options "-march=z13 -mzarch" { target s390x-*-* } } */
 /* { dg-final { scan-tree-dump-times "generated rawmemchrQI" 2 "ldist" { 
target s390x-*-* } } } */
 /* { dg-final { scan-tree-dump-times "generated rawmemchrHI" 2 "ldist" { 
target s390x-*-* } } } */
 /* { dg-final { scan-tree-dump-times "generated rawmemchrSI" 2 "ldist" { 
target s390x-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c
index 918b60099e4..aeb04b91f6b 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c
@@ -1,5 +1,6 @@
 /* { dg-do run } */
 /* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-additional-options "-march=z13 -mzarch" { target s390x-*-* } } */
 /* { dg-final { scan-tree-dump-times "generated strlenQI\n" 4 "ldist" } } */
 /* { dg-final { scan-tree-dump-times "generated strlenHI\n" 4 "ldist" { target 
s390x-*-* } } } */
 /* { dg-final { scan-tree-dump-times "generated strlenSI\n" 4 "ldist" { target 
s390x-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-3.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-3.c
index 370fd5eb088..0652857265a 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-3.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-additional-options "-march=z13 -mzarch" { target s390x-*-* } } */
 /* { dg-final { scan-tree-dump-times "generated strlenSI\n" 1 "ldist" { target 
s390x-*-* } } } */
 
 extern int s[];
-- 
2.31.1



[PATCH] IBM Z: Fix address of operands will never be NULL warnings

2021-10-30 Thread Stefan Schulze Frielinghaus via Gcc-patches
Since a recent enhancement of -Waddress a couple of warnings are emitted
and turned into errors during bootstrap:

gcc/config/s390/s390.md:12087:25: error: the address of 'operands' will never 
be NULL [-Werror=address]
12087 |   "TARGET_HTM && operands != NULL
build/gencondmd.c:59:12: note: 'operands' declared here
   59 | extern rtx operands[];
      |            ^~~~~~~~

Fixed by removing those non-null checks.
Bootstrapped and regtested on IBM Z.  Ok for mainline?
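
A minimal stand-alone reproducer of this warning class (a sketch, not
taken from the tree):

  #include <stddef.h>

  extern int operands[];

  int
  f (void)
  {
    return operands != NULL;  /* the address of an array is never NULL */
  }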

gcc/ChangeLog:

* config/s390/s390.md ("*cc_to_int", "tabort", "*tabort_1",
"*tabort_1_plus"): Remove operands non-null check.
---
 gcc/config/s390/s390.md | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/gcc/config/s390/s390.md b/gcc/config/s390/s390.md
index b8bdbaec468..4debdcd1247 100644
--- a/gcc/config/s390/s390.md
+++ b/gcc/config/s390/s390.md
@@ -3533,7 +3533,7 @@
   [(set (match_operand:SI 0 "nonimmediate_operand" "=d")
 (unspec:SI [(match_operand 1 "register_operand" "0")]
UNSPEC_CC_TO_INT))]
-  "operands != NULL"
+  ""
   "#"
   "reload_completed"
   [(set (match_dup 0) (lshiftrt:SI (match_dup 0) (const_int 28)))])
@@ -12062,7 +12062,7 @@
 (define_expand "tabort"
   [(unspec_volatile [(match_operand:SI 0 "nonmemory_operand" "")]
UNSPECV_TABORT)]
-  "TARGET_HTM && operands != NULL"
+  "TARGET_HTM"
 {
   if (CONST_INT_P (operands[0])
   && INTVAL (operands[0]) >= 0 && INTVAL (operands[0]) <= 255)
@@ -12076,7 +12076,7 @@
 (define_insn "*tabort_1"
   [(unspec_volatile [(match_operand:SI 0 "nonmemory_operand" "aJ")]
UNSPECV_TABORT)]
-  "TARGET_HTM && operands != NULL"
+  "TARGET_HTM"
   "tabort\t%Y0"
   [(set_attr "op_type" "S")])
 
@@ -12084,8 +12084,7 @@
   [(unspec_volatile [(plus:SI (match_operand:SI 0 "register_operand"  "a")
  (match_operand:SI 1 "const_int_operand" "J"))]
UNSPECV_TABORT)]
-  "TARGET_HTM && operands != NULL
-   && CONST_OK_FOR_CONSTRAINT_P (INTVAL (operands[1]), 'J', \"J\")"
+  "TARGET_HTM && CONST_OK_FOR_CONSTRAINT_P (INTVAL (operands[1]), 'J', \"J\")"
   "tabort\t%1(%0)"
   [(set_attr "op_type" "S")])
 
-- 
2.31.1



Re: [PATCH] regcprop: Determine subreg offset depending on endianness [PR101260]

2021-10-29 Thread Stefan Schulze Frielinghaus via Gcc-patches
ping

On Mon, Oct 11, 2021 at 02:14:53PM +0200, Stefan Schulze Frielinghaus wrote:
> On Mon, Oct 11, 2021 at 09:38:36AM +0200, Richard Biener wrote:
> > On Fri, Oct 8, 2021 at 1:31 PM Stefan Schulze Frielinghaus via
> > Gcc-patches  wrote:
> > >
> > > gcc/ChangeLog:
> > >
> > > * regcprop.c (maybe_mode_change): Determine offset relative to
> > > high or low part depending on endianness.
> > >
> > > Bootstrapped and regtested on IBM Z. Ok for mainline and gcc-{11,10,9}?
> > 
> > Is there a testcase to add?
> 
> I've updated the patch and added the testcase from the PR.
> 
> > 
> > > ---
> > >  gcc/regcprop.c | 11 ---
> > >  1 file changed, 8 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/gcc/regcprop.c b/gcc/regcprop.c
> > > index d2a01130fe1..0e1ac12458a 100644
> > > --- a/gcc/regcprop.c
> > > +++ b/gcc/regcprop.c
> > > @@ -414,9 +414,14 @@ maybe_mode_change (machine_mode orig_mode, 
> > > machine_mode copy_mode,
> > > copy_nregs, &bytes_per_reg))
> > > return NULL_RTX;
> > >poly_uint64 copy_offset = bytes_per_reg * (copy_nregs - use_nregs);
> > > -  poly_uint64 offset
> > > -   = subreg_size_lowpart_offset (GET_MODE_SIZE (new_mode) + 
> > > copy_offset,
> > > - GET_MODE_SIZE (orig_mode));
> > > +  poly_uint64 offset =
> > > +#if WORDS_BIG_ENDIAN
> > > +   subreg_size_highpart_offset
> > > +#else
> > > +   subreg_size_lowpart_offset
> > > +#endif
> > > +   (GET_MODE_SIZE (new_mode) + 
> > > copy_offset,
> > > +GET_MODE_SIZE (orig_mode));
> > >regno += subreg_regno_offset (regno, orig_mode, offset, new_mode);
> > >if (targetm.hard_regno_mode_ok (regno, new_mode))
> > > return gen_raw_REG (new_mode, regno);
> > > --
> > > 2.31.1
> > >

> From 299959788321e21c27f0d4a6d437a586c5f6c92e Mon Sep 17 00:00:00 2001
> From: Stefan Schulze Frielinghaus 
> Date: Mon, 4 Oct 2021 09:36:21 +0200
> Subject: [PATCH] regcprop: Determine subreg offset depending on endianness
>  [PR101260]
> 
> gcc/ChangeLog:
> 
>   * regcprop.c (maybe_mode_change): Determine offset relative to
>   high or low part depending on endianness.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/pr101260.c: New test.
> ---
>  gcc/regcprop.c  | 11 ++--
>  gcc/testsuite/gcc.dg/pr101260.c | 49 +
>  2 files changed, 57 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/pr101260.c
> 
> diff --git a/gcc/regcprop.c b/gcc/regcprop.c
> index d2a01130fe1..0e1ac12458a 100644
> --- a/gcc/regcprop.c
> +++ b/gcc/regcprop.c
> @@ -414,9 +414,14 @@ maybe_mode_change (machine_mode orig_mode, machine_mode 
> copy_mode,
>   copy_nregs, _per_reg))
>   return NULL_RTX;
>poly_uint64 copy_offset = bytes_per_reg * (copy_nregs - use_nregs);
> -  poly_uint64 offset
> - = subreg_size_lowpart_offset (GET_MODE_SIZE (new_mode) + copy_offset,
> -   GET_MODE_SIZE (orig_mode));
> +  poly_uint64 offset =
> +#if WORDS_BIG_ENDIAN
> + subreg_size_highpart_offset
> +#else
> + subreg_size_lowpart_offset
> +#endif
> + (GET_MODE_SIZE (new_mode) + copy_offset,
> +  GET_MODE_SIZE (orig_mode));
>regno += subreg_regno_offset (regno, orig_mode, offset, new_mode);
>if (targetm.hard_regno_mode_ok (regno, new_mode))
>   return gen_raw_REG (new_mode, regno);
> diff --git a/gcc/testsuite/gcc.dg/pr101260.c b/gcc/testsuite/gcc.dg/pr101260.c
> new file mode 100644
> index 000..0e9ec4e203a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr101260.c
> @@ -0,0 +1,49 @@
> +/* PR rtl-optimization/101260 */
> +/* { dg-do run } */
> +/* { dg-options -O1 } */
> +struct a {
> +  unsigned b : 7;
> +  int c;
> +  int d;
> +  short e;
> +} p, *q = &p;
> +int f, g, h, i, r, s;
> +static short j[8][1][6] = {0};
> +char k[7];
> +short l, m;
> +int *n;
> +int **o = &n;
> +void t() {
> +  for (; f;)
> +;
> +}
> +static struct a u(int x) {
> +  struct a a = {4, 8, 5, 4};
> +  for (; i <= 6; i++) {
> +struct a v = {0};
> +for (; l; l++)
> +  h = 0;
> +for (; h >= 0; h--) {
> +  struct a *w;
> +  j[i];
> +  w = &v;
> +  s = 0;
> +  for (; s < 3; s++) {
> +r ^= x;
> +m = j[i][g][h] == (k[g] = g);
> +*w = v;
> +  }
> +  r = 2;
> +  for (; r; r--)
> +*o = &h;
> +}
> +  }
> +  t();
> +  return a;
> +}
> +int main() {
> +  *q = u(636);
> +  if (p.b != 4)
> +__builtin_abort ();
> +  return 0;
> +}
> -- 
> 2.31.1
> 



[COMMITED] tree-optimization/102752: Fix determining precission of reduction_var

2021-10-15 Thread Stefan Schulze Frielinghaus via Gcc-patches
While determining the precision of reduction_var, the SSA_NAME itself
instead of its TREE_TYPE was used.  Streamlined with other
TREE_TYPE (reduction_var) uses.
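
Condensed from the hunk below, the fix amounts to:

  /* Before: TYPE_PRECISION applied to the SSA_NAME itself.  */
  widest_int n2 = wi::lshift (1, TYPE_PRECISION (reduction_var));
  /* After: query the precision of its type.  */
  widest_int n2 = wi::lshift (1, TYPE_PRECISION (TREE_TYPE (reduction_var)));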

Bootstrapped and regtested on x86 and IBM Z.  Committed as per PR102752.

gcc/ChangeLog:

* tree-loop-distribution.c (reduction_var_overflows_first):
Pass the type of reduction_var as first argument as it is also
done for the load type.
(loop_distribution::transform_reduction_loop): Add missing
TREE_TYPE while determining precision of reduction_var.
---
 gcc/tree-loop-distribution.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index fb9250031b5..583c01a42d8 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -3425,12 +3425,12 @@ generate_strlen_builtin_using_rawmemchr (loop_p loop, 
tree reduction_var,
 
 /* Return true if we can count at least as many characters by taking pointer
difference as we can count via reduction_var without an overflow.  Thus
-   compute 2^n < (2^(m-1) / s) where n = TYPE_PRECISION (reduction_var),
+   compute 2^n < (2^(m-1) / s) where n = TYPE_PRECISION (reduction_var_type),
m = TYPE_PRECISION (ptrdiff_type_node), and s = size of each character.  */
 static bool
-reduction_var_overflows_first (tree reduction_var, tree load_type)
+reduction_var_overflows_first (tree reduction_var_type, tree load_type)
 {
-  widest_int n2 = wi::lshift (1, TYPE_PRECISION (reduction_var));;
+  widest_int n2 = wi::lshift (1, TYPE_PRECISION (reduction_var_type));;
   widest_int m2 = wi::lshift (1, TYPE_PRECISION (ptrdiff_type_node) - 1);
   widest_int s = wi::to_widest (TYPE_SIZE_UNIT (load_type));
   return wi::ltu_p (n2, wi::udiv_trunc (m2, s));
@@ -3654,6 +3654,7 @@ loop_distribution::transform_reduction_loop (loop_p loop)
   && integer_onep (reduction_iv.step))
 {
   location_t loc = gimple_location (DR_STMT (load_dr));
+  tree reduction_var_type = TREE_TYPE (reduction_var);
   /* While determining the length of a string an overflow might occur.
 If an overflow only occurs in the loop implementation and not in the
 strlen implementation, then either the overflow is undefined or the
@@ -3680,8 +3681,8 @@ loop_distribution::transform_reduction_loop (loop_p loop)
  && TYPE_PRECISION (load_type) == TYPE_PRECISION (char_type_node)
  && ((TYPE_PRECISION (sizetype) >= TYPE_PRECISION (ptr_type_node) - 1
   && TYPE_PRECISION (ptr_type_node) >= 32)
- || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
- && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION 
(sizetype)))
+ || (TYPE_OVERFLOW_UNDEFINED (reduction_var_type)
+ && TYPE_PRECISION (reduction_var_type) <= TYPE_PRECISION 
(sizetype)))
  && builtin_decl_implicit (BUILT_IN_STRLEN))
generate_strlen_builtin (loop, reduction_var, load_iv.base,
 reduction_iv.base, loc);
@@ -3689,8 +3690,8 @@ loop_distribution::transform_reduction_loop (loop_p loop)
   != CODE_FOR_nothing
   && ((TYPE_PRECISION (ptrdiff_type_node) == TYPE_PRECISION 
(ptr_type_node)
&& TYPE_PRECISION (ptrdiff_type_node) >= 32)
-  || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
-  && reduction_var_overflows_first (reduction_var, 
load_type
+  || (TYPE_OVERFLOW_UNDEFINED (reduction_var_type)
+  && reduction_var_overflows_first (reduction_var_type, 
load_type
generate_strlen_builtin_using_rawmemchr (loop, reduction_var,
 load_iv.base,
 load_type,
-- 
2.31.1



Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-10-11 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Fri, Sep 17, 2021 at 10:08:27AM +0200, Richard Biener wrote:
> On Mon, Sep 13, 2021 at 4:53 PM Stefan Schulze Frielinghaus
>  wrote:
> >
> > On Mon, Sep 06, 2021 at 11:56:21AM +0200, Richard Biener wrote:
> > > On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus
> > >  wrote:
> > > >
> > > > On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
> > > > [...]
> > > > > > >
> > > > > > > +  /* Handle strlen like loops.  */
> > > > > > > +  if (store_dr == NULL
> > > > > > > +  && integer_zerop (pattern)
> > > > > > > +  && TREE_CODE (reduction_iv.base) == INTEGER_CST
> > > > > > > +  && TREE_CODE (reduction_iv.step) == INTEGER_CST
> > > > > > > +  && integer_onep (reduction_iv.step)
> > > > > > > +  && (types_compatible_p (TREE_TYPE (reduction_var), 
> > > > > > > size_type_node)
> > > > > > > + || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var
> > > > > > > +{
> > > > > > >
> > > > > > > I wonder what goes wrong with a larger or smaller wrapping IV 
> > > > > > > type?
> > > > > > > The iteration
> > > > > > > only stops when you load a NUL and the increments just wrap along 
> > > > > > > (you're
> > > > > > > using the pointer IVs to compute the strlen result).  Can't you 
> > > > > > > simply truncate?
> > > > > >
> > > > > > I think truncation is enough as long as no overflow occurs in 
> > > > > > strlen or
> > > > > > strlen_using_rawmemchr.
> > > > > >
> > > > > > > For larger than size_type_node (actually larger than 
> > > > > > > ptr_type_node would matter
> > > > > > > I guess), the argument is that since pointer wrapping would be 
> > > > > > > undefined anyway
> > > > > > > the IV cannot wrap either.  Now, the correct check here would 
> > > > > > > IMHO be
> > > > > > >
> > > > > > >   TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
> > > > > > > (ptr_type_node)
> > > > > > >|| TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> > > > > > >
> > > > > > > ?
> > > > > >
> > > > > > Regarding the implementation which makes use of rawmemchr:
> > > > > >
> > > > > > We can count at most PTRDIFF_MAX many bytes without an overflow.  
> > > > > > Thus,
> > > > > > the maximal length we can determine of a string where each 
> > > > > > character has
> > > > > > size S is PTRDIFF_MAX / S without an overflow.  Since an overflow 
> > > > > > for
> > > > > > ptrdiff type is undefined we have to make sure that if an overflow
> > > > > > occurs, then an overflow occurs for reduction variable, too, and 
> > > > > > that
> > > > > > this is undefined, too.  However, I'm not sure anymore whether we 
> > > > > > want
> > > > > > to respect overflows in all cases.  If TYPE_PRECISION 
> > > > > > (ptr_type_node)
> > > > > > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, 
> > > > > > then
> > > > > > this would mean that a single string consumes more than half of the
> > > > > > virtual addressable memory.  At least for architectures where
> > > > > > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is 
> > > > > > reasonable
> > > > > > to neglect the case where computing pointer difference may overflow.
> > > > > > Otherwise we are talking about strings with lenghts of multiple
> > > > > > pebibytes.  For other architectures we might have to be more precise
> > > > > > and make sure that reduction variable overflows first and that this 
> > > > > > is
> > > > > > undefined.
> > > > > >
> > > > > > Thus a conservative condition would be (I assumed that the size of 
> > > > > > any
> > > > > > integral type is a power of two which I'm not sure if this really 
> > > > > > holds;
> > > > > > IIRC the C standard requires only that the alignment is a power of 
> > > > > > two
> > > > > > but not necessarily the size so I might need to change this):
> > > > > >
> > > > > > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 
> > > > > > 1 - log2 (sizeof (load_type))
> > > > > >or in other words return true if reduction variable overflows 
> > > > > > first
> > > > > >and false otherwise.  */
> > > > > >
> > > > > > static bool
> > > > > > reduction_var_overflows_first (tree reduction_var, tree load_type)
> > > > > > {
> > > > > >   unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> > > > > >   unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE 
> > > > > > (reduction_var));
> > > > > >   unsigned size_exponent = wi::exact_log2 (wi::to_wide 
> > > > > > (TYPE_SIZE_UNIT (load_type)));
> > > > > >   return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 
> > > > > > - size_exponent);
> > > > > > }
> > > > > >
> > > > > > TYPE_PRECISION (ptrdiff_type_node) == 64
> > > > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > > > > && reduction_var_overflows_first (reduction_var, load_type)
> > > > > >
> > > > > > Regarding the implementation which makes use of strlen:
> > > > > >
> > > > > > I'm not sure what it means if strlen is called 

Re: [PATCH] regcprop: Determine subreg offset depending on endianness [PR101260]

2021-10-11 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Mon, Oct 11, 2021 at 09:38:36AM +0200, Richard Biener wrote:
> On Fri, Oct 8, 2021 at 1:31 PM Stefan Schulze Frielinghaus via
> Gcc-patches  wrote:
> >
> > gcc/ChangeLog:
> >
> > * regcprop.c (maybe_mode_change): Determine offset relative to
> > high or low part depending on endianness.
> >
> > Bootstrapped and regtested on IBM Z. Ok for mainline and gcc-{11,10,9}?
> 
> Is there a testcase to add?

I've updated the patch and added the testcase from the PR.

> 
> > ---
> >  gcc/regcprop.c | 11 ---
> >  1 file changed, 8 insertions(+), 3 deletions(-)
> >
> > diff --git a/gcc/regcprop.c b/gcc/regcprop.c
> > index d2a01130fe1..0e1ac12458a 100644
> > --- a/gcc/regcprop.c
> > +++ b/gcc/regcprop.c
> > @@ -414,9 +414,14 @@ maybe_mode_change (machine_mode orig_mode, 
> > machine_mode copy_mode,
> > copy_nregs, _per_reg))
> > return NULL_RTX;
> >poly_uint64 copy_offset = bytes_per_reg * (copy_nregs - use_nregs);
> > -  poly_uint64 offset
> > -   = subreg_size_lowpart_offset (GET_MODE_SIZE (new_mode) + 
> > copy_offset,
> > - GET_MODE_SIZE (orig_mode));
> > +  poly_uint64 offset =
> > +#if WORDS_BIG_ENDIAN
> > +   subreg_size_highpart_offset
> > +#else
> > +   subreg_size_lowpart_offset
> > +#endif
> > +   (GET_MODE_SIZE (new_mode) + copy_offset,
> > +GET_MODE_SIZE (orig_mode));
> >regno += subreg_regno_offset (regno, orig_mode, offset, new_mode);
> >if (targetm.hard_regno_mode_ok (regno, new_mode))
> > return gen_raw_REG (new_mode, regno);
> > --
> > 2.31.1
> >
From 299959788321e21c27f0d4a6d437a586c5f6c92e Mon Sep 17 00:00:00 2001
From: Stefan Schulze Frielinghaus 
Date: Mon, 4 Oct 2021 09:36:21 +0200
Subject: [PATCH] regcprop: Determine subreg offset depending on endianness
 [PR101260]

gcc/ChangeLog:

* regcprop.c (maybe_mode_change): Determine offset relative to
high or low part depending on endianness.

gcc/testsuite/ChangeLog:

* gcc.dg/pr101260.c: New test.
---
 gcc/regcprop.c  | 11 ++--
 gcc/testsuite/gcc.dg/pr101260.c | 49 +
 2 files changed, 57 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr101260.c

diff --git a/gcc/regcprop.c b/gcc/regcprop.c
index d2a01130fe1..0e1ac12458a 100644
--- a/gcc/regcprop.c
+++ b/gcc/regcprop.c
@@ -414,9 +414,14 @@ maybe_mode_change (machine_mode orig_mode, machine_mode 
copy_mode,
copy_nregs, &bytes_per_reg))
return NULL_RTX;
   poly_uint64 copy_offset = bytes_per_reg * (copy_nregs - use_nregs);
-  poly_uint64 offset
-   = subreg_size_lowpart_offset (GET_MODE_SIZE (new_mode) + copy_offset,
- GET_MODE_SIZE (orig_mode));
+  poly_uint64 offset =
+#if WORDS_BIG_ENDIAN
+   subreg_size_highpart_offset
+#else
+   subreg_size_lowpart_offset
+#endif
+   (GET_MODE_SIZE (new_mode) + copy_offset,
+GET_MODE_SIZE (orig_mode));
   regno += subreg_regno_offset (regno, orig_mode, offset, new_mode);
   if (targetm.hard_regno_mode_ok (regno, new_mode))
return gen_raw_REG (new_mode, regno);
diff --git a/gcc/testsuite/gcc.dg/pr101260.c b/gcc/testsuite/gcc.dg/pr101260.c
new file mode 100644
index 000..0e9ec4e203a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr101260.c
@@ -0,0 +1,49 @@
+/* PR rtl-optimization/101260 */
+/* { dg-do run } */
+/* { dg-options -O1 } */
+struct a {
+  unsigned b : 7;
+  int c;
+  int d;
+  short e;
+} p, *q = &p;
+int f, g, h, i, r, s;
+static short j[8][1][6] = {0};
+char k[7];
+short l, m;
+int *n;
+int **o = &n;
+void t() {
+  for (; f;)
+;
+}
+static struct a u(int x) {
+  struct a a = {4, 8, 5, 4};
+  for (; i <= 6; i++) {
+struct a v = {0};
+for (; l; l++)
+  h = 0;
+for (; h >= 0; h--) {
+  struct a *w;
+  j[i];
+  w = &v;
+  s = 0;
+  for (; s < 3; s++) {
+r ^= x;
+m = j[i][g][h] == (k[g] = g);
+*w = v;
+  }
+  r = 2;
+  for (; r; r--)
+*o = &h;
+}
+  }
+  t();
+  return a;
+}
+int main() {
+  *q = u(636);
+  if (p.b != 4)
+__builtin_abort ();
+  return 0;
+}
-- 
2.31.1



Re: [PATCH] IBM Z: Provide rawmemchr{qi,hi,si} expander

2021-10-08 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Thu, Oct 07, 2021 at 11:16:24AM +0200, Andreas Krebbel wrote:
> On 9/20/21 11:24, Stefan Schulze Frielinghaus wrote:
> > This patch implements the rawmemchr expander as introduced in
> > https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579649.html
> > 
> > Bootstrapped and regtested in conjunction with the patch from above on
> > IBM Z.  Ok for mainline?
> > 
> 
> > From 551362cda54048dc1a51588112f11c070ed52020 Mon Sep 17 00:00:00 2001
> > From: Stefan Schulze Frielinghaus 
> > Date: Mon, 8 Feb 2021 10:35:39 +0100
> > Subject: [PATCH 2/2] IBM Z: Provide rawmemchr{qi,hi,si} expander
> >
> > gcc/ChangeLog:
> >
> > * config/s390/s390-protos.h (s390_rawmemchrqi): Add prototype.
> > (s390_rawmemchrhi): Add prototype.
> > (s390_rawmemchrsi): Add prototype.
> > * config/s390/s390.c (s390_rawmemchr): New function.
> > (s390_rawmemchrqi): New function.
> > (s390_rawmemchrhi): New function.
> > (s390_rawmemchrsi): New function.
> > * config/s390/s390.md (rawmemchr): New expander.
> > (rawmemchr): New expander.
> > * config/s390/vector.md (vec_vfees<mode>): Basically a copy of
> > the pattern vfees<mode> from vx-builtins.md.
> > * config/s390/vx-builtins.md (*vfees<mode>): Remove.
> 
> Thanks! Would it make sense to also extend the strlen and movstr expanders
> we have to support the additional character modes?

For strlen-like loops over non-character arrays, the current
implementation in the loop distribution pass uses rawmemchr and computes
the pointer difference in order to derive the length.  Thus we get
strlen for free and don't need to reimplement it.
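
For example, a strlen-like loop over 16-bit elements (a sketch) is
recognized and rewritten roughly as follows:

  unsigned int
  len (unsigned short *s)
  {
    unsigned int i;
    for (i = 0; s[i]; ++i)
      ;
    return i;  /* becomes: rawmemchr scan, then pointer difference */
  }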

> 
> A few style comments below.
> 
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/s390/rawmemchr-1.c: New test.
> > ---
> >  gcc/config/s390/s390-protos.h   |  4 +
> >  gcc/config/s390/s390.c  | 89 ++
> >  gcc/config/s390/s390.md | 20 +
> >  gcc/config/s390/vector.md   | 26 ++
> >  gcc/config/s390/vx-builtins.md  | 26 --
> >  gcc/testsuite/gcc.target/s390/rawmemchr-1.c | 99 +
> >  6 files changed, 238 insertions(+), 26 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/s390/rawmemchr-1.c
> >
> > diff --git a/gcc/config/s390/s390-protos.h b/gcc/config/s390/s390-protos.h
> > index 4b03c6e99f5..0d9619e8254 100644
> > --- a/gcc/config/s390/s390-protos.h
> > +++ b/gcc/config/s390/s390-protos.h
> > @@ -66,6 +66,10 @@ s390_asm_declare_function_size (FILE *asm_out_file,
> > const char *fnname ATTRIBUTE_UNUSED, tree decl);
> >  #endif
> >
> > +extern void s390_rawmemchrqi(rtx dst, rtx src, rtx pat);
> > +extern void s390_rawmemchrhi(rtx dst, rtx src, rtx pat);
> > +extern void s390_rawmemchrsi(rtx dst, rtx src, rtx pat);
> > +
> >  #ifdef RTX_CODE
> >  extern int s390_extra_constraint_str (rtx, int, const char *);
> >  extern int s390_const_ok_for_constraint_p (HOST_WIDE_INT, int, const char 
> > *);
> > diff --git a/gcc/config/s390/s390.c b/gcc/config/s390/s390.c
> > index 54dd6332c3a..1435ce156e2 100644
> > --- a/gcc/config/s390/s390.c
> > +++ b/gcc/config/s390/s390.c
> > @@ -16559,6 +16559,95 @@ s390_excess_precision (enum excess_precision_type 
> > type)
> >  }
> >  #endif
> >
> > +template <machine_mode vec_mode,
> > + machine_mode elt_mode,
> > + rtx (*gen_vec_vfees) (rtx, rtx, rtx, rtx)>
> > +static void
> > +s390_rawmemchr(rtx dst, rtx src, rtx pat) {
> 
> I think it would be a bit easier to turn the vec_vfees expander into a
> 'parameterized name' and add the mode as parameter.  I'll attach a patch
> to illustrate how this might look like.

Right, didn't know about parameterized names which looks more clean to
me.  Thanks for the hint!

> 
> > +  rtx lens = gen_reg_rtx (V16QImode);
> > +  rtx pattern = gen_reg_rtx (vec_mode);
> > +  rtx loop_start = gen_label_rtx ();
> > +  rtx loop_end = gen_label_rtx ();
> > +  rtx addr = gen_reg_rtx (Pmode);
> > +  rtx offset = gen_reg_rtx (Pmode);
> > +  rtx tmp = gen_reg_rtx (Pmode);
> > +  rtx loadlen = gen_reg_rtx (SImode);
> > +  rtx matchlen = gen_reg_rtx (SImode);
> > +  rtx mem;
> > +
> > +  pat = GEN_INT (trunc_int_for_mode (INTVAL (pat), elt_mode));
> > +  emit_insn (gen_rtx_SET (pattern, gen_rtx_VEC_DUPLICATE (vec_mode, pat)));
> > +
> > +  emit_move_insn (addr, XEXP (src, 0));
> > +
> > +  // alignment
> > +  emit_insn (gen_vlbb (lens, gen_rtx_MEM (BLKmode, addr), GEN_INT (6)));
> > +  emit_insn (gen_lcbb (loadlen, addr, GEN_INT (6)));
> > +  lens = convert_to_mode (vec_mode, lens, 1);
> > +  emit_insn (gen_vec_vfees (lens, lens, pattern, GEN_INT (0)));
> > +  lens = convert_to_mode (V4SImode, lens, 1);
> > +  emit_insn (gen_vec_extractv4sisi (matchlen, lens, GEN_INT (1)));
> > +  lens = convert_to_mode (vec_mode, lens, 1);
> 
> That back and forth NOP conversion stuff is ugly but I couldn't find a
> more elegant way to write this without generating worse code.  Of
> course we want to benefit here from the fact that 

[PATCH] regcprop: Determine subreg offset depending on endianness [PR101260]

2021-10-08 Thread Stefan Schulze Frielinghaus via Gcc-patches
gcc/ChangeLog:

* regcprop.c (maybe_mode_change): Determine offset relative to
high or low part depending on endianness.

Bootstrapped and regtested on IBM Z. Ok for mainline and gcc-{11,10,9}?
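
A worked example of the offset computation (a sketch; orig_mode = DImode,
8 bytes, new_mode = SImode, 4 bytes, copy_offset = 0):

  subreg_size_lowpart_offset  (4, 8)  /* 4 on big-endian, 0 on little-endian */
  subreg_size_highpart_offset (4, 8)  /* 0 on big-endian, 4 on little-endian */

On WORDS_BIG_ENDIAN targets such as IBM Z the patch therefore queries the
high part in order to address the register part actually holding the
copied value.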

---
 gcc/regcprop.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/gcc/regcprop.c b/gcc/regcprop.c
index d2a01130fe1..0e1ac12458a 100644
--- a/gcc/regcprop.c
+++ b/gcc/regcprop.c
@@ -414,9 +414,14 @@ maybe_mode_change (machine_mode orig_mode, machine_mode 
copy_mode,
copy_nregs, &bytes_per_reg))
return NULL_RTX;
   poly_uint64 copy_offset = bytes_per_reg * (copy_nregs - use_nregs);
-  poly_uint64 offset
-   = subreg_size_lowpart_offset (GET_MODE_SIZE (new_mode) + copy_offset,
- GET_MODE_SIZE (orig_mode));
+  poly_uint64 offset =
+#if WORDS_BIG_ENDIAN
+   subreg_size_highpart_offset
+#else
+   subreg_size_lowpart_offset
+#endif
+   (GET_MODE_SIZE (new_mode) + copy_offset,
+GET_MODE_SIZE (orig_mode));
   regno += subreg_regno_offset (regno, orig_mode, offset, new_mode);
   if (targetm.hard_regno_mode_ok (regno, new_mode))
return gen_raw_REG (new_mode, regno);
-- 
2.31.1



[PATCH] IBM Z: Provide rawmemchr{qi,hi,si} expander

2021-09-20 Thread Stefan Schulze Frielinghaus via Gcc-patches
This patch implements the rawmemchr expander as introduced in
https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579649.html

Bootstrapped and regtested in conjunction with the patch from above on
IBM Z.  Ok for mainline?
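
For reference, rawmemchr has memchr semantics without a length bound: it
returns a pointer to the first element equal to the pattern, which is
assumed to occur.  A C sketch for 16-bit elements:

  unsigned short *
  rawmemchr_hi (const unsigned short *s, unsigned short c)
  {
    while (*s != c)
      s++;
    return (unsigned short *) s;
  }
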
From 551362cda54048dc1a51588112f11c070ed52020 Mon Sep 17 00:00:00 2001
From: Stefan Schulze Frielinghaus 
Date: Mon, 8 Feb 2021 10:35:39 +0100
Subject: [PATCH 2/2] IBM Z: Provide rawmemchr{qi,hi,si} expander

gcc/ChangeLog:

* config/s390/s390-protos.h (s390_rawmemchrqi): Add prototype.
(s390_rawmemchrhi): Add prototype.
(s390_rawmemchrsi): Add prototype.
* config/s390/s390.c (s390_rawmemchr): New function.
(s390_rawmemchrqi): New function.
(s390_rawmemchrhi): New function.
(s390_rawmemchrsi): New function.
* config/s390/s390.md (rawmemchr): New expander.
(rawmemchr): New expander.
* config/s390/vector.md (vec_vfees<mode>): Basically a copy of
the pattern vfees<mode> from vx-builtins.md.
* config/s390/vx-builtins.md (*vfees<mode>): Remove.

gcc/testsuite/ChangeLog:

* gcc.target/s390/rawmemchr-1.c: New test.
---
 gcc/config/s390/s390-protos.h   |  4 +
 gcc/config/s390/s390.c  | 89 ++
 gcc/config/s390/s390.md | 20 +
 gcc/config/s390/vector.md   | 26 ++
 gcc/config/s390/vx-builtins.md  | 26 --
 gcc/testsuite/gcc.target/s390/rawmemchr-1.c | 99 +
 6 files changed, 238 insertions(+), 26 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/s390/rawmemchr-1.c

diff --git a/gcc/config/s390/s390-protos.h b/gcc/config/s390/s390-protos.h
index 4b03c6e99f5..0d9619e8254 100644
--- a/gcc/config/s390/s390-protos.h
+++ b/gcc/config/s390/s390-protos.h
@@ -66,6 +66,10 @@ s390_asm_declare_function_size (FILE *asm_out_file,
const char *fnname ATTRIBUTE_UNUSED, tree decl);
 #endif
 
+extern void s390_rawmemchrqi(rtx dst, rtx src, rtx pat);
+extern void s390_rawmemchrhi(rtx dst, rtx src, rtx pat);
+extern void s390_rawmemchrsi(rtx dst, rtx src, rtx pat);
+
 #ifdef RTX_CODE
 extern int s390_extra_constraint_str (rtx, int, const char *);
 extern int s390_const_ok_for_constraint_p (HOST_WIDE_INT, int, const char *);
diff --git a/gcc/config/s390/s390.c b/gcc/config/s390/s390.c
index 54dd6332c3a..1435ce156e2 100644
--- a/gcc/config/s390/s390.c
+++ b/gcc/config/s390/s390.c
@@ -16559,6 +16559,95 @@ s390_excess_precision (enum excess_precision_type type)
 }
 #endif
 
+template <machine_mode vec_mode,
+ machine_mode elt_mode,
+ rtx (*gen_vec_vfees) (rtx, rtx, rtx, rtx)>
+static void
+s390_rawmemchr(rtx dst, rtx src, rtx pat) {
+  rtx lens = gen_reg_rtx (V16QImode);
+  rtx pattern = gen_reg_rtx (vec_mode);
+  rtx loop_start = gen_label_rtx ();
+  rtx loop_end = gen_label_rtx ();
+  rtx addr = gen_reg_rtx (Pmode);
+  rtx offset = gen_reg_rtx (Pmode);
+  rtx tmp = gen_reg_rtx (Pmode);
+  rtx loadlen = gen_reg_rtx (SImode);
+  rtx matchlen = gen_reg_rtx (SImode);
+  rtx mem;
+
+  pat = GEN_INT (trunc_int_for_mode (INTVAL (pat), elt_mode));
+  emit_insn (gen_rtx_SET (pattern, gen_rtx_VEC_DUPLICATE (vec_mode, pat)));
+
+  emit_move_insn (addr, XEXP (src, 0));
+
+  // alignment
+  emit_insn (gen_vlbb (lens, gen_rtx_MEM (BLKmode, addr), GEN_INT (6)));
+  emit_insn (gen_lcbb (loadlen, addr, GEN_INT (6)));
+  lens = convert_to_mode (vec_mode, lens, 1);
+  emit_insn (gen_vec_vfees (lens, lens, pattern, GEN_INT (0)));
+  lens = convert_to_mode (V4SImode, lens, 1);
+  emit_insn (gen_vec_extractv4sisi (matchlen, lens, GEN_INT (1)));
+  lens = convert_to_mode (vec_mode, lens, 1);
+  emit_cmp_and_jump_insns (matchlen, loadlen, LT, NULL_RTX, SImode, 1, 
loop_end);
+  force_expand_binop (Pmode, and_optab, addr, GEN_INT (15), tmp, 1, 
OPTAB_DIRECT);
+  force_expand_binop (Pmode, sub_optab, GEN_INT (16), tmp, tmp, 1, 
OPTAB_DIRECT);
+  force_expand_binop (Pmode, add_optab, addr, tmp, addr, 1, OPTAB_DIRECT);
+  // now, addr is 16-byte aligned
+
+  mem = gen_rtx_MEM (vec_mode, addr);
+  set_mem_align (mem, 128);
+  emit_move_insn (lens, mem);
+  emit_insn (gen_vec_vfees (lens, lens, pattern, GEN_INT (VSTRING_FLAG_CS)));
+  add_int_reg_note (s390_emit_ccraw_jump (4, EQ, loop_end),
+   REG_BR_PROB,
+   profile_probability::very_unlikely ().to_reg_br_prob_note ());
+
+  emit_label (loop_start);
+  LABEL_NUSES (loop_start) = 1;
+
+  force_expand_binop (Pmode, add_optab, addr, GEN_INT (16), addr, 1, OPTAB_DIRECT);
+  mem = gen_rtx_MEM (vec_mode, addr);
+  set_mem_align (mem, 128);
+  emit_move_insn (lens, mem);
+  emit_insn (gen_vec_vfees (lens, lens, pattern, GEN_INT (VSTRING_FLAG_CS)));
+  add_int_reg_note (s390_emit_ccraw_jump (4, NE, loop_start),
+   REG_BR_PROB,
+   profile_probability::very_likely ().to_reg_br_prob_note ());
+
+  emit_label (loop_end);
+  LABEL_NUSES (loop_end) = 1;
+
+  if (TARGET_64BIT)
+{
+  lens = convert_to_mode (V2DImode, lens, 1);
+  emit_insn (gen_vec_extractv2didi (offset, 

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-09-13 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Mon, Sep 06, 2021 at 11:56:21AM +0200, Richard Biener wrote:
> On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus
>  wrote:
> >
> > On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
> > [...]
> > > > >
> > > > > +  /* Handle strlen like loops.  */
> > > > > +  if (store_dr == NULL
> > > > > +  && integer_zerop (pattern)
> > > > > +  && TREE_CODE (reduction_iv.base) == INTEGER_CST
> > > > > +  && TREE_CODE (reduction_iv.step) == INTEGER_CST
> > > > > +  && integer_onep (reduction_iv.step)
> > > > > +  && (types_compatible_p (TREE_TYPE (reduction_var), 
> > > > > size_type_node)
> > > > > + || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var
> > > > > +{
> > > > >
> > > > > I wonder what goes wrong with a larger or smaller wrapping IV type?
> > > > > The iteration
> > > > > only stops when you load a NUL and the increments just wrap along 
> > > > > (you're
> > > > > using the pointer IVs to compute the strlen result).  Can't you 
> > > > > simply truncate?
> > > >
> > > > I think truncation is enough as long as no overflow occurs in strlen or
> > > > strlen_using_rawmemchr.
> > > >
> > > > > For larger than size_type_node (actually larger than ptr_type_node 
> > > > > would matter
> > > > > I guess), the argument is that since pointer wrapping would be 
> > > > > undefined anyway
> > > > > the IV cannot wrap either.  Now, the correct check here would IMHO be
> > > > >
> > > > >   TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
> > > > > (ptr_type_node)
> > > > >|| TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> > > > >
> > > > > ?
> > > >
> > > > Regarding the implementation which makes use of rawmemchr:
> > > >
> > > > We can count at most PTRDIFF_MAX many bytes without an overflow.  Thus,
> > > > the maximal length we can determine of a string where each character has
> > > > size S is PTRDIFF_MAX / S without an overflow.  Since an overflow for
> > > > ptrdiff type is undefined we have to make sure that if an overflow
> > > > occurs, then an overflow occurs for reduction variable, too, and that
> > > > this is undefined, too.  However, I'm not sure anymore whether we want
> > > > to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
> > > > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
> > > > this would mean that a single string consumes more than half of the
> > > > virtual addressable memory.  At least for architectures where
> > > > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable
> > > > to neglect the case where computing pointer difference may overflow.
> > > > Otherwise we are talking about strings with lengths of multiple
> > > > pebibytes.  For other architectures we might have to be more precise
> > > > and make sure that reduction variable overflows first and that this is
> > > > undefined.
> > > >
> > > > Thus a conservative condition would be (I assumed that the size of any
> > > > integral type is a power of two which I'm not sure if this really holds;
> > > > IIRC the C standard requires only that the alignment is a power of two
> > > > but not necessarily the size so I might need to change this):
> > > >
> > > > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - 
> > > > log2 (sizeof (load_type))
> > > >or in other words return true if reduction variable overflows first
> > > >and false otherwise.  */
> > > >
> > > > static bool
> > > > reduction_var_overflows_first (tree reduction_var, tree load_type)
> > > > {
> > > >   unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> > > >   unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE 
> > > > (reduction_var));
> > > >   unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT 
> > > > (load_type)));
> > > >   return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - 
> > > > size_exponent);
> > > > }
> > > >
> > > > TYPE_PRECISION (ptrdiff_type_node) == 64
> > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > > && reduction_var_overflows_first (reduction_var, load_type)
> > > >
> > > > Regarding the implementation which makes use of strlen:
> > > >
> > > > I'm not sure what it means if strlen is called for a string with a
> > > > length greater than SIZE_MAX.  Therefore, similar to the implementation
> > > > using rawmemchr where we neglect the case of an overflow for 64bit
> > > > architectures, a conservative condition would be:
> > > >
> > > > TYPE_PRECISION (size_type_node) == 64
> > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > > && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION 
> > > > (size_type_node))
> > > >
> > > > I still included the overflow undefined check for reduction variable in
> > > > order to rule out situations where the reduction variable is unsigned
> > > > and overflows as many times until strlen(,_using_rawmemchr) overflows,
> > > > too.  
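
To make the reduction_var_overflows_first guard above concrete, here is
a small standalone model (plain C, not GCC internals; the names are
mine) that evaluates the condition numerically:

  #include <assert.h>
  #include <stdbool.h>

  /* Integer-only model of reduction_var_overflows_first: true iff the
     reduction variable overflows before the pointer difference can.  */
  static bool
  overflows_first (unsigned prec_reduction_var, unsigned prec_ptrdiff,
                   unsigned sizeof_load_type)
  {
    unsigned size_exponent = __builtin_ctz (sizeof_load_type); /* log2 */
    return prec_reduction_var < prec_ptrdiff - 1 - size_exponent;
  }

  int
  main (void)
  {
    /* 32-bit ptrdiff_t and 2-byte elements: at most 2^30 elements fit
       in PTRDIFF_MAX bytes, so a 16-bit counter wraps first ...  */
    assert (overflows_first (16, 32, 2));
    /* ... whereas a 32-bit wrapping counter does not.  */
    assert (!overflows_first (32, 32, 2));
    return 0;
  }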

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-09-03 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
[...]
> > >
> > > +  /* Handle strlen like loops.  */
> > > +  if (store_dr == NULL
> > > +  && integer_zerop (pattern)
> > > +  && TREE_CODE (reduction_iv.base) == INTEGER_CST
> > > +  && TREE_CODE (reduction_iv.step) == INTEGER_CST
> > > +  && integer_onep (reduction_iv.step)
> > > +  && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
> > > + || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var
> > > +{
> > >
> > > I wonder what goes wrong with a larger or smaller wrapping IV type?
> > > The iteration
> > > only stops when you load a NUL and the increments just wrap along (you're
> > > using the pointer IVs to compute the strlen result).  Can't you simply 
> > > truncate?
> >
> > I think truncation is enough as long as no overflow occurs in strlen or
> > strlen_using_rawmemchr.
> >
> > > For larger than size_type_node (actually larger than ptr_type_node would 
> > > matter
> > > I guess), the argument is that since pointer wrapping would be undefined 
> > > anyway
> > > the IV cannot wrap either.  Now, the correct check here would IMHO be
> > >
> > >   TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
> > > (ptr_type_node)
> > >|| TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> > >
> > > ?
> >
> > Regarding the implementation which makes use of rawmemchr:
> >
> > We can count at most PTRDIFF_MAX many bytes without an overflow.  Thus,
> > the maximal length we can determine of a string where each character has
> > size S is PTRDIFF_MAX / S without an overflow.  Since an overflow for
> > ptrdiff type is undefined we have to make sure that if an overflow
> > occurs, then an overflow occurs for reduction variable, too, and that
> > this is undefined, too.  However, I'm not sure anymore whether we want
> > to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
> > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
> > this would mean that a single string consumes more than half of the
> > virtual addressable memory.  At least for architectures where
> > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable
> > to neglect the case where computing pointer difference may overflow.
> > Otherwise we are talking about strings with lengths of multiple
> > pebibytes.  For other architectures we might have to be more precise
> > and make sure that reduction variable overflows first and that this is
> > undefined.
> >
> > Thus a conservative condition would be (I assumed that the size of any
> > integral type is a power of two which I'm not sure if this really holds;
> > IIRC the C standard requires only that the alignment is a power of two
> > but not necessarily the size so I might need to change this):
> >
> > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - log2 
> > (sizeof (load_type))
> >or in other words return true if reduction variable overflows first
> >and false otherwise.  */
> >
> > static bool
> > reduction_var_overflows_first (tree reduction_var, tree load_type)
> > {
> >   unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> >   unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE 
> > (reduction_var));
> >   unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT 
> > (load_type)));
> >   return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - 
> > size_exponent);
> > }
> >
> > TYPE_PRECISION (ptrdiff_type_node) == 64
> > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > && reduction_var_overflows_first (reduction_var, load_type)
> >
> > Regarding the implementation which makes use of strlen:
> >
> > I'm not sure what it means if strlen is called for a string with a
> > length greater than SIZE_MAX.  Therefore, similar to the implementation
> > using rawmemchr where we neglect the case of an overflow for 64bit
> > architectures, a conservative condition would be:
> >
> > TYPE_PRECISION (size_type_node) == 64
> > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node))
> >
> > I still included the overflow undefined check for reduction variable in
> > order to rule out situations where the reduction variable is unsigned
> > and overflows as many times until strlen(,_using_rawmemchr) overflows,
> > too.  Maybe this is all theoretical nonsense but I'm afraid of uncommon
> > architectures.  Anyhow, while writing this down it becomes clear that
> > this deserves a comment which I will add once it becomes clear which way
> > to go.
> 
> I think all the arguments about objects bigger than half of the address-space
> also are valid for 32bit targets and thus 32bit size_type_node (or
> 32bit pointer size).
> I'm not actually sure what's the canonical type to check against, whether
> it's size_type_node (C's size_t), ptr_type_node (C's void *) or 

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-08-06 Thread Stefan Schulze Frielinghaus via Gcc-patches
ping

On Fri, Jun 25, 2021 at 12:23:32PM +0200, Stefan Schulze Frielinghaus wrote:
> On Wed, Jun 16, 2021 at 04:22:35PM +0200, Richard Biener wrote:
> > On Mon, Jun 14, 2021 at 7:26 PM Stefan Schulze Frielinghaus
> >  wrote:
> > >
> > > On Thu, May 20, 2021 at 08:37:24PM +0200, Stefan Schulze Frielinghaus 
> > > wrote:
> > > [...]
> > > > > but we won't ever arrive here because of the niters condition.  But
> > > > > yes, doing the pattern matching in the innermost loop processing code
> > > > > looks good to me - for the specific case it would be
> > > > >
> > > > >   /* Don't distribute loop if niters is unknown.  */
> > > > >   tree niters = number_of_latch_executions (loop);
> > > > >   if (niters == NULL_TREE || niters == chrec_dont_know)
> > > > > ---> here?
> > > > > continue;
> > > >
> > > > Right, please find attached a new version of the patch where everything
> > > > is included in the loop distribution pass.  I will do a bootstrap and
> > > > regtest on IBM Z over night.  If you give me green light I will also do
> > > > the same on x86_64.
> > >
> > > Meanwhile I gave it a shot on x86_64 where the testsuite runs fine (at
> > > least the ldist-strlen testcase).  If you are Ok with the patch, then I
> > > would rebase and run the testsuites again and post a patch series
> > > including the rawmemchr implementation for IBM Z.
> > 
> > @@ -3257,6 +3261,464 @@ find_seed_stmts_for_distribution (class loop
> > *loop, vec *work_list)
> >return work_list->length () > 0;
> >  }
> > 
> > +static void
> > +generate_rawmemchr_builtin (loop_p loop, tree reduction_var,
> > +   data_reference_p store_dr, tree base, tree 
> > pattern,
> > +   location_t loc)
> > +{
> > 
> > this new function needs a comment.  Applies to all of the new ones, btw.
> 
> Done.
> 
> > +  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base))
> > +  && TREE_TYPE (TREE_TYPE (base)) == TREE_TYPE 
> > (pattern));
> > 
> > this looks fragile and is probably unnecessary as well.
> > 
> > +  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (base));
> > 
> > in general you want types_compatible_p () checks which for pointers means
> > all pointers are compatible ...
> 
> True, I removed both asserts.
> 
> > (skipping stuff)
> > 
> > @@ -3321,10 +3783,20 @@ loop_distribution::execute (function *fun)
> >   && !optimize_loop_for_speed_p (loop)))
> > continue;
> > 
> > -  /* Don't distribute loop if niters is unknown.  */
> > +  /* If niters is unknown don't distribute loop but rather try to 
> > transform
> > +it to a call to a builtin.  */
> >tree niters = number_of_latch_executions (loop);
> >if (niters == NULL_TREE || niters == chrec_dont_know)
> > -   continue;
> > +   {
> > + if (transform_reduction_loop (loop))
> > +   {
> > + changed = true;
> > + loops_to_be_destroyed.safe_push (loop);
> > + if (dump_file)
> > +   fprintf (dump_file, "Loop %d transformed into a
> > builtin.\n", loop->num);
> > +   }
> > + continue;
> > +   }
> > 
> > please look at
> > 
> >   if (nb_generated_loops + nb_generated_calls > 0)
> > {
> >   changed = true;
> >   if (dump_enabled_p ())
> > dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
> >  loc, "Loop%s %d distributed: split to
> > %d loops "
> >  "and %d library calls.\n", str, loop->num,
> >  nb_generated_loops, nb_generated_calls);
> > 
> > and follow the use of dump_* and MSG_OPTIMIZED_LOCATIONS so the
> > transforms are reported with -fopt-info-loop
> 
> Done.
> 
> > +
> > +  return transform_reduction_loop_1 (loop, load_dr, store_dr, 
> > reduction_var);
> > +}
> > 
> > what's the point in tail-calling here and visually splitting the
> > function in half?
> 
> In the first place I thought that this is more pleasant since in
> transform_reduction_loop_1 it is settled that we have a single load,
> store, and reduction variable.  After refactoring this isn't true
> anymore and I inlined the function and made this clear via a comment.
> 
> > 
> > (sorry for picking random pieces now ;))
> > 
> > +  for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
> > +  gsi_next (), ++ninsns)
> > +   {
> > 
> > this counts debug insns, I guess you want gsi_next_nondebug at least.
> > not sure why you are counting PHIs at all btw - for the loops you match
> > you are expecting at most two, one IV and eventually one for the virtual
> > operand of the store?
> 
> Yes, I removed the counting for the phi loop and changed to
> gsi_next_nondebug for both loops.
> 
> > 
> > + if (gimple_has_volatile_ops (phi))
> > +   return false;
> > 
> > PHIs never have volatile ops.
> > 
> > + if 

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-06-25 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Wed, Jun 16, 2021 at 04:22:35PM +0200, Richard Biener wrote:
> On Mon, Jun 14, 2021 at 7:26 PM Stefan Schulze Frielinghaus
>  wrote:
> >
> > On Thu, May 20, 2021 at 08:37:24PM +0200, Stefan Schulze Frielinghaus wrote:
> > [...]
> > > > but we won't ever arrive here because of the niters condition.  But
> > > > yes, doing the pattern matching in the innermost loop processing code
> > > > looks good to me - for the specific case it would be
> > > >
> > > >   /* Don't distribute loop if niters is unknown.  */
> > > >   tree niters = number_of_latch_executions (loop);
> > > >   if (niters == NULL_TREE || niters == chrec_dont_know)
> > > > ---> here?
> > > > continue;
> > >
> > > Right, please find attached a new version of the patch where everything
> > > is included in the loop distribution pass.  I will do a bootstrap and
> > > regtest on IBM Z over night.  If you give me green light I will also do
> > > the same on x86_64.
> >
> > Meanwhile I gave it a shot on x86_64 where the testsuite runs fine (at
> > least the ldist-strlen testcase).  If you are Ok with the patch, then I
> > would rebase and run the testsuites again and post a patch series
> > including the rawmemchr implementation for IBM Z.
> 
> @@ -3257,6 +3261,464 @@ find_seed_stmts_for_distribution (class loop
> *loop, vec *work_list)
>return work_list->length () > 0;
>  }
> 
> +static void
> +generate_rawmemchr_builtin (loop_p loop, tree reduction_var,
> +   data_reference_p store_dr, tree base, tree 
> pattern,
> +   location_t loc)
> +{
> 
> this new function needs a comment.  Applies to all of the new ones, btw.

Done.

> +  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base))
> +  && TREE_TYPE (TREE_TYPE (base)) == TREE_TYPE 
> (pattern));
> 
> this looks fragile and is probably unnecessary as well.
> 
> +  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (base));
> 
> in general you want types_compatible_p () checks which for pointers means
> all pointers are compatible ...

True, I removed both asserts.

> (skipping stuff)
> 
> @@ -3321,10 +3783,20 @@ loop_distribution::execute (function *fun)
>   && !optimize_loop_for_speed_p (loop)))
> continue;
> 
> -  /* Don't distribute loop if niters is unknown.  */
> +  /* If niters is unknown don't distribute loop but rather try to 
> transform
> +it to a call to a builtin.  */
>tree niters = number_of_latch_executions (loop);
>if (niters == NULL_TREE || niters == chrec_dont_know)
> -   continue;
> +   {
> + if (transform_reduction_loop (loop))
> +   {
> + changed = true;
> + loops_to_be_destroyed.safe_push (loop);
> + if (dump_file)
> +   fprintf (dump_file, "Loop %d transformed into a
> builtin.\n", loop->num);
> +   }
> + continue;
> +   }
> 
> please look at
> 
>   if (nb_generated_loops + nb_generated_calls > 0)
> {
>   changed = true;
>   if (dump_enabled_p ())
> dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>  loc, "Loop%s %d distributed: split to
> %d loops "
>  "and %d library calls.\n", str, loop->num,
>  nb_generated_loops, nb_generated_calls);
> 
> and follow the use of dump_* and MSG_OPTIMIZED_LOCATIONS so the
> transforms are reported with -fopt-info-loop

Done.

> +
> +  return transform_reduction_loop_1 (loop, load_dr, store_dr, reduction_var);
> +}
> 
> what's the point in tail-calling here and visually splitting the
> function in half?

In the first place I thought that this is more pleasant since in
transform_reduction_loop_1 it is settled that we have a single load,
store, and reduction variable.  After refactoring this isn't true
anymore and I inlined the function and made this clear via a comment.

> 
> (sorry for picking random pieces now ;))
> 
> +  for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
> +  gsi_next (), ++ninsns)
> +   {
> 
> this counts debug insns, I guess you want gsi_next_nondebug at least.
> not sure why you are counting PHIs at all btw - for the loops you match
> you are expecting at most two, one IV and eventually one for the virtual
> operand of the store?

Yes, I removed the counting for the phi loop and changed to
gsi_next_nondebug for both loops.

> 
> + if (gimple_has_volatile_ops (phi))
> +   return false;
> 
> PHIs never have volatile ops.
> 
> + if (gimple_clobber_p (phi))
> +   continue;
> 
> or are clobbers.

Removed both.

> Btw, can you factor out a helper from find_single_drs working on a
> stmt to reduce code duplication?

Ahh sorry for that.  I've already done this in one of my first patches
but didn't copy that over.  Although my changes do not require a RDG the
whole pass is based 

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-06-14 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Thu, May 20, 2021 at 08:37:24PM +0200, Stefan Schulze Frielinghaus wrote:
[...]
> > but we won't ever arrive here because of the niters condition.  But
> > yes, doing the pattern matching in the innermost loop processing code
> > looks good to me - for the specific case it would be
> > 
> >   /* Don't distribute loop if niters is unknown.  */
> >   tree niters = number_of_latch_executions (loop);
> >   if (niters == NULL_TREE || niters == chrec_dont_know)
> > ---> here?
> > continue;
> 
> Right, please find attached a new version of the patch where everything
> is included in the loop distribution pass.  I will do a bootstrap and
> regtest on IBM Z over night.  If you give me green light I will also do
> the same on x86_64.

Meanwhile I gave it a shot on x86_64 where the testsuite runs fine (at
least the ldist-strlen testcase).  If you are Ok with the patch, then I
would rebase and run the testsuites again and post a patch series
including the rawmemchr implementation for IBM Z.

Cheers,
Stefan


Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-05-20 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Thu, May 20, 2021 at 11:24:57AM +0200, Richard Biener wrote:
> On Fri, May 7, 2021 at 2:32 PM Stefan Schulze Frielinghaus
>  wrote:
> >
> > On Wed, May 05, 2021 at 11:36:41AM +0200, Richard Biener wrote:
> > > On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus
> > >  wrote:
> > > >
> > > > [snip]
> > > >
> > > > Please find attached a new version of the patch.  A major change 
> > > > compared to
> > > > the previous patch is that I created a separate pass which hopefully 
> > > > makes
> > > > reviewing also easier since it is almost self-contained.  After 
> > > > realizing that
> > > > detecting loops which mimic the behavior of rawmemchr/strlen functions 
> > > > does not
> > > > really fit into the topic of loop distribution, I created a separate 
> > > > pass.
> > >
> > > It's true that these reduction-like patterns are more difficult than
> > > the existing
> > > memcpy/memset cases.
> > >
> > > >  Due
> > > > to this I was also able to play around a bit and schedule the pass at 
> > > > different
> > > > times.  Currently it is scheduled right before loop distribution where 
> > > > loop
> > > > header copying already took place which leads to the following effect.
> > >
> > > In fact I'd schedule it after loop distribution so there's the chance 
> > > that loop
> > > distribution can expose a loop that fits the new pattern.
> > >
> > > >  Running
> > > > this setup over
> > > >
> > > > char *t (char *p)
> > > > {
> > > >   for (; *p; ++p);
> > > >   return p;
> > > > }
> > > >
> > > > the new pass transforms
> > > >
> > > > char * t (char * p)
> > > > {
> > > >   char _1;
> > > >   char _7;
> > > >
> > > >[local count: 118111600]:
> > > >   _7 = *p_3(D);
> > > >   if (_7 != 0)
> > > > goto ; [89.00%]
> > > >   else
> > > > goto ; [11.00%]
> > > >
> > > >[local count: 105119324]:
> > > >
> > > >[local count: 955630225]:
> > > >   # p_8 = PHI 
> > > >   p_6 = p_8 + 1;
> > > >   _1 = *p_6;
> > > >   if (_1 != 0)
> > > > goto ; [89.00%]
> > > >   else
> > > > goto ; [11.00%]
> > > >
> > > >[local count: 105119324]:
> > > >   # p_2 = PHI 
> > > >   goto ; [100.00%]
> > > >
> > > >[local count: 850510901]:
> > > >   goto ; [100.00%]
> > > >
> > > >[local count: 12992276]:
> > > >
> > > >[local count: 118111600]:
> > > >   # p_9 = PHI 
> > > >   return p_9;
> > > >
> > > > }
> > > >
> > > > into
> > > >
> > > > char * t (char * p)
> > > > {
> > > >   char * _5;
> > > >   char _7;
> > > >
> > > >[local count: 118111600]:
> > > >   _7 = *p_3(D);
> > > >   if (_7 != 0)
> > > > goto ; [89.00%]
> > > >   else
> > > > goto ; [11.00%]
> > > >
> > > >[local count: 105119324]:
> > > >   _5 = p_3(D) + 1;
> > > >   p_10 = .RAWMEMCHR (_5, 0);
> > > >
> > > >[local count: 118111600]:
> > > >   # p_9 = PHI 
> > > >   return p_9;
> > > >
> > > > }
> > > >
> > > > which is fine so far.  However, I haven't made up my mind so far 
> > > > whether it is
> > > > worthwhile to spend more time in order to also eliminate the "first 
> > > > unrolling"
> > > > of the loop.
> > >
> > > Might be a phiopt transform ;)  Might apply to quite some set of
> > > builtins.  I wonder how the strlen case looks like though.
> > >
> > > > I gave it a shot by scheduling the pass prior to the copy-header pass
> > > > and ended up with:
> > > >
> > > > char * t (char * p)
> > > > {
> > > >[local count: 118111600]:
> > > >   p_5 = .RAWMEMCHR (p_3(D), 0);
> > > >   return p_5;
> > > >
> > > > }
> > > >
> > > > which seems optimal to me.  The downside of this is that I have to 
> > > > initialize
> > > > scalar evolution analysis which might be undesired that early.
> > > >
> > > > All this brings me to the question where do you see this piece of code 
> > > > running?
> > > > If in a separate pass when would you schedule it?  If in an existing 
> > > > pass,
> > > > which one would you choose?
> > >
> > > I think it still fits loop distribution.  If you manage to detect it
> > > with your pass
> > > standalone then you should be able to detect it in loop distribution.
> >
> > If a loop is distributed only because one of the partitions matches a
> > rawmemchr/strlen-like loop pattern, then we have at least two partitions
> > which walk over the same memory region.  Since a rawmemchr/strlen-like
> > loop has no body (neglecting expression-3 of a for-loop where just an
> > increment happens) it is governed by the memory accesses in the loop
> > condition.  Therefore, in such a case loop distribution would result in
> > performance degradation.  This is why I think that it does not fit
> > conceptually into ldist pass.  However, since I make use of a couple of
> > helper functions from ldist pass, it may still fit technically.
> >
> > Since currently all ldist optimizations operate over loops where niters
> > is known and for rawmemchr/strlen-like loops this is not the case, it is
> > not possible that those optimizations expose a loop which is suitable
> > for 

[PATCH] testsuite: Fix input operands of gcc.dg/guality/pr43077-1.c

2021-05-11 Thread Stefan Schulze Frielinghaus via Gcc-patches
The type of the output operands *p and *q of the extended asm statement
of function foo is unsigned long, whereas the type of the corresponding
input operands is int.  On IBM Z, for example, this results in the
immediates 2 and 3 being written into registers in SImode but read back
in DImode, yielding wrong values.  Fixed by lifting the input operands
to type long.
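
For illustration, the problematic pattern reduced to a single operand
(a sketch, assuming a 64-bit target; not the literal testcase):

  void
  demo (void)
  {
    unsigned long x;
    /* The matching constraint "0" ties the input to output operand 0,
       which has DImode (unsigned long), while the literal 2 has SImode
       (int): the register is written in SImode but read back in DImode,
       leaving the upper 32 bits undefined.  */
    asm volatile ("" : "=r" (x) : "0" (2));
    /* With the l suffix both operands have DImode and the value is
       well defined.  */
    asm volatile ("" : "=r" (x) : "0" (2l));
  }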

gcc/testsuite/ChangeLog:

* gcc.dg/guality/pr43077-1.c: Align types of output and input
operands by lifting immediates to type long.

Ok for mainline?

---
 gcc/testsuite/gcc.dg/guality/pr43077-1.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/guality/pr43077-1.c b/gcc/testsuite/gcc.dg/guality/pr43077-1.c
index 39bd26aae01..2d9376298d4 100644
--- a/gcc/testsuite/gcc.dg/guality/pr43077-1.c
+++ b/gcc/testsuite/gcc.dg/guality/pr43077-1.c
@@ -24,7 +24,7 @@ int __attribute__((noinline))
 foo (unsigned long *p, unsigned long *q)
 {
   int ret;
-  asm volatile ("" : "=r" (ret), "=r" (*p), "=r" (*q) : "0" (1), "1" (2), "2" (3));
+  asm volatile ("" : "=r" (ret), "=r" (*p), "=r" (*q) : "0" (1), "1" (2l), "2" (3l));
   return ret;
 }
 
-- 
2.23.0



Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-05-07 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Wed, May 05, 2021 at 11:36:41AM +0200, Richard Biener wrote:
> On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus
>  wrote:
> >
> > [snip]
> >
> > Please find attached a new version of the patch.  A major change compared to
> > the previous patch is that I created a separate pass which hopefully makes
> > reviewing also easier since it is almost self-contained.  After realizing 
> > that
> > detecting loops which mimic the behavior of rawmemchr/strlen functions does 
> > not
> > really fit into the topic of loop distribution, I created a separate pass.
> 
> It's true that these reduction-like patterns are more difficult than
> the existing
> memcpy/memset cases.
> 
> >  Due
> > to this I was also able to play around a bit and schedule the pass at 
> > different
> > times.  Currently it is scheduled right before loop distribution where loop
> > header copying already took place which leads to the following effect.
> 
> In fact I'd schedule it after loop distribution so there's the chance that 
> loop
> distribution can expose a loop that fits the new pattern.
> 
> >  Running
> > this setup over
> >
> > char *t (char *p)
> > {
> >   for (; *p; ++p);
> >   return p;
> > }
> >
> > the new pass transforms
> >
> > char * t (char * p)
> > {
> >   char _1;
> >   char _7;
> >
> >[local count: 118111600]:
> >   _7 = *p_3(D);
> >   if (_7 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> >
> >[local count: 105119324]:
> >
> >[local count: 955630225]:
> >   # p_8 = PHI 
> >   p_6 = p_8 + 1;
> >   _1 = *p_6;
> >   if (_1 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> >
> >[local count: 105119324]:
> >   # p_2 = PHI 
> >   goto ; [100.00%]
> >
> >[local count: 850510901]:
> >   goto ; [100.00%]
> >
> >[local count: 12992276]:
> >
> >[local count: 118111600]:
> >   # p_9 = PHI 
> >   return p_9;
> >
> > }
> >
> > into
> >
> > char * t (char * p)
> > {
> >   char * _5;
> >   char _7;
> >
> >[local count: 118111600]:
> >   _7 = *p_3(D);
> >   if (_7 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> >
> >[local count: 105119324]:
> >   _5 = p_3(D) + 1;
> >   p_10 = .RAWMEMCHR (_5, 0);
> >
> >[local count: 118111600]:
> >   # p_9 = PHI 
> >   return p_9;
> >
> > }
> >
> > which is fine so far.  However, I haven't made up my mind so far whether it 
> > is
> > worthwhile to spend more time in order to also eliminate the "first 
> > unrolling"
> > of the loop.
> 
> Might be a phiopt transform ;)  Might apply to quite some set of
> builtins.  I wonder how the strlen case looks like though.
> 
> > I gave it a shot by scheduling the pass prior to the copy-header pass
> > and ended up with:
> >
> > char * t (char * p)
> > {
> >[local count: 118111600]:
> >   p_5 = .RAWMEMCHR (p_3(D), 0);
> >   return p_5;
> >
> > }
> >
> > which seems optimal to me.  The downside of this is that I have to 
> > initialize
> > scalar evolution analysis which might be undesired that early.
> >
> > All this brings me to the question where do you see this piece of code 
> > running?
> > If in a separate pass when would you schedule it?  If in an existing pass,
> > which one would you choose?
> 
> I think it still fits loop distribution.  If you manage to detect it
> with your pass
> standalone then you should be able to detect it in loop distribution.

If a loop is distributed only because one of the partitions matches a
rawmemchr/strlen-like loop pattern, then we have at least two partitions
which walk over the same memory region.  Since a rawmemchr/strlen-like
loop has no body (neglecting expression-3 of a for-loop where just an
increment happens) it is governed by the memory accesses in the loop
condition.  Therefore, in such a case loop distribution would result in
performance degradation.  This is why I think that it does not fit
conceptually into ldist pass.  However, since I make use of a couple of
helper functions from ldist pass, it may still fit technically.
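
To make "no body" concrete, this is the kind of loop meant here (a
sketch):

  /* All work happens in the loop condition; splitting such a loop into
     a partition of its own would only duplicate the walk over the same
     memory.  */
  static char *
  skip_until (char *p, char c)
  {
    while (*p != c)
      ++p;
    return p;
  }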

Since currently all ldist optimizations operate over loops where niters
is known and for rawmemchr/strlen-like loops this is not the case, it is
not possible that those optimizations expose a loop which is suitable
for rawmemchr/strlen optimization.  Therefore, what do you think about
scheduling rawmemchr/strlen optimization right between those
if-statements of function loop_distribution::execute?

   if (nb_generated_loops + nb_generated_calls > 0)
 {
   changed = true;
   if (dump_enabled_p ())
 dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
  loc, "Loop%s %d distributed: split to %d loops "
  "and %d library calls.\n", str, loop->num,
  nb_generated_loops, nb_generated_calls);

   break;
 }

   // rawmemchr/strlen like loops

   if (dump_file && (dump_flags & TDF_DETAILS))
 fprintf (dump_file, "Loop%s %d not distributed.\n", str, loop->num);

> Can you
> explain what part is "easier" as 

[PATCH] PR rtl-optimization/100263: Ensure register can change mode

2021-05-05 Thread Stefan Schulze Frielinghaus via Gcc-patches
For move2add_valid_value_p we also have to ask the target whether a
register can be accessed in a different mode than it was set before.

gcc/ChangeLog:

PR rtl-optimization/100263
* postreload.c (move2add_valid_value_p): Ensure register can
change mode.

Bootstrapped and regtested releases/gcc-{8,9,10,11} and master on IBM Z.
Ok for those branches?

---
 gcc/postreload.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/postreload.c b/gcc/postreload.c
index dc67643384d..60a622dbaf3 100644
--- a/gcc/postreload.c
+++ b/gcc/postreload.c
@@ -1725,7 +1725,8 @@ move2add_valid_value_p (int regno, scalar_int_mode mode)
 {
   scalar_int_mode old_mode;
   if (!is_a <scalar_int_mode> (reg_mode[regno], &old_mode)
- || !MODES_OK_FOR_MOVE2ADD (mode, old_mode))
+ || !MODES_OK_FOR_MOVE2ADD (mode, old_mode)
+ || !REG_CAN_CHANGE_MODE_P (regno, old_mode, mode))
return false;
   /* The value loaded into regno in reg_mode[regno] is also valid in
 mode after truncation only if (REG:mode regno) is the lowpart of
-- 
2.23.0



Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-05-04 Thread Stefan Schulze Frielinghaus via Gcc-patches
ping

On Thu, Apr 08, 2021 at 10:23:31AM +0200, Stefan Schulze Frielinghaus wrote:
> ping
> 
> On Tue, Mar 16, 2021 at 06:13:21PM +0100, Stefan Schulze Frielinghaus wrote:
> > [snip]
> > 
> > Please find attached a new version of the patch.  A major change compared to
> > the previous patch is that I created a separate pass which hopefully makes
> > reviewing also easier since it is almost self-contained.  After realizing 
> > that
> > detecting loops which mimic the behavior of rawmemchr/strlen functions does 
> > not
> > really fit into the topic of loop distribution, I created a separate pass.  
> > Due
> > to this I was also able to play around a bit and schedule the pass at 
> > different
> > times.  Currently it is scheduled right before loop distribution where loop
> > header copying already took place which leads to the following effect.  
> > Running
> > this setup over
> > 
> > char *t (char *p)
> > {
> >   for (; *p; ++p);
> >   return p;
> > }
> > 
> > the new pass transforms
> > 
> > char * t (char * p)
> > {
> >   char _1;
> >   char _7;
> > 
> >[local count: 118111600]:
> >   _7 = *p_3(D);
> >   if (_7 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> > 
> >[local count: 105119324]:
> > 
> >[local count: 955630225]:
> >   # p_8 = PHI 
> >   p_6 = p_8 + 1;
> >   _1 = *p_6;
> >   if (_1 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> > 
> >[local count: 105119324]:
> >   # p_2 = PHI 
> >   goto ; [100.00%]
> > 
> >[local count: 850510901]:
> >   goto ; [100.00%]
> > 
> >[local count: 12992276]:
> > 
> >[local count: 118111600]:
> >   # p_9 = PHI 
> >   return p_9;
> > 
> > }
> > 
> > into
> > 
> > char * t (char * p)
> > {
> >   char * _5;
> >   char _7;
> > 
> >[local count: 118111600]:
> >   _7 = *p_3(D);
> >   if (_7 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> > 
> >[local count: 105119324]:
> >   _5 = p_3(D) + 1;
> >   p_10 = .RAWMEMCHR (_5, 0);
> > 
> >[local count: 118111600]:
> >   # p_9 = PHI 
> >   return p_9;
> > 
> > }
> > 
> > which is fine so far.  However, I haven't made up my mind so far whether it 
> > is
> > worthwhile to spend more time in order to also eliminate the "first 
> > unrolling"
> > of the loop.  I gave it a shot by scheduling the pass prior to the copy-header pass
> > and ended up with:
> > 
> > char * t (char * p)
> > {
> >[local count: 118111600]:
> >   p_5 = .RAWMEMCHR (p_3(D), 0);
> >   return p_5;
> > 
> > }
> > 
> > which seems optimal to me.  The downside of this is that I have to 
> > initialize
> > scalar evolution analysis which might be undesired that early.
> > 
> > All this brings me to the question where do you see this piece of code 
> > running?
> > If in a separate pass when would you schedule it?  If in an existing pass,
> > which one would you choose?
> > 
> > Another topic which came up is whether there exists a more elegant solution 
> > to
> > my current implementation in order to deal with stores (I'm speaking of the 
> > `if
> > (store_dr)` statement inside of function transform_loop_1).  For example,
> > 
> > extern char *p;
> > char *t ()
> > {
> >   for (; *p; ++p);
> >   return p;
> > }
> > 
> > ends up as
> > 
> > char * t ()
> > {
> >   char * _1;
> >   char * _2;
> >   char _3;
> >   char * p.1_8;
> >   char _9;
> >   char * p.1_10;
> >   char * p.1_11;
> > 
> >[local count: 118111600]:
> >   p.1_8 = p;
> >   _9 = *p.1_8;
> >   if (_9 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> > 
> >[local count: 105119324]:
> > 
> >[local count: 955630225]:
> >   # p.1_10 = PHI <_1(6), p.1_8(5)>
> >   _1 = p.1_10 + 1;
> >   p = _1;
> >   _3 = *_1;
> >   if (_3 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> > 
> >[local count: 105119324]:
> >   # _2 = PHI <_1(3)>
> >   goto ; [100.00%]
> > 
> >[local count: 850510901]:
> >   goto ; [100.00%]
> > 
> >[local count: 12992276]:
> > 
> >[local count: 118111600]:
> >   # p.1_11 = PHI <_2(8), p.1_8(7)>
> >   return p.1_11;
> > 
> > }
> > 
> > where inside the loop a load and store occurs.  For a rawmemchr like loop I
> > have to show that we never load from a memory location to which we write.
> > Currently I solve this by hard coding those facts which are not generic at 
> > all.
> > I gave compute_data_dependences_for_loop a try which failed to determine the
> > fact that stores only happen to p[0] and loads from p[i] where i>0.  Maybe
> > there are more generic solutions to express this in contrast to my current 
> > one?
> > 
> > Thanks again for your input so far.  Really appreciated.
> > 
> > Cheers,
> > Stefan
> 
> > diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> > index 8a5fb3fd99c..7b2d7405277 100644
> > --- a/gcc/Makefile.in
> > +++ b/gcc/Makefile.in
> > @@ -1608,6 +1608,7 @@ OBJS = \
> > tree-into-ssa.o \
> > tree-iterator.o \
> > tree-loop-distribution.o \
> > +   tree-loop-pattern.o \
> > tree-nested.o \
> >

[PATCH] testsuite: Xfail gcc.dg/vect/pr71264.c on IBM Z

2021-04-20 Thread Stefan Schulze Frielinghaus via Gcc-patches
The test fails for targets with V4QImode support, which is the case for
IBM Z.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/pr71264.c: Xfail on IBM Z due to V4QImode support.

Ok for mainline?

---
 gcc/testsuite/gcc.dg/vect/pr71264.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr71264.c b/gcc/testsuite/gcc.dg/vect/pr71264.c
index 5f6407a2411..dc849bf2797 100644
--- a/gcc/testsuite/gcc.dg/vect/pr71264.c
+++ b/gcc/testsuite/gcc.dg/vect/pr71264.c
@@ -19,5 +19,5 @@ void test(uint8_t *ptr, uint8_t *mask)
 }
 }
 
-/* { dg-final { scan-tree-dump "vectorized 1 loops in function" "vect" { xfail sparc*-*-* } } } */
+/* { dg-final { scan-tree-dump "vectorized 1 loops in function" "vect" { xfail s390*-*-* sparc*-*-* } } } */
 
-- 
2.23.0



[PATCH] testsuite: Fix gcc.dg/vect/bb-slp-39.c on IBM Z

2021-04-20 Thread Stefan Schulze Frielinghaus via Gcc-patches
On IBM Z the aliasing stores are realized through one-element vector
instructions if no cost model for vectorization is used, which is the
default according to vect.exp.  Fixed by changing the number of times
the pattern must be found in the dump.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/bb-slp-39.c: Change number of times the pattern
must match for target IBM Z only.

Ok for mainline?

---
 gcc/testsuite/gcc.dg/vect/bb-slp-39.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-39.c b/gcc/testsuite/gcc.dg/vect/bb-slp-39.c
index 255bb1095dc..ee596cfa08b 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-39.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-39.c
@@ -16,4 +16,5 @@ void foo (double *p)
 }
 
 /* See that we vectorize three SLP instances.  */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "slp2" { target { ! s390*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 5 "slp2" { target {   s390*-*-* } } } } */
-- 
2.23.0



[PATCH] testsuite: Fix up gcc.target/s390/zero-scratch-regs-1.c

2021-04-20 Thread Stefan Schulze Frielinghaus via Gcc-patches
Depending on whether GCC is configured with --with-mode=zarch or not,
instructions for the 31-bit target are generated either for ESA or
z/Architecture.  For the sake of simplicity and robustness, test only
for the latter by adding option -mzarch manually.

gcc/testsuite/ChangeLog:

* gcc.target/s390/zero-scratch-regs-1.c: Force test to run for
z/Architecture only.

Ok for mainline?

---
 .../gcc.target/s390/zero-scratch-regs-1.c | 95 ---
 1 file changed, 40 insertions(+), 55 deletions(-)

diff --git a/gcc/testsuite/gcc.target/s390/zero-scratch-regs-1.c b/gcc/testsuite/gcc.target/s390/zero-scratch-regs-1.c
index c394c4b69e7..1c02c0c4e51 100644
--- a/gcc/testsuite/gcc.target/s390/zero-scratch-regs-1.c
+++ b/gcc/testsuite/gcc.target/s390/zero-scratch-regs-1.c
@@ -1,65 +1,50 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fzero-call-used-regs=all -march=z13" } */
+/* { dg-options "-O2 -fzero-call-used-regs=all -march=z13 -mzarch" } */
 
 /* Ensure that all call clobbered GPRs, FPRs, and VRs are zeroed and all call
saved registers are kept. */
 
 void foo (void) { }
 
-/* { dg-final { scan-assembler-times "lhi\t" 6 { target { ! lp64 } } } } */
-/* { dg-final { scan-assembler "lhi\t%r0,0" { target { ! lp64 } } } } */
-/* { dg-final { scan-assembler "lhi\t%r1,0" { target { ! lp64 } } } } */
-/* { dg-final { scan-assembler "lhi\t%r2,0" { target { ! lp64 } } } } */
-/* { dg-final { scan-assembler "lhi\t%r3,0" { target { ! lp64 } } } } */
-/* { dg-final { scan-assembler "lhi\t%r4,0" { target { ! lp64 } } } } */
-/* { dg-final { scan-assembler "lhi\t%r5,0" { target { ! lp64 } } } } */
+/* { dg-final { scan-assembler-times "lghi\t" 6 } } */
+/* { dg-final { scan-assembler "lghi\t%r0,0" } } */
+/* { dg-final { scan-assembler "lghi\t%r1,0" } } */
+/* { dg-final { scan-assembler "lghi\t%r2,0" } } */
+/* { dg-final { scan-assembler "lghi\t%r3,0" } } */
+/* { dg-final { scan-assembler "lghi\t%r4,0" } } */
+/* { dg-final { scan-assembler "lghi\t%r5,0" } } */
 
-/* { dg-final { scan-assembler-times "lzdr\t" 14 { target { ! lp64 } } } } */
-/* { dg-final { scan-assembler "lzdr\t%f0" { target { ! lp64 } } } } */
-/* { dg-final { scan-assembler "lzdr\t%f1" { target { ! lp64 } } } } */
-/* { dg-final { scan-assembler "lzdr\t%f2" { target { ! lp64 } } } } */
-/* { dg-final { scan-assembler "lzdr\t%f3" { target { ! lp64 } } } } */
-/* { dg-final { scan-assembler "lzdr\t%f5" { target { ! lp64 } } } } */
-/* { dg-final { scan-assembler "lzdr\t%f7" { target { ! lp64 } } } } */
-/* { dg-final { scan-assembler "lzdr\t%f8" { target { ! lp64 } } } } */
-/* { dg-final { scan-assembler "lzdr\t%f9" { target { ! lp64 } } } } */
-/* { dg-final { scan-assembler "lzdr\t%f10" { target { ! lp64 } } } } */
-/* { dg-final { scan-assembler "lzdr\t%f11" { target { ! lp64 } } } } */
-/* { dg-final { scan-assembler "lzdr\t%f12" { target { ! lp64 } } } } */
-/* { dg-final { scan-assembler "lzdr\t%f13" { target { ! lp64 } } } } */
-/* { dg-final { scan-assembler "lzdr\t%f14" { target { ! lp64 } } } } */
-/* { dg-final { scan-assembler "lzdr\t%f15" { target { ! lp64 } } } } */
-
-/* { dg-final { scan-assembler-times "lghi\t" 6 { target { lp64 } } } } */
-/* { dg-final { scan-assembler "lghi\t%r0,0" { target { lp64 } } } } */
-/* { dg-final { scan-assembler "lghi\t%r1,0" { target { lp64 } } } } */
-/* { dg-final { scan-assembler "lghi\t%r2,0" { target { lp64 } } } } */
-/* { dg-final { scan-assembler "lghi\t%r3,0" { target { lp64 } } } } */
-/* { dg-final { scan-assembler "lghi\t%r4,0" { target { lp64 } } } } */
-/* { dg-final { scan-assembler "lghi\t%r5,0" { target { lp64 } } } } */
-
-/* { dg-final { scan-assembler-times "vzero\t" 24 { target { lp64 } } } } */
-/* { dg-final { scan-assembler "vzero\t%v0" { target { lp64 } } } } */
-/* { dg-final { scan-assembler "vzero\t%v1" { target { lp64 } } } } */
-/* { dg-final { scan-assembler "vzero\t%v2" { target { lp64 } } } } */
-/* { dg-final { scan-assembler "vzero\t%v3" { target { lp64 } } } } */
+/* { dg-final { scan-assembler-times "vzero\t" 30 { target { ! lp64 } } } } */
+/* { dg-final { scan-assembler-times "vzero\t" 24 { target {   lp64 } } } } */
+/* { dg-final { scan-assembler "vzero\t%v0" } } */
+/* { dg-final { scan-assembler "vzero\t%v1" } } */
+/* { dg-final { scan-assembler "vzero\t%v2" } } */
+/* { dg-final { scan-assembler "vzero\t%v3" } } */
 /* { dg-final { scan-assembler "vzero\t%v4" { target { lp64 } } } } */
-/* { dg-final { scan-assembler "vzero\t%v5" { target { lp64 } } } } */
+/* { dg-final { scan-assembler "vzero\t%v5" } } */
 /* { dg-final { scan-assembler "vzero\t%v6" { target { lp64 } } } } */
-/* { dg-final { scan-assembler "vzero\t%v7" { target { lp64 } } } } */
-/* { dg-final { scan-assembler "vzero\t%v16" { target { lp64 } } } } */
-/* { dg-final { scan-assembler "vzero\t%v17" { target { lp64 } } } } */
-/* { dg-final { scan-assembler "vzero\t%v18" { target { lp64 } } } } */
-/* { dg-final { scan-assembler "vzero\t%v19" { target { 

[PATCH] testsuite: Fix pr83403-{1,2}.c on IBM Z

2021-04-16 Thread Stefan Schulze Frielinghaus via Gcc-patches
For z10 and newer, inner loops are completely unrolled, which means
store motion is not applied.  Reverting max-completely-peeled-insns to
the default value fixes these testcases.
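
The shape of the affected loops is roughly this (a sketch, not the
literal testcase):

  #define N 32

  unsigned int u[N];

  void
  f (unsigned int v)
  {
    for (int i = 0; i < N; i++)
      for (int j = 0; j < 8; j++)
        u[i] += v;   /* the store to u[i] is invariant in the j loop */
  }

Store motion would promote u[i] to a register across the inner loop,
but with the higher complete-peeling limits of the z10+ tuning the
inner loop is fully unrolled first, leaving no loop to move the store
out of.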

Ok for mainline?

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/pr83403-1.c: Revert
max-completely-peeled-insns to the default value on IBM Z.
* gcc.dg/tree-ssa/pr83403-2.c: Likewise.
---
 gcc/testsuite/gcc.dg/tree-ssa/pr83403-1.c | 1 +
 gcc/testsuite/gcc.dg/tree-ssa/pr83403-2.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr83403-1.c b/gcc/testsuite/gcc.dg/tree-ssa/pr83403-1.c
index 748375b03af..bfc703d1aa6 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr83403-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr83403-1.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O3 -funroll-loops -fdump-tree-lim2-details" } */
+/* { dg-additional-options "--param max-completely-peeled-insns=200" { target { s390*-*-* } } } */
 
 #define TYPE unsigned int
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr83403-2.c b/gcc/testsuite/gcc.dg/tree-ssa/pr83403-2.c
index ca2e6bbd61c..9130d9bd583 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr83403-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr83403-2.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O3 -funroll-loops -fdump-tree-lim2-details" } */
+/* { dg-additional-options "--param max-completely-peeled-insns=200" { target { s390*-*-* } } } */
 
 #define TYPE int
 
-- 
2.23.0



[PATCH] testsuite: Enable zero-scratch-regs-{8,9,10,11}.c on s390*

2021-04-15 Thread Stefan Schulze Frielinghaus via Gcc-patches
On s390* the only missing part for the mentioned testcases was a load
of a double floating-point zero via a move (in particular for quite old
machines), which was added in commit 46c47420a5fefd4d9d02b0db347235dd74e20fb2.
The common code implementation is sufficient to clear volatile GPRs,
FPRs, and VRs.  Access registers a0 and a1 are nonvolatile and not
cleared.  Therefore, target hook TARGET_ZERO_CALL_USED_REGS is not
implemented for s390*.

Added a target specific test in order to ensure that all call clobbered
GPRs, FPRs, and VRs are zeroed and all call saved registers are kept.
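
A minimal way to see what the new target test checks (a sketch):

  /* With gcc -O2 -march=z13 -fzero-call-used-regs=all -S, the epilogue
     of foo is expected to zero the call-clobbered GPRs r0-r5 via lghi
     (lhi for -m31) and the call-clobbered VRs via vzero, while
     call-saved registers such as r6-r15 and f8-f15 stay untouched.  */
  void foo (void) { }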

Ok for mainline?

gcc/testsuite/ChangeLog:

* c-c++-common/zero-scratch-regs-8.c: Enable on s390*.
* c-c++-common/zero-scratch-regs-9.c: Likewise.
* c-c++-common/zero-scratch-regs-10.c: Likewise.
* c-c++-common/zero-scratch-regs-11.c: Likewise.
* gcc.target/s390/zero-scratch-regs-1.c: New test.
---
 .../c-c++-common/zero-scratch-regs-10.c   |  2 +-
 .../c-c++-common/zero-scratch-regs-11.c   |  2 +-
 .../c-c++-common/zero-scratch-regs-8.c|  2 +-
 .../c-c++-common/zero-scratch-regs-9.c|  2 +-
 .../gcc.target/s390/zero-scratch-regs-1.c | 65 +++
 5 files changed, 69 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/s390/zero-scratch-regs-1.c

diff --git a/gcc/testsuite/c-c++-common/zero-scratch-regs-10.c b/gcc/testsuite/c-c++-common/zero-scratch-regs-10.c
index ab17143bc4b..96e0b79b328 100644
--- a/gcc/testsuite/c-c++-common/zero-scratch-regs-10.c
+++ b/gcc/testsuite/c-c++-common/zero-scratch-regs-10.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-skip-if "not implemented" { ! { i?86*-*-* x86_64*-*-* sparc*-*-* aarch64*-*-* nvptx*-*-* } } } */
+/* { dg-skip-if "not implemented" { ! { i?86*-*-* x86_64*-*-* sparc*-*-* aarch64*-*-* nvptx*-*-* s390*-*-* } } } */
 /* { dg-options "-O2" } */
 
 #include 
diff --git a/gcc/testsuite/c-c++-common/zero-scratch-regs-11.c b/gcc/testsuite/c-c++-common/zero-scratch-regs-11.c
index 6642a377798..0714f95a04f 100644
--- a/gcc/testsuite/c-c++-common/zero-scratch-regs-11.c
+++ b/gcc/testsuite/c-c++-common/zero-scratch-regs-11.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-skip-if "not implemented" { ! { i?86*-*-* x86_64*-*-* sparc*-*-* aarch64*-*-* arm*-*-* nvptx*-*-* } } } */
+/* { dg-skip-if "not implemented" { ! { i?86*-*-* x86_64*-*-* sparc*-*-* aarch64*-*-* arm*-*-* nvptx*-*-* s390*-*-* } } } */
 /* { dg-options "-O2 -fzero-call-used-regs=all" } */
 
 #include "zero-scratch-regs-10.c"
diff --git a/gcc/testsuite/c-c++-common/zero-scratch-regs-8.c b/gcc/testsuite/c-c++-common/zero-scratch-regs-8.c
index 867c6bdce2c..aceda7e5cb8 100644
--- a/gcc/testsuite/c-c++-common/zero-scratch-regs-8.c
+++ b/gcc/testsuite/c-c++-common/zero-scratch-regs-8.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-skip-if "not implemented" { ! { i?86*-*-* x86_64*-*-* sparc*-*-* aarch64*-*-* arm*-*-* nvptx*-*-* } } } */
+/* { dg-skip-if "not implemented" { ! { i?86*-*-* x86_64*-*-* sparc*-*-* aarch64*-*-* arm*-*-* nvptx*-*-* s390*-*-* } } } */
 /* { dg-options "-O2 -fzero-call-used-regs=all-arg" } */
 
 #include "zero-scratch-regs-1.c"
diff --git a/gcc/testsuite/c-c++-common/zero-scratch-regs-9.c b/gcc/testsuite/c-c++-common/zero-scratch-regs-9.c
index 4b45d7061df..f3152a7a732 100644
--- a/gcc/testsuite/c-c++-common/zero-scratch-regs-9.c
+++ b/gcc/testsuite/c-c++-common/zero-scratch-regs-9.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-skip-if "not implemented" { ! { i?86*-*-* x86_64*-*-* sparc*-*-* aarch64*-*-* arm*-*-* nvptx*-*-* } } } */
+/* { dg-skip-if "not implemented" { ! { i?86*-*-* x86_64*-*-* sparc*-*-* aarch64*-*-* arm*-*-* nvptx*-*-* s390*-*-* } } } */
 /* { dg-options "-O2 -fzero-call-used-regs=all" } */
 
 #include "zero-scratch-regs-1.c"
diff --git a/gcc/testsuite/gcc.target/s390/zero-scratch-regs-1.c b/gcc/testsuite/gcc.target/s390/zero-scratch-regs-1.c
new file mode 100644
index 000..c394c4b69e7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/s390/zero-scratch-regs-1.c
@@ -0,0 +1,65 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fzero-call-used-regs=all -march=z13" } */
+
+/* Ensure that all call clobbered GPRs, FPRs, and VRs are zeroed and all call
+   saved registers are kept. */
+
+void foo (void) { }
+
+/* { dg-final { scan-assembler-times "lhi\t" 6 { target { ! lp64 } } } } */
+/* { dg-final { scan-assembler "lhi\t%r0,0" { target { ! lp64 } } } } */
+/* { dg-final { scan-assembler "lhi\t%r1,0" { target { ! lp64 } } } } */
+/* { dg-final { scan-assembler "lhi\t%r2,0" { target { ! lp64 } } } } */
+/* { dg-final { scan-assembler "lhi\t%r3,0" { target { ! lp64 } } } } */
+/* { dg-final { scan-assembler "lhi\t%r4,0" { target { ! lp64 } } } } */
+/* { dg-final { scan-assembler "lhi\t%r5,0" { target { ! lp64 } } } } */
+
+/* { dg-final { scan-assembler-times "lzdr\t" 14 { target { ! lp64 } } } } */
+/* { dg-final { scan-assembler "lzdr\t%f0" { target { ! lp64 } } } } */
+/* 

[PATCH] testsuite: Fix unroll-and-jam.c on IBM Z

2021-04-15 Thread Stefan Schulze Frielinghaus via Gcc-patches
For z10 and newer, inner loops are completely unrolled, which leaves no
inner loops to jam and causes this testcase to fail.  Reverting
max-completely-peel-times to the default value fixes this testcase.
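
Any jam candidate of this shape reproduces the situation (a sketch):

  /* unroll-and-jam unrolls the outer loop and fuses the copies of the
     inner loop -- which requires an inner loop to still be around.  If
     the constant-trip inner loop is completely peeled beforehand, there
     is nothing left to jam.  */
  void
  g (int a[64][4])
  {
    for (int i = 0; i < 64; ++i)
      for (int j = 0; j < 4; ++j)   /* complete-peeling candidate */
        a[i][j] += 1;
  }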

gcc/testsuite/ChangeLog:

* gcc.dg/unroll-and-jam.c: Revert max-completely-peel-times to
the default value on IBM Z.

Ok for mainline?

---
 gcc/testsuite/gcc.dg/unroll-and-jam.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/testsuite/gcc.dg/unroll-and-jam.c b/gcc/testsuite/gcc.dg/unroll-and-jam.c
index 7eb64217a05..b8f4f16dc74 100644
--- a/gcc/testsuite/gcc.dg/unroll-and-jam.c
+++ b/gcc/testsuite/gcc.dg/unroll-and-jam.c
@@ -1,5 +1,6 @@
 /* { dg-do run } */
 /* { dg-options "-O3 -floop-unroll-and-jam -fno-tree-loop-im --param unroll-jam-min-percent=0 -fdump-tree-unrolljam-details" } */
+/* { dg-additional-options "--param max-completely-peel-times=16" { target { s390*-*-* } } } */
 /* { dg-require-effective-target int32plus } */
 
 #include 
-- 
2.23.0



[PATCH] re PR tree-optimization/93210 (Sub-optimal code optimization on struct/combound constexpr (gcc vs. clang))

2021-04-14 Thread Stefan Schulze Frielinghaus via Gcc-patches
Regarding test gcc.dg/pr93210.c, on different targets the GIMPLE code
may differ slightly, which is why the scan-tree-dump-times directive
may fail.  For example, for a RETURN_EXPR on x86_64 we have

  return 0x11100f0e0d0c0a090807060504030201;

whereas on IBM Z the first operand is a RESULT_DECL like

  <retval> = 0x102030405060708090a0c0d0e0f1011;
  return <retval>;

gcc/testsuite/ChangeLog:

* gcc.dg/pr93210.c: Adapt regex in order to also support a
RESULT_DECL as an operand for a RETURN_EXPR.

Ok for mainline?

---
 gcc/testsuite/gcc.dg/pr93210.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/pr93210.c b/gcc/testsuite/gcc.dg/pr93210.c
index ec4194b6b49..134d32bc505 100644
--- a/gcc/testsuite/gcc.dg/pr93210.c
+++ b/gcc/testsuite/gcc.dg/pr93210.c
@@ -1,7 +1,7 @@
 /* PR tree-optimization/93210 */
 /* { dg-do run } */
 /* { dg-options "-O2 -fdump-tree-optimized" } */
-/* { dg-final { scan-tree-dump-times "return \[0-9]\[0-9a-fA-FxX]*;" 31 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?:return|<retval> =) \[0-9]\[0-9a-fA-FxX]*;" 31 "optimized" } } */
 
 #ifdef __SIZEOF_INT128__
 typedef unsigned __int128 L;
-- 
2.23.0



[PATCH] IBM Z: Add alternative to *movdi_{31, 64} in order to load a DFP zero

2021-04-12 Thread Stefan Schulze Frielinghaus via Gcc-patches
Bootstrapped and regtested on IBM Z.  Ok for mainline?

gcc/ChangeLog:

* config/s390/s390.md ("*movdi_31", "*movdi_64"): Add
  alternative in order to load a DFP zero.
---
 gcc/config/s390/s390.md | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/gcc/config/s390/s390.md b/gcc/config/s390/s390.md
index c10f25b2472..7faf775fbf2 100644
--- a/gcc/config/s390/s390.md
+++ b/gcc/config/s390/s390.md
@@ -1868,9 +1868,9 @@
 
 (define_insn "*movdi_64"
   [(set (match_operand:DI 0 "nonimmediate_operand"
- "=d,d,d,d,d, d,d,
d,f,d,d,d,d,d,T,!*f,!*f,!*f,!R,!T,b,Q,d,t,Q,t,v,v,v,d,v,R,d")
+ "=d,d,d,d,d, d,d,
d,f,d,!*f,d,d,d,d,T,!*f,!*f,!*f,!R,!T,b,Q,d,t,Q,t,v,v,v,d,v,R,d")
 (match_operand:DI 1 "general_operand"
- " K,N0HD0,N1HD0,N2HD0,N3HD0,Os,N0SD0,N1SD0,d,f,L,b,d,T,d, *f,  R,  
T,*f,*f,d,K,t,d,t,Q,K,v,d,v,R,v,ZL"))]
+ " K,N0HD0,N1HD0,N2HD0,N3HD0,Os,N0SD0,N1SD0,d,f,j00,L,b,d,T,d, *f,  R, 
 T,*f,*f,d,K,t,d,t,Q,K,v,d,v,R,v,ZL"))]
   "TARGET_ZARCH"
   "@
lghi\t%0,%h1
@@ -1883,6 +1883,7 @@
llilf\t%0,%k1
ldgr\t%0,%1
lgdr\t%0,%1
+   lzdr\t%0
lay\t%0,%a1
lgrl\t%0,%1
lgr\t%0,%1
@@ -1906,13 +1907,13 @@
vleg\t%v0,%1,0
vsteg\t%v1,%0,0
larl\t%0,%1"
-  [(set_attr "op_type" "RI,RI,RI,RI,RI,RIL,RIL,RIL,RRE,RRE,RXY,RIL,RRE,RXY,
+  [(set_attr "op_type" "RI,RI,RI,RI,RI,RIL,RIL,RIL,RRE,RRE,RRE,RXY,RIL,RRE,RXY,
 RXY,RR,RX,RXY,RX,RXY,RIL,SIL,*,*,RS,RS,VRI,VRR,VRS,VRS,
 VRX,VRX,RIL")
-   (set_attr "type" "*,*,*,*,*,*,*,*,floaddf,floaddf,la,larl,lr,load,store,
+   (set_attr "type" 
"*,*,*,*,*,*,*,*,floaddf,floaddf,fsimpdf,la,larl,lr,load,store,
  floaddf,floaddf,floaddf,fstoredf,fstoredf,larl,*,*,*,*,
  *,*,*,*,*,*,*,larl")
-   (set_attr "cpu_facility" "*,*,*,*,*,extimm,extimm,extimm,dfp,dfp,longdisp,
+   (set_attr "cpu_facility" "*,*,*,*,*,extimm,extimm,extimm,dfp,dfp,*,longdisp,
  z10,*,*,*,*,*,longdisp,*,longdisp,
  z10,z10,*,*,*,*,vx,vx,vx,vx,vx,vx,*")
(set_attr "z10prop" "z10_fwd_A1,
@@ -1925,6 +1926,7 @@
 z10_fwd_E1,
 *,
 *,
+   *,
 z10_fwd_A1,
 z10_fwd_A3,
 z10_fr_E1,
@@ -1942,7 +1944,7 @@
 *,
 *,*,*,*,*,*,*,
 z10_super_A1")
-   (set_attr "relative_long" "*,*,*,*,*,*,*,*,*,*,
+   (set_attr "relative_long" "*,*,*,*,*,*,*,*,*,*,*,
   *,yes,*,*,*,*,*,*,*,*,
   yes,*,*,*,*,*,*,*,*,*,
   *,*,yes")
@@ -2002,9 +2004,9 @@
 
 (define_insn "*movdi_31"
   [(set (match_operand:DI 0 "nonimmediate_operand"
-"=d,d,Q,S,d  ,o,!*f,!*f,!*f,!R,!T,d")
+"=d,d,Q,S,d  ,o,!*f,!*f,!*f,!*f,!R,!T,d")
 (match_operand:DI 1 "general_operand"
-" Q,S,d,d,dPT,d, *f,  R,  T,*f,*f,b"))]
+" Q,S,d,d,dPT,d, *f,  R,  T,j00,*f,*f,b"))]
   "!TARGET_ZARCH"
   "@
lm\t%0,%N0,%S1
@@ -2016,12 +2018,13 @@
ldr\t%0,%1
ld\t%0,%1
ldy\t%0,%1
+   lzdr\t%0
std\t%1,%0
stdy\t%1,%0
#"
-  [(set_attr "op_type" "RS,RSY,RS,RSY,*,*,RR,RX,RXY,RX,RXY,*")
-   (set_attr "type" 
"lm,lm,stm,stm,*,*,floaddf,floaddf,floaddf,fstoredf,fstoredf,*")
-   (set_attr "cpu_facility" 
"*,longdisp,*,longdisp,*,*,*,*,longdisp,*,longdisp,z10")])
+  [(set_attr "op_type" "RS,RSY,RS,RSY,*,*,RR,RX,RXY,RRE,RX,RXY,*")
+   (set_attr "type" 
"lm,lm,stm,stm,*,*,floaddf,floaddf,floaddf,fsimpdf,fstoredf,fstoredf,*")
+   (set_attr "cpu_facility" 
"*,longdisp,*,longdisp,*,*,*,*,longdisp,*,*,longdisp,z10")])
 
 ; For a load from a symbol ref we can use one of the target registers
 ; together with larl to load the address.
-- 
2.23.0



Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-04-08 Thread Stefan Schulze Frielinghaus via Gcc-patches
ping

On Tue, Mar 16, 2021 at 06:13:21PM +0100, Stefan Schulze Frielinghaus wrote:
> [snip]
> 
> Please find attached a new version of the patch.  A major change compared
> to the previous patch is that the detection now lives in a separate pass,
> which hopefully makes reviewing easier since it is almost self-contained.
> After realizing that detecting loops which mimic the behavior of
> rawmemchr/strlen functions does not really fit into the topic of loop
> distribution, a separate pass seemed the better fit.  Due to this I was
> also able to play around a bit and schedule the pass at different times.
> Currently it is scheduled right before loop distribution, where loop
> header copying has already taken place, which leads to the following
> effect.  Running this setup over
> 
> char *t (char *p)
> {
>   for (; *p; ++p);
>   return p;
> }
> 
> the new pass transforms
> 
> char * t (char * p)
> {
>   char _1;
>   char _7;
> 
>[local count: 118111600]:
>   _7 = *p_3(D);
>   if (_7 != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
> 
>[local count: 105119324]:
> 
>[local count: 955630225]:
>   # p_8 = PHI 
>   p_6 = p_8 + 1;
>   _1 = *p_6;
>   if (_1 != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
> 
>[local count: 105119324]:
>   # p_2 = PHI 
>   goto ; [100.00%]
> 
>[local count: 850510901]:
>   goto ; [100.00%]
> 
>[local count: 12992276]:
> 
>[local count: 118111600]:
>   # p_9 = PHI 
>   return p_9;
> 
> }
> 
> into
> 
> char * t (char * p)
> {
>   char * _5;
>   char _7;
> 
>[local count: 118111600]:
>   _7 = *p_3(D);
>   if (_7 != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
> 
>[local count: 105119324]:
>   _5 = p_3(D) + 1;
>   p_10 = .RAWMEMCHR (_5, 0);
> 
>[local count: 118111600]:
>   # p_9 = PHI 
>   return p_9;
> 
> }
> 
> which is fine so far.  However, I haven't made up my mind yet whether it is
> worthwhile to spend more time in order to also eliminate the "first unrolling"
> of the loop.  I gave it a shot by scheduling the pass prior to the copy
> header pass and ended up with:
> 
> char * t (char * p)
> {
>[local count: 118111600]:
>   p_5 = .RAWMEMCHR (p_3(D), 0);
>   return p_5;
> 
> }
> 
> which seems optimal to me.  The downside of this is that I have to initialize
> scalar evolution analysis, which might be undesired that early.
> 
> All this brings me to the question: where do you see this piece of code
> running?
> If in a separate pass when would you schedule it?  If in an existing pass,
> which one would you choose?
> 
> Another topic which came up is whether there exists a more elegant solution to
> my current implementation in order to deal with stores (I'm speaking of the 
> `if
> (store_dr)` statement inside of function transform_loop_1).  For example,
> 
> extern char *p;
> char *t ()
> {
>   for (; *p; ++p);
>   return p;
> }
> 
> ends up as
> 
> char * t ()
> {
>   char * _1;
>   char * _2;
>   char _3;
>   char * p.1_8;
>   char _9;
>   char * p.1_10;
>   char * p.1_11;
> 
>[local count: 118111600]:
>   p.1_8 = p;
>   _9 = *p.1_8;
>   if (_9 != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
> 
>[local count: 105119324]:
> 
>[local count: 955630225]:
>   # p.1_10 = PHI <_1(6), p.1_8(5)>
>   _1 = p.1_10 + 1;
>   p = _1;
>   _3 = *_1;
>   if (_3 != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
> 
>[local count: 105119324]:
>   # _2 = PHI <_1(3)>
>   goto ; [100.00%]
> 
>[local count: 850510901]:
>   goto ; [100.00%]
> 
>[local count: 12992276]:
> 
>[local count: 118111600]:
>   # p.1_11 = PHI <_2(8), p.1_8(7)>
>   return p.1_11;
> 
> }
> 
> where inside the loop a load and store occur.  For a rawmemchr-like loop I
> have to show that we never load from a memory location to which we write.
> Currently I solve this by hard coding those facts, which is not generic at
> all.  I gave compute_data_dependences_for_loop a try, which failed to
> determine the fact that stores only happen to p[0] and loads from p[i]
> where i>0.  Maybe there are more generic solutions to express this in
> contrast to my current one?
> 
> Thanks again for your input so far.  Really appreciated.
> 
> Cheers,
> Stefan

> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index 8a5fb3fd99c..7b2d7405277 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1608,6 +1608,7 @@ OBJS = \
>   tree-into-ssa.o \
>   tree-iterator.o \
>   tree-loop-distribution.o \
> + tree-loop-pattern.o \
>   tree-nested.o \
>   tree-nrv.o \
>   tree-object-size.o \
> diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
> index dd7173126fb..957e96a46a4 100644
> --- a/gcc/internal-fn.c
> +++ b/gcc/internal-fn.c
> @@ -2917,6 +2917,33 @@ expand_VEC_CONVERT (internal_fn, gcall *)
>gcc_unreachable ();
>  }
>  
> +void
> +expand_RAWMEMCHR (internal_fn, gcall *stmt)
> +{
> +  expand_operand ops[3];
> +
> +  tree lhs = gimple_call_lhs (stmt);
> +  if (!lhs)
> +return;
> +  tree lhs_type = TREE_TYPE 

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-03-16 Thread Stefan Schulze Frielinghaus via Gcc-patches
[snip]

Please find attached a new version of the patch.  A major change compared to
the previous patch is that the detection now lives in a separate pass, which
hopefully makes reviewing easier since it is almost self-contained.  After
realizing that detecting loops which mimic the behavior of rawmemchr/strlen
functions does not really fit into the topic of loop distribution, a separate
pass seemed the better fit.  Due to this I was also able to play around a bit
and schedule the pass at different times.  Currently it is scheduled right
before loop distribution, where loop header copying has already taken place,
which leads to the following effect.  Running this setup over

char *t (char *p)
{
  for (; *p; ++p);
  return p;
}

the new pass transforms

char * t (char * p)
{
  char _1;
  char _7;

   [local count: 118111600]:
  _7 = *p_3(D);
  if (_7 != 0)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 105119324]:

   [local count: 955630225]:
  # p_8 = PHI 
  p_6 = p_8 + 1;
  _1 = *p_6;
  if (_1 != 0)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 105119324]:
  # p_2 = PHI 
  goto ; [100.00%]

   [local count: 850510901]:
  goto ; [100.00%]

   [local count: 12992276]:

   [local count: 118111600]:
  # p_9 = PHI 
  return p_9;

}

into

char * t (char * p)
{
  char * _5;
  char _7;

   [local count: 118111600]:
  _7 = *p_3(D);
  if (_7 != 0)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 105119324]:
  _5 = p_3(D) + 1;
  p_10 = .RAWMEMCHR (_5, 0);

   [local count: 118111600]:
  # p_9 = PHI 
  return p_9;

}

which is fine so far.  However, I haven't made up my mind yet whether it is
worthwhile to spend more time in order to also eliminate the "first unrolling"
of the loop.  I gave it a shot by scheduling the pass prior to the copy header
pass and ended up with:

char * t (char * p)
{
   [local count: 118111600]:
  p_5 = .RAWMEMCHR (p_3(D), 0);
  return p_5;

}

which seems optimal to me.  The downside of this is that I have to initialize
scalar evolution analysis, which might be undesired that early.

All this brings me to the question: where do you see this piece of code running?
If in a separate pass when would you schedule it?  If in an existing pass,
which one would you choose?

Another topic which came up is whether there exists a more elegant solution to
my current implementation in order to deal with stores (I'm speaking of the `if
(store_dr)` statement inside of function transform_loop_1).  For example,

extern char *p;
char *t ()
{
  for (; *p; ++p);
  return p;
}

ends up as

char * t ()
{
  char * _1;
  char * _2;
  char _3;
  char * p.1_8;
  char _9;
  char * p.1_10;
  char * p.1_11;

   [local count: 118111600]:
  p.1_8 = p;
  _9 = *p.1_8;
  if (_9 != 0)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 105119324]:

   [local count: 955630225]:
  # p.1_10 = PHI <_1(6), p.1_8(5)>
  _1 = p.1_10 + 1;
  p = _1;
  _3 = *_1;
  if (_3 != 0)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 105119324]:
  # _2 = PHI <_1(3)>
  goto ; [100.00%]

   [local count: 850510901]:
  goto ; [100.00%]

   [local count: 12992276]:

   [local count: 118111600]:
  # p.1_11 = PHI <_2(8), p.1_8(7)>
  return p.1_11;

}

where inside the loop a load and store occur.  For a rawmemchr-like loop I
have to show that we never load from a memory location to which we write.
Currently I solve this by hard coding those facts, which is not generic at all.
I gave compute_data_dependences_for_loop a try, which failed to determine the
fact that stores only happen to p[0] and loads from p[i] where i>0.  Maybe
there are more generic solutions to express this in contrast to my current one?

Thanks again for your input so far.  Really appreciated.

Cheers,
Stefan
diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 8a5fb3fd99c..7b2d7405277 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1608,6 +1608,7 @@ OBJS = \
tree-into-ssa.o \
tree-iterator.o \
tree-loop-distribution.o \
+   tree-loop-pattern.o \
tree-nested.o \
tree-nrv.o \
tree-object-size.o \
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index dd7173126fb..957e96a46a4 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -2917,6 +2917,33 @@ expand_VEC_CONVERT (internal_fn, gcall *)
   gcc_unreachable ();
 }
 
+void
+expand_RAWMEMCHR (internal_fn, gcall *stmt)
+{
+  expand_operand ops[3];
+
+  tree lhs = gimple_call_lhs (stmt);
+  if (!lhs)
+return;
+  tree lhs_type = TREE_TYPE (lhs);
+  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (lhs_type));
+
+  for (unsigned int i = 0; i < 2; ++i)
+{
+  tree rhs = gimple_call_arg (stmt, i);
+  tree rhs_type = TREE_TYPE (rhs);
+  rtx rhs_rtx = expand_normal (rhs);
+  create_input_operand (&ops[i + 1], rhs_rtx, TYPE_MODE (rhs_type));
+}
+
+  insn_code icode = direct_optab_handler (rawmemchr_optab, ops[2].mode);
+
+  expand_insn (icode, 

[PATCH] cprop_hardreg: Ensure replacement reg has compatible mode [PR99221]

2021-03-12 Thread Stefan Schulze Frielinghaus via Gcc-patches
In addition to the existing check, also ask the target whether a
replacement register may be accessed in a different mode than the one
it was set in before.
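
For illustration (registers and modes made up): given

  (set (reg:SI r9) (reg:SI r11))
  ...
  (set (...) (reg:DI r9))

replacing the DImode use of r9 by r11 is only safe if the target also
allows a value set in SImode to be read from that register in DImode,
which is what the new REG_CAN_CHANGE_MODE_P check asks.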

Bootstrapped and regtested on IBM Z.  Ok for mainline?

gcc/ChangeLog:

* regcprop.c (find_oldest_value_reg): Ask target whether
  different mode is fine for replacement register.
---
 gcc/regcprop.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/regcprop.c b/gcc/regcprop.c
index e1342f56bd1..02753a12510 100644
--- a/gcc/regcprop.c
+++ b/gcc/regcprop.c
@@ -474,7 +474,8 @@ find_oldest_value_reg (enum reg_class cl, rtx reg, struct 
value_data *vd)
(set (...) (reg:DI r9))
  Replacing r9 with r11 is invalid.  */
   if (mode != vd->e[regno].mode
-  && REG_NREGS (reg) > hard_regno_nregs (regno, vd->e[regno].mode))
+  && (REG_NREGS (reg) > hard_regno_nregs (regno, vd->e[regno].mode)
+ || !REG_CAN_CHANGE_MODE_P (regno, mode, vd->e[regno].mode)))
 return NULL_RTX;
 
   for (i = vd->e[regno].oldest_regno; i != regno; i = vd->e[i].next_regno)
-- 
2.23.0



Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-03-03 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Tue, Mar 02, 2021 at 01:29:59PM +0100, Richard Biener wrote:
> On Sun, Feb 14, 2021 at 11:27 AM Stefan Schulze Frielinghaus
>  wrote:
> >
> > On Tue, Feb 09, 2021 at 09:57:58AM +0100, Richard Biener wrote:
> > > On Mon, Feb 8, 2021 at 3:11 PM Stefan Schulze Frielinghaus via
> > > Gcc-patches  wrote:
> > > >
> > > > This patch adds support for recognizing loops which mimic the behaviour
> > > > of function rawmemchr, and replaces those with an internal function call
> > > > in case a target provides them.  In contrast to the original rawmemchr
> > > > function, this patch also supports different instances where the memory
> > > > pointed to and the pattern are interpreted as 8, 16, and 32 bit sized,
> > > > respectively.
> > > >
> > > > This patch is not final and I'm looking for some feedback:
> > > >
> > > > Previously, only loops which mimic the behaviours of functions memset,
> > > > memcpy, and memmove have been detected and replaced by corresponding
> > > > function calls.  One characteristic of those loops/partitions is that
> > > > they don't have a reduction.  In contrast, loops which mimic the
> > > > behaviour of rawmemchr compute a result and therefore have a reduction.
> > > > My current attempt is to ensure that the reduction statement is not used
> > > > in any other partition and only in that case ignore the reduction and
> > > > replace the loop by a function call.  We then only need to replace the
> > > > reduction variable of the loop which contained the loop result by the
> > > > variable of the lhs of the internal function call.  This should ensure
> > > > that the transformation is correct independently of how partitions are
> > > > fused/distributed in the end.  Any thoughts about this?
> > >
> > > Currently we're forcing reduction partitions last (and force to have a 
> > > single
> > > one by fusing all partitions containing a reduction) because 
> > > code-generation
> > > does not properly update SSA form for the reduction results.  ISTR that
> > > might be just because we do not copy the LC PHI nodes or do not adjust
> > > them when copying.  That might not be an issue in case you replace the
> > > partition with a call.  I guess you can try to have a testcase with
> > > two rawmemchr patterns and a regular loop part that has to be scheduled
> > > inbetween both for correctness.
> >
> > Ah ok, in that case I updated my patch by removing the constraint that
> > the reduction statement must be in precisely one partition.  Please find
> > attached the testcases I came up with so far.  Since transforming a loop
> > into a rawmemchr function call is backend dependent, I planned to include
> > those only in my backend patch.  I wasn't able to come up with any
> > testcase where a loop is distributed into multiple partitions and where
> > one is classified as a rawmemchr builtin.  The latter boils down to a
> > for loop with an empty body only, in which case I suspect that loop
> > distribution shouldn't be done anyway.
> >
> > > > Furthermore, I simply added two new members (pattern, fn) to structure
> > > > builtin_info which I consider rather hacky.  For the long run I thought
> > > > about to split up structure builtin_info into a union where each member
> > > > is a structure for a particular builtin of a partition, i.e., something
> > > > like this:
> > > >
> > > > union builtin_info
> > > > {
> > > >   struct binfo_memset *memset;
> > > >   struct binfo_memcpymove *memcpymove;
> > > >   struct binfo_rawmemchr *rawmemchr;
> > > > };
> > > >
> > > > Such that a structure for one builtin does not get "polluted" by a
> > > > different one.  Any thoughts about this?
> > >
> > > Probably makes sense if the list of recognized patterns grow further.
> > >
> > > I see you use internal functions rather than builtin functions.  I guess
> > > that's OK.  But you use new target hooks for expansion where I think
> > > new optab entries similar to cmpmem would be more appropriate
> > > where the distinction between 8, 16 or 32 bits can be encoded in
> > > the modes.
> >
> > The optab implementation is really nice: it allows me to use iterators
> > in the backend, which in the end saves me some boilerplate code compared
> > to the previous implementation :)

Re: [PATCH] IBM Z: Fix testcase vcond-shift.c

2021-03-01 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Tue, Mar 02, 2021 at 08:08:14AM +0100, Andreas Krebbel wrote:
> On 3/1/21 5:00 PM, Stefan Schulze Frielinghaus wrote:
> > As of commit 3a6e3ad38a17a03ee0139b49a0946e7b9ded1eb1 expressions
> > x CMP y ? -1 : 0 are folded into x CMP y.  Due to this we no longer see
> > shifts after expand in our testcases but comparisons.  Thus replace
> > vesraX instructions by the corresponding vchX instructions.  Keep testcases
> > vchX_{lt,gt} where only a relational comparison is done and no shift in
> > order to keep test coverage for vectorization.
> 
> The vcond-shift optimization verified by the testcase is currently 
> implemented in s390_expand_vcond
> but due to the common code change we go the vec_cmp route now. So we probably 
> should do the same
> also in s390_expand_vec_compare now. Perhaps like this ... it appears to fix 
> the testcase for me:
> 
> diff --git a/gcc/config/s390/s390.c b/gcc/config/s390/s390.c
> index 9d2cee950d0b..9d9f5a0f6f4e 100644
> --- a/gcc/config/s390/s390.c
> +++ b/gcc/config/s390/s390.c
> @@ -6562,6 +6562,7 @@ s390_expand_vec_compare (rtx target, enum rtx_code cond,
> 
>if (GET_MODE_CLASS (GET_MODE (cmp_op1)) == MODE_VECTOR_FLOAT)
>  {
> +  cmp_op2 = force_operand (cmp_op2, 0);
>switch (cond)
> {
>   /* NE a != b -> !(a == b) */
> @@ -6600,6 +6601,19 @@ s390_expand_vec_compare (rtx target, enum rtx_code 
> cond,
>  }
>else
>  {
> > +  /* Turn x < 0 into x >> (bits per element - 1)  */
> +  if (cond == LT && cmp_op2 == CONST0_RTX (mode))
> +   {
> + int shift = GET_MODE_BITSIZE (GET_MODE_INNER (mode)) - 1;
> + rtx res = expand_simple_binop (mode, ASHIFTRT, cmp_op1,
> +GEN_INT (shift), target,
> +0, OPTAB_DIRECT);
> + if (res != target)
> +   emit_move_insn (target, res);
> + return;
> +   }
> +  cmp_op2 = force_operand (cmp_op2, 0);
> +
>switch (cond)
> {
>   /* NE: a != b -> !(a == b) */
> diff --git a/gcc/config/s390/vector.md b/gcc/config/s390/vector.md
> index bc52211c55e5..c80d582a300d 100644
> --- a/gcc/config/s390/vector.md
> +++ b/gcc/config/s390/vector.md
> @@ -1589,7 +1589,7 @@
>[(set (match_operand:  0 "register_operand" "")
> (match_operator: 1 "vcond_comparison_operator"
>   [(match_operand:V_HW 2 "register_operand" "")
> -  (match_operand:V_HW 3 "register_operand" "")]))]
> +  (match_operand:V_HW 3 "nonmemory_operand" "")]))]
>"TARGET_VX"
>  {
>s390_expand_vec_compare (operands[0], GET_CODE(operands[1]), operands[2], 
> operands[3]);

Sounds great to me.  Also eliminates the extra vzero :)

Cheers,
Stefan

> 
> Andreas
> 
> 
> > 
> > gcc/testsuite/ChangeLog:
> > 
> > * gcc.target/s390/vector/vcond-shift.c: Replace vesraX
> > instructions by corresponding vchX instructions.
> > ---
> >  .../gcc.target/s390/vector/vcond-shift.c  | 31 ++-
> >  1 file changed, 17 insertions(+), 14 deletions(-)
> > 
> > diff --git a/gcc/testsuite/gcc.target/s390/vector/vcond-shift.c 
> > b/gcc/testsuite/gcc.target/s390/vector/vcond-shift.c
> > index a6b4e97aa50..9e472aef960 100644
> > --- a/gcc/testsuite/gcc.target/s390/vector/vcond-shift.c
> > +++ b/gcc/testsuite/gcc.target/s390/vector/vcond-shift.c
> > @@ -3,10 +3,13 @@
> >  /* { dg-do compile { target { s390*-*-* } } } */
> >  /* { dg-options "-O3 -march=z13 -mzarch" } */
> >  
> > -/* { dg-final { scan-assembler-times "vesraf\t%v.?,%v.?,31" 6 } } */
> > -/* { dg-final { scan-assembler-times "vesrah\t%v.?,%v.?,15" 6 } } */
> > -/* { dg-final { scan-assembler-times "vesrab\t%v.?,%v.?,7" 6 } } */
> > -/* { dg-final { scan-assembler-not "vzero\t*" } } */
> > +/* { dg-final { scan-assembler-times "vzero\t" 9 } } */
> > +/* { dg-final { scan-assembler-times "vchf\t" 6 } } */
> > +/* { dg-final { scan-assembler-times "vesraf\t%v.?,%v.?,1" 2 } } */
> > +/* { dg-final { scan-assembler-times "vchh\t" 6 } } */
> > +/* { dg-final { scan-assembler-times "vesrah\t%v.?,%v.?,1" 2 } } */
> > +/* { dg-final { scan-assembler-times "vchb\t" 6 } } */
> > +/* { dg-final { scan-assembler-times "vesrab\t%v.?,%v.?,1" 2 } } */
> >  /* { dg-final { scan-assembler-times "vesrlf\t%v.?,%v.?,31" 4 } } */
> >  /* { dg-final { scan-assembler-times "vesrlh\t%v.?,%v.?,15" 4 } } */
> >  /* { dg-final { scan-assembler-times "vesrlb\t%v.?,%v.?,7" 4 } } */
> > @@ -15,19 +18,19 @@
> >  #define ITER(X) (2 * (16 / sizeof (X[1])))
> >  
> >  void
> > -vesraf_div (int *x)
> > +vchf_vesraf_div (int *x)
> >  {
> >int i;
> >int *xx = __builtin_assume_aligned (x, 8);
> >  
> >/* Should expand to (xx + (xx < 0 ? 1 : 0)) >> 1
> > - which in turn should get simplified to (xx + (xx >> 31)) >> 1.  */
> > + which in turn should get simplified to (xx - (xx < 0)) >> 1.  */
> >for (i = 0; i < ITER (xx); i++)
> >  xx[i] = xx[i] / 2;
> >  }
> >  
> >  void
> > -vesrah_div (short *x)
> > 

[PATCH] IBM Z: Fix testcase vcond-shift.c

2021-03-01 Thread Stefan Schulze Frielinghaus via Gcc-patches
As of commit 3a6e3ad38a17a03ee0139b49a0946e7b9ded1eb1 expressions
x CMP y ? -1 : 0 are folded into x CMP y.  Due to this we no longer see
shifts after expand in our testcases but comparisons.  Thus replace
vesraX instructions by the corresponding vchX instructions.  Keep testcases
vchX_{lt,gt} where only a relational comparison is done and no shift in
order to keep test coverage for vectorization.
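
To illustrate with made-up operands: a test like xx[i] < 0 now survives
expand as the comparison 0 > xx[i], i.e. roughly

    vzero   %v0
    vchf    %v1,%v0,%v2    # v1 = (0 > v2) ? -1 : 0, per element

where it previously became vesraf %v1,%v2,31.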

gcc/testsuite/ChangeLog:

* gcc.target/s390/vector/vcond-shift.c: Replace vesraX
instructions by corresponding vchX instructions.
---
 .../gcc.target/s390/vector/vcond-shift.c  | 31 ++-
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/gcc/testsuite/gcc.target/s390/vector/vcond-shift.c 
b/gcc/testsuite/gcc.target/s390/vector/vcond-shift.c
index a6b4e97aa50..9e472aef960 100644
--- a/gcc/testsuite/gcc.target/s390/vector/vcond-shift.c
+++ b/gcc/testsuite/gcc.target/s390/vector/vcond-shift.c
@@ -3,10 +3,13 @@
 /* { dg-do compile { target { s390*-*-* } } } */
 /* { dg-options "-O3 -march=z13 -mzarch" } */
 
-/* { dg-final { scan-assembler-times "vesraf\t%v.?,%v.?,31" 6 } } */
-/* { dg-final { scan-assembler-times "vesrah\t%v.?,%v.?,15" 6 } } */
-/* { dg-final { scan-assembler-times "vesrab\t%v.?,%v.?,7" 6 } } */
-/* { dg-final { scan-assembler-not "vzero\t*" } } */
+/* { dg-final { scan-assembler-times "vzero\t" 9 } } */
+/* { dg-final { scan-assembler-times "vchf\t" 6 } } */
+/* { dg-final { scan-assembler-times "vesraf\t%v.?,%v.?,1" 2 } } */
+/* { dg-final { scan-assembler-times "vchh\t" 6 } } */
+/* { dg-final { scan-assembler-times "vesrah\t%v.?,%v.?,1" 2 } } */
+/* { dg-final { scan-assembler-times "vchb\t" 6 } } */
+/* { dg-final { scan-assembler-times "vesrab\t%v.?,%v.?,1" 2 } } */
 /* { dg-final { scan-assembler-times "vesrlf\t%v.?,%v.?,31" 4 } } */
 /* { dg-final { scan-assembler-times "vesrlh\t%v.?,%v.?,15" 4 } } */
 /* { dg-final { scan-assembler-times "vesrlb\t%v.?,%v.?,7" 4 } } */
@@ -15,19 +18,19 @@
 #define ITER(X) (2 * (16 / sizeof (X[1])))
 
 void
-vesraf_div (int *x)
+vchf_vesraf_div (int *x)
 {
   int i;
   int *xx = __builtin_assume_aligned (x, 8);
 
   /* Should expand to (xx + (xx < 0 ? 1 : 0)) >> 1
- which in turn should get simplified to (xx + (xx >> 31)) >> 1.  */
+ which in turn should get simplified to (xx - (xx < 0)) >> 1.  */
   for (i = 0; i < ITER (xx); i++)
 xx[i] = xx[i] / 2;
 }
 
 void
-vesrah_div (short *x)
+vchh_vesrah_div (short *x)
 {
   int i;
   short *xx = __builtin_assume_aligned (x, 8);
@@ -38,7 +41,7 @@ vesrah_div (short *x)
 
 
 void
-vesrab_div (signed char *x)
+vchb_vesrab_div (signed char *x)
 {
   int i;
   signed char *xx = __builtin_assume_aligned (x, 8);
@@ -50,7 +53,7 @@ vesrab_div (signed char *x)
 
 
 int
-vesraf_lt (int *x)
+vchf_lt (int *x)
 {
   int i;
   int *xx = __builtin_assume_aligned (x, 8);
@@ -60,7 +63,7 @@ vesraf_lt (int *x)
 }
 
 int
-vesrah_lt (short *x)
+vchh_lt (short *x)
 {
   int i;
   short *xx = __builtin_assume_aligned (x, 8);
@@ -70,7 +73,7 @@ vesrah_lt (short *x)
 }
 
 int
-vesrab_lt (signed char *x)
+vchb_lt (signed char *x)
 {
   int i;
   signed char *xx = __builtin_assume_aligned (x, 8);
@@ -82,7 +85,7 @@ vesrab_lt (signed char *x)
 
 
 int
-vesraf_ge (int *x)
+vchf_ge (int *x)
 {
   int i;
   int *xx = __builtin_assume_aligned (x, 8);
@@ -92,7 +95,7 @@ vesraf_ge (int *x)
 }
 
 int
-vesrah_ge (short *x)
+vchh_ge (short *x)
 {
   int i;
   short *xx = __builtin_assume_aligned (x, 8);
@@ -102,7 +105,7 @@ vesrah_ge (short *x)
 }
 
 int
-vesrab_ge (signed char *x)
+vchb_ge (signed char *x)
 {
   int i;
   signed char *xx = __builtin_assume_aligned (x, 8);
-- 
2.23.0



Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-02-25 Thread Stefan Schulze Frielinghaus via Gcc-patches
Ping

On Sun, Feb 14, 2021 at 11:27:40AM +0100, Stefan Schulze Frielinghaus wrote:
> On Tue, Feb 09, 2021 at 09:57:58AM +0100, Richard Biener wrote:
> > On Mon, Feb 8, 2021 at 3:11 PM Stefan Schulze Frielinghaus via
> > Gcc-patches  wrote:
> > >
> > > This patch adds support for recognizing loops which mimic the behaviour
> > > of function rawmemchr, and replaces those with an internal function call
> > > in case a target provides them.  In contrast to the original rawmemchr
> > > function, this patch also supports different instances where the memory
> > > pointed to and the pattern are interpreted as 8, 16, and 32 bit sized,
> > > respectively.
> > >
> > > This patch is not final and I'm looking for some feedback:
> > >
> > > Previously, only loops which mimic the behaviours of functions memset,
> > > memcpy, and memmove have been detected and replaced by corresponding
> > > function calls.  One characteristic of those loops/partitions is that
> > > they don't have a reduction.  In contrast, loops which mimic the
> > > behaviour of rawmemchr compute a result and therefore have a reduction.
> > > My current attempt is to ensure that the reduction statement is not used
> > > in any other partition and only in that case ignore the reduction and
> > > replace the loop by a function call.  We then only need to replace the
> > > reduction variable of the loop which contained the loop result by the
> > > variable of the lhs of the internal function call.  This should ensure
> > > that the transformation is correct independently of how partitions are
> > > fused/distributed in the end.  Any thoughts about this?
> > 
> > Currently we're forcing reduction partitions last (and force to have a 
> > single
> > one by fusing all partitions containing a reduction) because code-generation
> > does not properly update SSA form for the reduction results.  ISTR that
> > might be just because we do not copy the LC PHI nodes or do not adjust
> > them when copying.  That might not be an issue in case you replace the
> > partition with a call.  I guess you can try to have a testcase with
> > two rawmemchr patterns and a regular loop part that has to be scheduled
> > inbetween both for correctness.
> 
> Ah ok, in that case I updated my patch by removing the constraint that
> the reduction statement must be in precisely one partition.  Please find
> attached the testcases I came up with so far.  Since transforming a loop
> into a rawmemchr function call is backend dependent, I planned to include
> those only in my backend patch.  I wasn't able to come up with any
> testcase where a loop is distributed into multiple partitions and where
> one is classified as a rawmemchr builtin.  The latter boils down to a
> for loop with an empty body only, in which case I suspect that loop
> distribution shouldn't be done anyway.
> 
> > > Furthermore, I simply added two new members (pattern, fn) to structure
> > > builtin_info which I consider rather hacky.  For the long run I thought
> > > about to split up structure builtin_info into a union where each member
> > > is a structure for a particular builtin of a partition, i.e., something
> > > like this:
> > >
> > > union builtin_info
> > > {
> > >   struct binfo_memset *memset;
> > >   struct binfo_memcpymove *memcpymove;
> > >   struct binfo_rawmemchr *rawmemchr;
> > > };
> > >
> > > Such that a structure for one builtin does not get "polluted" by a
> > > different one.  Any thoughts about this?
> > 
> > Probably makes sense if the list of recognized patterns grow further.
> > 
> > I see you use internal functions rather than builtin functions.  I guess
> > that's OK.  But you use new target hooks for expansion where I think
> > new optab entries similar to cmpmem would be more appropriate
> > where the distinction between 8, 16 or 32 bits can be encoded in
> > the modes.
> 
> The optab implementation is really nice: it allows me to use iterators
> in the backend, which in the end saves me some boilerplate code compared
> to the previous implementation :)
> 
> While using optabs now, I only require one additional member (pattern)
> in the builtin_info struct.  Thus I didn't want to overcomplicate things
> and kept the single struct approach as is.
> 
> For the long run, should I resubmit this patch once stage 1 opens or how
> would you propose to proceed?
> 
> Thanks for your review so far!
> 
> Cheers,
> Stefan
> 
> > 

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-02-14 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Tue, Feb 09, 2021 at 09:57:58AM +0100, Richard Biener wrote:
> On Mon, Feb 8, 2021 at 3:11 PM Stefan Schulze Frielinghaus via
> Gcc-patches  wrote:
> >
> > This patch adds support for recognizing loops which mimic the behaviour
> > of function rawmemchr, and replaces those with an internal function call
> > in case a target provides them.  In contrast to the original rawmemchr
> > function, this patch also supports different instances where the memory
> > pointed to and the pattern are interpreted as 8, 16, and 32 bit sized,
> > respectively.
> >
> > This patch is not final and I'm looking for some feedback:
> >
> > Previously, only loops which mimic the behaviours of functions memset,
> > memcpy, and memmove have been detected and replaced by corresponding
> > function calls.  One characteristic of those loops/partitions is that
> > they don't have a reduction.  In contrast, loops which mimic the
> > behaviour of rawmemchr compute a result and therefore have a reduction.
> > My current attempt is to ensure that the reduction statement is not used
> > in any other partition and only in that case ignore the reduction and
> > replace the loop by a function call.  We then only need to replace the
> > reduction variable of the loop which contained the loop result by the
> > variable of the lhs of the internal function call.  This should ensure
> > that the transformation is correct independently of how partitions are
> > fused/distributed in the end.  Any thoughts about this?
> 
> Currently we're forcing reduction partitions last (and force to have a single
> one by fusing all partitions containing a reduction) because code-generation
> does not properly update SSA form for the reduction results.  ISTR that
> might be just because we do not copy the LC PHI nodes or do not adjust
> them when copying.  That might not be an issue in case you replace the
> partition with a call.  I guess you can try to have a testcase with
> two rawmemchr patterns and a regular loop part that has to be scheduled
> inbetween both for correctness.

Ah ok, in that case I updated my patch by removing the constraint that
the reduction statement must be in precisely one partition.  Please find
attached the testcases I came up with so far.  Since transforming a loop into
a rawmemchr function call is backend dependent, I planned to include
those only in my backend patch.  I wasn't able to come up with any
testcase where a loop is distributed into multiple partitions and where
one is classified as a rawmemchr builtin.  The latter boils down to a
for loop with an empty body only, in which case I suspect that loop
distribution shouldn't be done anyway.

> > Furthermore, I simply added two new members (pattern, fn) to structure
> > builtin_info which I consider rather hacky.  For the long run I thought
> > about to split up structure builtin_info into a union where each member
> > is a structure for a particular builtin of a partition, i.e., something
> > like this:
> >
> > union builtin_info
> > {
> >   struct binfo_memset *memset;
> >   struct binfo_memcpymove *memcpymove;
> >   struct binfo_rawmemchr *rawmemchr;
> > };
> >
> > Such that a structure for one builtin does not get "polluted" by a
> > different one.  Any thoughts about this?
> 
> Probably makes sense if the list of recognized patterns grow further.
> 
> I see you use internal functions rather than builtin functions.  I guess
> that's OK.  But you use new target hooks for expansion where I think
> new optab entries similar to cmpmem would be more appropriate
> where the distinction between 8, 16 or 32 bits can be encoded in
> the modes.

The optab implementation is really nice: it allows me to use iterators
in the backend, which in the end saves me some boilerplate code compared
to the previous implementation :)

While using optabs now, I only require one additional member (pattern)
in the builtin_info struct.  Thus I didn't want to overcomplicate things
and kept the single struct approach as is.

For the long run, should I resubmit this patch once stage 1 opens or how
would you propose to proceed?

Thanks for your review so far!

Cheers,
Stefan

> 
> Richard.
> 
> > Cheers,
> > Stefan
> > ---
> >  gcc/internal-fn.c|  42 ++
> >  gcc/internal-fn.def  |   3 +
> >  gcc/target-insns.def |   3 +
> >  gcc/tree-loop-distribution.c | 257 ++-
> >  4 files changed, 272 insertions(+), 33 deletions(-)
> >
> > diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
> > index dd7173126fb..9cd62544a1a 100644
> > --- a/gcc/internal-fn.c
> > +++ b/gcc/intern

[RFC] ldist: Recognize rawmemchr loop patterns

2021-02-08 Thread Stefan Schulze Frielinghaus via Gcc-patches
This patch adds support for recognizing loops which mimic the behaviour
of function rawmemchr, and replaces those with an internal function call
in case a target provides them.  In contrast to the original rawmemchr
function, this patch also supports different instances where the memory
pointed to and the pattern are interpreted as 8, 16, and 32 bit sized,
respectively.
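
For reference, a minimal (made-up) example of a loop with 16-bit accesses
that this detection targets:

static unsigned short *
rawmemchr16 (unsigned short *s, unsigned short c)
{
  while (*s != c)   /* no bound check; c is assumed to occur in s */
    ++s;
  return s;         /* the final pointer value is the loop's reduction */
}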

This patch is not final and I'm looking for some feedback:

Previously, only loops which mimic the behaviours of functions memset,
memcpy, and memmove have been detected and replaced by corresponding
function calls.  One characteristic of those loops/partitions is that
they don't have a reduction.  In contrast, loops which mimic the
behaviour of rawmemchr compute a result and therefore have a reduction.
My current attempt is to ensure that the reduction statement is not used
in any other partition and only in that case ignore the reduction and
replace the loop by a function call.  We then only need to replace the
reduction variable of the loop which contained the loop result by the
variable of the lhs of the internal function call.  This should ensure
that the transformation is correct independently of how partitions are
fused/distributed in the end.  Any thoughts about this?

Furthermore, I simply added two new members (pattern, fn) to structure
builtin_info which I consider rather hacky.  For the long run I thought
about to split up structure builtin_info into a union where each member
is a structure for a particular builtin of a partition, i.e., something
like this:

union builtin_info
{
  struct binfo_memset *memset;
  struct binfo_memcpymove *memcpymove;
  struct binfo_rawmemchr *rawmemchr;
};

Such that a structure for one builtin does not get "polluted" by a
different one.  Any thoughts about this?

Cheers,
Stefan
---
 gcc/internal-fn.c|  42 ++
 gcc/internal-fn.def  |   3 +
 gcc/target-insns.def |   3 +
 gcc/tree-loop-distribution.c | 257 ++-
 4 files changed, 272 insertions(+), 33 deletions(-)

diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index dd7173126fb..9cd62544a1a 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -2917,6 +2917,48 @@ expand_VEC_CONVERT (internal_fn, gcall *)
   gcc_unreachable ();
 }
 
+static void
+expand_RAWMEMCHR8 (internal_fn, gcall *stmt)
+{
+  if (targetm.have_rawmemchr8 ())
+{
+  rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, 
EXPAND_WRITE);
+  rtx start = expand_normal (gimple_call_arg (stmt, 0));
+  rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
+  emit_insn (targetm.gen_rawmemchr8 (result, start, pattern));
+}
+  else
+gcc_unreachable();
+}
+
+static void
+expand_RAWMEMCHR16 (internal_fn, gcall *stmt)
+{
+  if (targetm.have_rawmemchr16 ())
+{
+  rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, 
EXPAND_WRITE);
+  rtx start = expand_normal (gimple_call_arg (stmt, 0));
+  rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
+  emit_insn (targetm.gen_rawmemchr16 (result, start, pattern));
+}
+  else
+gcc_unreachable();
+}
+
+static void
+expand_RAWMEMCHR32 (internal_fn, gcall *stmt)
+{
+  if (targetm.have_rawmemchr32 ())
+{
+  rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, 
EXPAND_WRITE);
+  rtx start = expand_normal (gimple_call_arg (stmt, 0));
+  rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
+  emit_insn (targetm.gen_rawmemchr32 (result, start, pattern));
+}
+  else
+gcc_unreachable();
+}
+
 /* Expand the IFN_UNIQUE function according to its first argument.  */
 
 static void
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index daeace7a34e..34247859704 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -348,6 +348,9 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | 
ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (RAWMEMCHR8, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (RAWMEMCHR16, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (RAWMEMCHR32, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
 
 /* An unduplicable, uncombinable function.  Generally used to preserve
a CFG property in the face of jump threading, tail merging or
diff --git a/gcc/target-insns.def b/gcc/target-insns.def
index 672c35698d7..9248554cbf3 100644
--- a/gcc/target-insns.def
+++ b/gcc/target-insns.def
@@ -106,3 +106,6 @@ DEF_TARGET_INSN (trap, (void))
 DEF_TARGET_INSN (unique, (void))
 DEF_TARGET_INSN (untyped_call, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (untyped_return, (rtx x0, rtx x1))
+DEF_TARGET_INSN (rawmemchr8, (rtx x0, rtx x1, rtx x2))
+DEF_TARGET_INSN (rawmemchr16, (rtx x0, rtx x1, rtx x2))
+DEF_TARGET_INSN (rawmemchr32, (rtx x0, rtx 

Re: [PATCH] IBM Z: Fix output template for "*vfees"

2020-11-12 Thread Stefan Schulze Frielinghaus via Gcc-patches
As pointed out in
https://gcc.gnu.org/pipermail/gcc-patches/2020-November/558816.html
this instruction pattern will be removed anyway.  Thus we can ignore
this patch.

On Thu, Nov 12, 2020 at 01:25:35PM +0100, Stefan Schulze Frielinghaus wrote:
> Bootstrapped and regtested on IBM Z.  Ok for master?
> 
> gcc/ChangeLog:
> 
>   * config/s390/vx-builtins.md ("*vfees"): Fix output
> template.
> ---
>  gcc/config/s390/vx-builtins.md | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/gcc/config/s390/vx-builtins.md b/gcc/config/s390/vx-builtins.md
> index 010db4d1115..0c2e7170223 100644
> --- a/gcc/config/s390/vx-builtins.md
> +++ b/gcc/config/s390/vx-builtins.md
> @@ -1395,7 +1395,7 @@
>  
>if (flags == VSTRING_FLAG_ZS)
>  return "vfeezs\t%v0,%v1,%v2";
> -  return "vfees\t%v0,%v1,%v2,%b3";
> +  return "vfees\t%v0,%v1,%v2";
>  }
>[(set_attr "op_type" "VRR")])
>  
> -- 
> 2.28.0
> 


Re: [PATCH] IBM Z: Define vec_vfees instruction pattern

2020-11-12 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Thu, Nov 12, 2020 at 02:18:13PM +0100, Andreas Krebbel wrote:
> On 12.11.20 13:21, Stefan Schulze Frielinghaus wrote:
> > Bootstrapped and regtested on IBM Z.  Ok for master?
> > 
> > gcc/ChangeLog:
> > 
> > * config/s390/vector.md ("vec_vfees"): New insn pattern.
> > ---
> >  gcc/config/s390/vector.md | 26 ++
> >  1 file changed, 26 insertions(+)
> > 
> > diff --git a/gcc/config/s390/vector.md b/gcc/config/s390/vector.md
> > index 31d323930b2..4333a2191ae 100644
> > --- a/gcc/config/s390/vector.md
> > +++ b/gcc/config/s390/vector.md
> > @@ -1798,6 +1798,32 @@
> >"vll\t%v0,%1,%2"
> >[(set_attr "op_type" "VRS")])
> >  
> > +; vfeebs, vfeehs, vfeefs
> > +; vfeezbs, vfeezhs, vfeezfs
> > +(define_insn "vec_vfees"
> > +  [(set (match_operand:VI_HW_QHS 0 "register_operand" "=v")
> > +   (unspec:VI_HW_QHS [(match_operand:VI_HW_QHS 1 "register_operand" "v")
> > +  (match_operand:VI_HW_QHS 2 "register_operand" "v")
> > +  (match_operand:QI 3 "const_mask_operand" "C")]
> > + UNSPEC_VEC_VFEE))
> > +   (set (reg:CCRAW CC_REGNUM)
> > +   (unspec:CCRAW [(match_dup 1)
> > +  (match_dup 2)
> > +  (match_dup 3)]
> > + UNSPEC_VEC_VFEECC))]
> > +  "TARGET_VX"
> > +{
> > +  unsigned HOST_WIDE_INT flags = UINTVAL (operands[3]);
> > +
> > +  gcc_assert (!(flags & ~(VSTRING_FLAG_ZS | VSTRING_FLAG_CS)));
> > +  flags &= ~VSTRING_FLAG_CS;
> > +
> > +  if (flags == VSTRING_FLAG_ZS)
> > +return "vfeezs\t%v0,%v1,%v2";
> > +  return "vfees\t%v0,%v1,%v2";
> > +}
> > +  [(set_attr "op_type" "VRR")])
> > +
> >  ; vfenebs, vfenehs, vfenefs
> >  ; vfenezbs, vfenezhs, vfenezfs
> >  (define_insn "vec_vfenes"
> > 
> 
> Since this is mostly a copy of the pattern in vx-builtins.md I think we 
> should remove the other
> version then.
> 
> I also would prefer this to be committed together with the code making use of 
> the expander. So far
> this would be dead code - right?

Ok, I will remove the dead code and commit this change in conjunction
with the user in a different patch.

Thanks,
Stefan


[PATCH] IBM Z: Fix output template for "*vfees"

2020-11-12 Thread Stefan Schulze Frielinghaus via Gcc-patches
Bootstrapped and regtested on IBM Z.  Ok for master?

gcc/ChangeLog:

* config/s390/vx-builtins.md ("*vfees"): Fix output
  template.
---
 gcc/config/s390/vx-builtins.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/s390/vx-builtins.md b/gcc/config/s390/vx-builtins.md
index 010db4d1115..0c2e7170223 100644
--- a/gcc/config/s390/vx-builtins.md
+++ b/gcc/config/s390/vx-builtins.md
@@ -1395,7 +1395,7 @@
 
   if (flags == VSTRING_FLAG_ZS)
 return "vfeezs\t%v0,%v1,%v2";
-  return "vfees\t%v0,%v1,%v2,%b3";
+  return "vfees\t%v0,%v1,%v2";
 }
   [(set_attr "op_type" "VRR")])
 
-- 
2.28.0



[PATCH] IBM Z: Define vec_vfees instruction pattern

2020-11-12 Thread Stefan Schulze Frielinghaus via Gcc-patches
Bootstrapped and regtested on IBM Z.  Ok for master?

gcc/ChangeLog:

* config/s390/vector.md ("vec_vfees"): New insn pattern.
---
 gcc/config/s390/vector.md | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/gcc/config/s390/vector.md b/gcc/config/s390/vector.md
index 31d323930b2..4333a2191ae 100644
--- a/gcc/config/s390/vector.md
+++ b/gcc/config/s390/vector.md
@@ -1798,6 +1798,32 @@
   "vll\t%v0,%1,%2"
   [(set_attr "op_type" "VRS")])
 
+; vfeebs, vfeehs, vfeefs
+; vfeezbs, vfeezhs, vfeezfs
+(define_insn "vec_vfees"
+  [(set (match_operand:VI_HW_QHS 0 "register_operand" "=v")
+   (unspec:VI_HW_QHS [(match_operand:VI_HW_QHS 1 "register_operand" "v")
+  (match_operand:VI_HW_QHS 2 "register_operand" "v")
+  (match_operand:QI 3 "const_mask_operand" "C")]
+ UNSPEC_VEC_VFEE))
+   (set (reg:CCRAW CC_REGNUM)
+   (unspec:CCRAW [(match_dup 1)
+  (match_dup 2)
+  (match_dup 3)]
+ UNSPEC_VEC_VFEECC))]
+  "TARGET_VX"
+{
+  unsigned HOST_WIDE_INT flags = UINTVAL (operands[3]);
+
+  gcc_assert (!(flags & ~(VSTRING_FLAG_ZS | VSTRING_FLAG_CS)));
+  flags &= ~VSTRING_FLAG_CS;
+
+  if (flags == VSTRING_FLAG_ZS)
+return "vfeezs\t%v0,%v1,%v2";
+  return "vfees\t%v0,%v1,%v2";
+}
+  [(set_attr "op_type" "VRR")])
+
 ; vfenebs, vfenehs, vfenefs
 ; vfenezbs, vfenezhs, vfenezfs
 (define_insn "vec_vfenes"
-- 
2.28.0



Re: [PING] [PATCH] S/390: Do not turn maybe-uninitialized warnings into errors

2020-10-30 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Wed, Oct 28, 2020 at 11:34:53AM -0600, Jeff Law wrote:
> 
> On 10/28/20 11:29 AM, Stefan Schulze Frielinghaus wrote:
> > On Wed, Oct 28, 2020 at 08:39:41AM -0600, Jeff Law wrote:
> >> On 10/28/20 3:38 AM, Stefan Schulze Frielinghaus via Gcc-patches wrote:
> >>> On Mon, Oct 05, 2020 at 02:02:57PM +0200, Stefan Schulze Frielinghaus via 
> >>> Gcc-patches wrote:
> >>>> On Tue, Sep 22, 2020 at 02:59:30PM +0200, Andreas Krebbel wrote:
> >>>>> On 15.09.20 17:02, Stefan Schulze Frielinghaus wrote:
> >>>>>> Over the last couple of months quite a few warnings about uninitialized
> >>>>>> variables were raised while building GCC.  A reason why these warnings
> >>>>>> show up on S/390 only is due to the aggressive inlining settings here.
> >>>>>> Some of these warnings (2c832ffedf0, b776bdca932, 2786c0221b6,
> >>>>>> 1657178f59b) could be fixed or in case of a false positive silenced by
> >>>>>> initializing the corresponding variable.  Since the latter reoccurs and
> >>>>>> while bootstrapping such warnings are turned into errors bootstrapping
> >>>>>> fails on S/390 consistently.  Therefore, for the moment do not turn
> >>>>>> those warnings into errors.
> >>>>>>
> >>>>>> config/ChangeLog:
> >>>>>>
> >>>>>>* warnings.m4: Do not turn maybe-uninitialized warnings into 
> >>>>>> errors
> >>>>>>on S/390.
> >>>>>>
> >>>>>> fixincludes/ChangeLog:
> >>>>>>
> >>>>>>* configure: Regenerate.
> >>>>>>
> >>>>>> gcc/ChangeLog:
> >>>>>>
> >>>>>>* configure: Regenerate.
> >>>>>>
> >>>>>> libcc1/ChangeLog:
> >>>>>>
> >>>>>>* configure: Regenerate.
> >>>>>>
> >>>>>> libcpp/ChangeLog:
> >>>>>>
> >>>>>>* configure: Regenerate.
> >>>>>>
> >>>>>> libdecnumber/ChangeLog:
> >>>>>>
> >>>>>>* configure: Regenerate.
> >>>>> That change looks good to me. Could a global reviewer please comment!
> >>>> Ping
> >>> Ping
> >> I think this would be a huge mistake to install.
> > The root cause why those false positives show up on S/390 only seems to
> > be the more aggressive inlining w.r.t. other architectures.  Because of
> > bigger caches and a rather huge function call overhead we greatly
> > benefit from those inlining parameters. Thus:
> >
> > 1) Reverting those parameters would have a negative performance impact.
> >
> > 2) Fixing the maybe-uninitialized warning analysis itself does not seem
> >    likely to happen in the near future (assuming that it is fixable at all).
> >
> > 3) Silencing the warning by initializing the variable itself also seems
> >    to be undesired and feels like a fight against windmills ;-)
> >
> > 4) Not lifting maybe-uninitialized warnings to errors on S/390 only.
> >
> > Option (4) seems the least intrusive to me.  At least then it is
> > not necessary to bootstrap with --disable-werror and we would still
> > treat all other warnings as errors.  All maybe-uninitialized warnings
> > which are triggered in common code with non-aggressive inlining are
> > still caught by other architectures.  Therefore, I'm wondering why this
> > should be a huge mistake?  What would you propose instead?
> 
> I'm aware of all that.  What I think it all argues is that y'all need to
> address the issues because of how you've changed the tuning on the s390
> port.  Simply disabling things like you've suggested is, IMHO, horribly
> wrong.
> 
> 
> Improve the analysis, dummy initializers, pragmas all seem viable.  But
> again, it feels like it's something the s390 maintainers will have to
> take the lead on because of how you've retuned the port.

Fixing the analysis is of course the best option.  However, this sounds
like a non-trivial task to me and I'm missing a lot of context here,
i.e., I'm not sure what the initial goals were and if it is possible to
meet those with the requirements which are necessary to solve those
false positives (currently having PR96564 in mind where it was mentioned
that alias info is not enough but also flow-based info is required; does
this

Re: [PING] [PATCH] S/390: Do not turn maybe-uninitialized warnings into errors

2020-10-28 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Wed, Oct 28, 2020 at 08:39:41AM -0600, Jeff Law wrote:
> 
> On 10/28/20 3:38 AM, Stefan Schulze Frielinghaus via Gcc-patches wrote:
> > On Mon, Oct 05, 2020 at 02:02:57PM +0200, Stefan Schulze Frielinghaus via 
> > Gcc-patches wrote:
> >> On Tue, Sep 22, 2020 at 02:59:30PM +0200, Andreas Krebbel wrote:
> >>> On 15.09.20 17:02, Stefan Schulze Frielinghaus wrote:
> >>>> Over the last couple of months quite a few warnings about uninitialized
> >>>> variables were raised while building GCC.  A reason why these warnings
> >>>> show up on S/390 only is due to the aggressive inlining settings here.
> >>>> Some of these warnings (2c832ffedf0, b776bdca932, 2786c0221b6,
> >>>> 1657178f59b) could be fixed or, in case of a false positive, silenced
> >>>> by initializing the corresponding variable.  Since the latter keeps
> >>>> reoccurring, and warnings are turned into errors while bootstrapping,
> >>>> bootstrapping fails on S/390 consistently.  Therefore, for the moment
> >>>> do not turn those warnings into errors.
> >>>>
> >>>> config/ChangeLog:
> >>>>
> >>>>  * warnings.m4: Do not turn maybe-uninitialized warnings into errors
> >>>>  on S/390.
> >>>>
> >>>> fixincludes/ChangeLog:
> >>>>
> >>>>  * configure: Regenerate.
> >>>>
> >>>> gcc/ChangeLog:
> >>>>
> >>>>  * configure: Regenerate.
> >>>>
> >>>> libcc1/ChangeLog:
> >>>>
> >>>>  * configure: Regenerate.
> >>>>
> >>>> libcpp/ChangeLog:
> >>>>
> >>>>  * configure: Regenerate.
> >>>>
> >>>> libdecnumber/ChangeLog:
> >>>>
> >>>>  * configure: Regenerate.
> >>> That change looks good to me. Could a global reviewer please comment!
> >> Ping
> > Ping
> 
> I think this would be a huge mistake to install.

The root cause why those false positives show up on S/390 only seems to
be the more aggressive inlining w.r.t. other architectures.  Because of
bigger caches and a rather huge function call overhead we greatly
benefit from those inlining parameters. Thus:

1) Reverting those parameters would have a negative performance impact.

2) Fixing the maybe-uninitialized warning analysis itself does not seem
   likely to happen in the near future (assuming that it is fixable at all).

3) Silencing the warning by initializing the variable itself also seems
   to be undesired and feels like a fight against windmills ;-)

4) Not lifting maybe-uninitialized warnings to errors on S/390 only.

Option (4) seems the least intrusive to me.  At least then it is
not necessary to bootstrap with --disable-werror and we would still
treat all other warnings as errors.  All maybe-uninitialized warnings
which are triggered in common code with non-aggressive inlining are
still caught by other architectures.  Therefore, I'm wondering why this
should be a huge mistake?  What would you propose instead?

Cheers,
Stefan


Re: [PING] [PATCH] S/390: Do not turn maybe-uninitialized warnings into errors

2020-10-28 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Mon, Oct 05, 2020 at 02:02:57PM +0200, Stefan Schulze Frielinghaus via 
Gcc-patches wrote:
> On Tue, Sep 22, 2020 at 02:59:30PM +0200, Andreas Krebbel wrote:
> > On 15.09.20 17:02, Stefan Schulze Frielinghaus wrote:
> > > Over the last couple of months quite a few warnings about uninitialized
> > > variables were raised while building GCC.  A reason why these warnings
> > > show up on S/390 only is due to the aggressive inlining settings here.
> > > Some of these warnings (2c832ffedf0, b776bdca932, 2786c0221b6,
> > > 1657178f59b) could be fixed or, in case of a false positive, silenced
> > > by initializing the corresponding variable.  Since the latter keeps
> > > reoccurring, and warnings are turned into errors while bootstrapping,
> > > bootstrapping fails on S/390 consistently.  Therefore, for the moment
> > > do not turn those warnings into errors.
> > > 
> > > config/ChangeLog:
> > > 
> > >   * warnings.m4: Do not turn maybe-uninitialized warnings into errors
> > >   on S/390.
> > > 
> > > fixincludes/ChangeLog:
> > > 
> > >   * configure: Regenerate.
> > > 
> > > gcc/ChangeLog:
> > > 
> > >   * configure: Regenerate.
> > > 
> > > libcc1/ChangeLog:
> > > 
> > >   * configure: Regenerate.
> > > 
> > > libcpp/ChangeLog:
> > > 
> > >   * configure: Regenerate.
> > > 
> > > libdecnumber/ChangeLog:
> > > 
> > >   * configure: Regenerate.
> > 
> > That change looks good to me. Could a global reviewer please comment!
> 
> Ping

Ping

> 
> > 
> > Andreas
> > 
> > > ---
> > >  config/warnings.m4 | 20 ++--
> > >  fixincludes/configure  |  8 +++-
> > >  gcc/configure  | 12 +---
> > >  libcc1/configure   |  8 +++-
> > >  libcpp/configure   |  8 +++-
> > >  libdecnumber/configure |  8 +++-
> > >  6 files changed, 51 insertions(+), 13 deletions(-)
> > > 
> > > diff --git a/config/warnings.m4 b/config/warnings.m4
> > > index ce007f9b73e..d977bfb20af 100644
> > > --- a/config/warnings.m4
> > > +++ b/config/warnings.m4
> > > @@ -101,8 +101,10 @@ AC_ARG_ENABLE(werror-always,
> > >  AS_HELP_STRING([--enable-werror-always],
> > >  [enable -Werror despite compiler version]),
> > >  [], [enable_werror_always=no])
> > > -AS_IF([test $enable_werror_always = yes],
> > > -  [acx_Var="$acx_Var${acx_Var:+ }-Werror"])
> > > +AS_IF([test $enable_werror_always = yes], [dnl
> > > +  acx_Var="$acx_Var${acx_Var:+ }-Werror"
> > > +  AS_CASE([$host], [s390*-*-*],
> > > +  [acx_Var="$acx_Var -Wno-error=maybe-uninitialized"])])
> > >   m4_if($1, [manual],,
> > >   [AS_VAR_PUSHDEF([acx_GCCvers], [acx_cv_prog_cc_gcc_$1_or_newer])dnl
> > >AC_CACHE_CHECK([whether $CC is GCC >=$1], acx_GCCvers,
> > > @@ -116,7 +118,9 @@ AS_IF([test $enable_werror_always = yes],
> > > [AS_VAR_SET(acx_GCCvers, yes)],
> > > [AS_VAR_SET(acx_GCCvers, no)])])
> > >   AS_IF([test AS_VAR_GET(acx_GCCvers) = yes],
> > > -   [acx_Var="$acx_Var${acx_Var:+ }-Werror"])
> > > +   [acx_Var="$acx_Var${acx_Var:+ }-Werror"
> > > +AS_CASE([$host], [s390*-*-*],
> > > +[acx_Var="$acx_Var -Wno-error=maybe-uninitialized"])])
> > >AS_VAR_POPDEF([acx_GCCvers])])
> > >  m4_popdef([acx_Var])dnl
> > >  AC_LANG_POP(C)
> > > @@ -205,8 +209,10 @@ AC_ARG_ENABLE(werror-always,
> > >  AS_HELP_STRING([--enable-werror-always],
> > >  [enable -Werror despite compiler version]),
> > >  [], [enable_werror_always=no])
> > > -AS_IF([test $enable_werror_always = yes],
> > > -  [acx_Var="$acx_Var${acx_Var:+ }-Werror"])
> > > +AS_IF([test $enable_werror_always = yes], [dnl
> > > +  acx_Var="$acx_Var${acx_Var:+ }-Werror"
> > > +  AS_CASE([$host], [s390*-*-*],
> > > +  [strict_warn="$strict_warn -Wno-error=maybe-uninitialized"])])
> > >   m4_if($1, [manual],,
> > >   [AS_VAR_PUSHDEF([acx_GXXvers], [acx_cv_prog_cxx_gxx_$1_or_newer])dnl
> > >AC_CACHE_CHECK([whether $CXX is G++ >=$1], acx_GXXvers,
> > > @@ -220,7 +226,9 @@ AS_IF([test $enable_werror_always = yes],
> > > [AS_VAR_SET(acx_GXXvers, yes)],
> > > [AS_VAR_SET(acx_GXXve

[PATCH] IBM Z: Emit vector alignment hints for strlen

2020-10-18 Thread Stefan Schulze Frielinghaus via Gcc-patches
In case the vectorized version of strlen is used, each memory
access inside the loop is 16-byte aligned.  Thus record this
information so that vector alignment hints can be emitted later on.
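
For illustration (operands made up): with MEM_ALIGN recorded as 128 bits
the back end can later emit the alignment-hinted form of the vector load,
e.g.

    vl      %v16,0(%r1,%r2),4   # hint 4: quadword-aligned access

instead of a plain vl without the hint operand.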

Bootstrapped and regtested on IBM Z.  Ok for master?

gcc/ChangeLog:

* config/s390/s390.c (s390_expand_vec_strlen): Add alignment
for memory access inside loop.
---
 gcc/config/s390/s390.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/gcc/config/s390/s390.c b/gcc/config/s390/s390.c
index dbb541bbea7..f9b27f96fd7 100644
--- a/gcc/config/s390/s390.c
+++ b/gcc/config/s390/s390.c
@@ -5955,6 +5955,7 @@ s390_expand_vec_strlen (rtx target, rtx string, rtx alignment)
   rtx temp;
   rtx len = gen_reg_rtx (QImode);
   rtx cond;
+  rtx mem;
 
   s390_load_address (str_addr_base_reg, XEXP (string, 0));
   emit_move_insn (str_idx_reg, const0_rtx);
@@ -5996,10 +5997,10 @@ s390_expand_vec_strlen (rtx target, rtx string, rtx alignment)
   LABEL_NUSES (loop_start_label) = 1;
 
   /* Load 16 bytes of the string into VR.  */
-  emit_move_insn (str_reg,
-		  gen_rtx_MEM (V16QImode,
-			       gen_rtx_PLUS (Pmode, str_idx_reg,
-					     str_addr_base_reg)));
+  mem = gen_rtx_MEM (V16QImode,
+		     gen_rtx_PLUS (Pmode, str_idx_reg, str_addr_base_reg));
+  set_mem_align (mem, 128);
+  emit_move_insn (str_reg, mem);
   if (into_loop_label != NULL_RTX)
 {
   emit_label (into_loop_label);
-- 
2.25.3



Re: [PATCH] S/390: Do not turn maybe-uninitialized warnings into errors

2020-10-05 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Tue, Sep 22, 2020 at 02:59:30PM +0200, Andreas Krebbel wrote:
> On 15.09.20 17:02, Stefan Schulze Frielinghaus wrote:
> > Over the last couple of months quite a few warnings about uninitialized
> > variables were raised while building GCC.  A reason why these warnings
> > show up on S/390 only is the aggressive inlining settings used there.
> > Some of these warnings (2c832ffedf0, b776bdca932, 2786c0221b6,
> > 1657178f59b) could be fixed or, in case of a false positive, silenced by
> > initializing the corresponding variable.  Since the latter keeps
> > recurring, and since such warnings are turned into errors while
> > bootstrapping, bootstrapping fails on S/390 consistently.  Therefore,
> > for the moment, do not turn those warnings into errors.
> > 
> > config/ChangeLog:
> > 
> > * warnings.m4: Do not turn maybe-uninitialized warnings into errors
> > on S/390.
> > 
> > fixincludes/ChangeLog:
> > 
> > * configure: Regenerate.
> > 
> > gcc/ChangeLog:
> > 
> > * configure: Regenerate.
> > 
> > libcc1/ChangeLog:
> > 
> > * configure: Regenerate.
> > 
> > libcpp/ChangeLog:
> > 
> > * configure: Regenerate.
> > 
> > libdecnumber/ChangeLog:
> > 
> > * configure: Regenerate.
> 
> That change looks good to me. Could a global reviewer please comment!

Ping

> 
> Andreas
> 
> > ---
> >  config/warnings.m4 | 20 ++--
> >  fixincludes/configure  |  8 +++-
> >  gcc/configure  | 12 +---
> >  libcc1/configure   |  8 +++-
> >  libcpp/configure   |  8 +++-
> >  libdecnumber/configure |  8 +++-
> >  6 files changed, 51 insertions(+), 13 deletions(-)
> > 
> > diff --git a/config/warnings.m4 b/config/warnings.m4
> > index ce007f9b73e..d977bfb20af 100644
> > --- a/config/warnings.m4
> > +++ b/config/warnings.m4
> > @@ -101,8 +101,10 @@ AC_ARG_ENABLE(werror-always,
> >  AS_HELP_STRING([--enable-werror-always],
> >[enable -Werror despite compiler version]),
> >  [], [enable_werror_always=no])
> > -AS_IF([test $enable_werror_always = yes],
> > -  [acx_Var="$acx_Var${acx_Var:+ }-Werror"])
> > +AS_IF([test $enable_werror_always = yes], [dnl
> > +  acx_Var="$acx_Var${acx_Var:+ }-Werror"
> > +  AS_CASE([$host], [s390*-*-*],
> > +  [acx_Var="$acx_Var -Wno-error=maybe-uninitialized"])])
> >   m4_if($1, [manual],,
> >   [AS_VAR_PUSHDEF([acx_GCCvers], [acx_cv_prog_cc_gcc_$1_or_newer])dnl
> >AC_CACHE_CHECK([whether $CC is GCC >=$1], acx_GCCvers,
> > @@ -116,7 +118,9 @@ AS_IF([test $enable_werror_always = yes],
> > [AS_VAR_SET(acx_GCCvers, yes)],
> > [AS_VAR_SET(acx_GCCvers, no)])])
> >   AS_IF([test AS_VAR_GET(acx_GCCvers) = yes],
> > -   [acx_Var="$acx_Var${acx_Var:+ }-Werror"])
> > +   [acx_Var="$acx_Var${acx_Var:+ }-Werror"
> > +AS_CASE([$host], [s390*-*-*],
> > +[acx_Var="$acx_Var -Wno-error=maybe-uninitialized"])])
> >AS_VAR_POPDEF([acx_GCCvers])])
> >  m4_popdef([acx_Var])dnl
> >  AC_LANG_POP(C)
> > @@ -205,8 +209,10 @@ AC_ARG_ENABLE(werror-always,
> >  AS_HELP_STRING([--enable-werror-always],
> >[enable -Werror despite compiler version]),
> >  [], [enable_werror_always=no])
> > -AS_IF([test $enable_werror_always = yes],
> > -  [acx_Var="$acx_Var${acx_Var:+ }-Werror"])
> > +AS_IF([test $enable_werror_always = yes], [dnl
> > +  acx_Var="$acx_Var${acx_Var:+ }-Werror"
> > +  AS_CASE([$host], [s390*-*-*],
> > +  [strict_warn="$strict_warn -Wno-error=maybe-uninitialized"])])
> >   m4_if($1, [manual],,
> >   [AS_VAR_PUSHDEF([acx_GXXvers], [acx_cv_prog_cxx_gxx_$1_or_newer])dnl
> >AC_CACHE_CHECK([whether $CXX is G++ >=$1], acx_GXXvers,
> > @@ -220,7 +226,9 @@ AS_IF([test $enable_werror_always = yes],
> > [AS_VAR_SET(acx_GXXvers, yes)],
> > [AS_VAR_SET(acx_GXXvers, no)])])
> >   AS_IF([test AS_VAR_GET(acx_GXXvers) = yes],
> > -   [acx_Var="$acx_Var${acx_Var:+ }-Werror"])
> > +   [acx_Var="$acx_Var${acx_Var:+ }-Werror"
> > +AS_CASE([$host], [s390*-*-*],
> > +[acx_Var="$acx_Var -Wno-error=maybe-uninitialized"])])
> >AS_VAR_POPDEF([acx_GXXvers])])
> >  m4_popdef([acx_Var])dnl
> >  AC_LANG_POP(C++)
> > diff --git a/fixincludes/configure b/fixincludes/configure
> > index 6e2d67b655b..e0d679cc18e 100755
> > --- a/fixincludes/configure
> > +++ b/fixincludes/configure
> > @@ -4753,7 +4753,13 @@ else
> >  fi
> >  
> >  if test $enable_werror_always = yes; then :
> > -  WERROR="$WERROR${WERROR:+ }-Werror"
> > +WERROR="$WERROR${WERROR:+ }-Werror"
> > +  case $host in #(
> > +  s390*-*-*) :
> > +WERROR="$WERROR -Wno-error=maybe-uninitialized" ;; #(
> > +  *) :
> > + ;;
> > +esac
> >  fi
> >  
> >  ac_ext=c
> > diff --git a/gcc/configure b/gcc/configure
> > index 0a09777dd42..ea03581537a 100755
> > --- a/gcc/configure
> > +++ b/gcc/configure
> > @@ -7064,7 +7064,13 @@ else
> >  fi
> >  
> >  if test $enable_werror_always = yes; then :
> > -  strict_warn="$strict_warn${strict_warn:+ }-Werror"
> > +

Re: [PATCH] options: Save and restore opts_set for Optimization and Target options

2020-10-02 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Fri, Oct 02, 2020 at 10:46:33AM +0200, Jakub Jelinek wrote:
> On Wed, Sep 30, 2020 at 03:24:08PM +0200, Stefan Schulze Frielinghaus via 
> Gcc-patches wrote:
> > On Wed, Sep 30, 2020 at 01:39:11PM +0200, Jakub Jelinek wrote:
> > > On Wed, Sep 30, 2020 at 01:21:44PM +0200, Stefan Schulze Frielinghaus 
> > > wrote:
> > > > I think the problem boils down to the fact that on S/390 we distinguish
> > > > between four states of a flag: explicitly set to yes/no and implicitly
> > > > set to yes/no.  If set explicitly, the option wins.  For example, the
> > > > options `-march=z10 -mhtm` should enable the hardware transactional
> > > > memory option although z10 does not have one.  In the past, whether a
> > > > flag was set explicitly was encoded into opts_set->x_target_flags ...
> > > > for each flag individually, e.g. TARGET_OPT_HTM_P (opts_set->x_target_flags) was
> > > 
> > > Oops, seems I've missed that set_option has special treatment for
> > > CLVC_BIT_CLEAR/CLVC_BIT_SET.
> > > Which means I'll need to change the generic handling, so that for
> > > global_options_set elements mentioned in CLVC_BIT_* options are treated
> > > differently, instead of using the accumulated bitmasks they'll need to use
> > > their specific bitmask variables during the option saving/restoring.
> > > Is it ok if I defer it for tomorrow? Need to prepare for OpenMP meeting 
> > > now.
> > 
> > Sure, no problem at all.  In that case I stop to investigate further and
> > wait for you.
> 
> Here is a patch that implements that.
> 
> Can you please check if it fixes the s390x regressions that I couldn't
> reproduce in a cross?

Bootstrapped and regtested on S/390. Now all tattr-*.c test cases run
successfully with the patch. All other tests remain the same.

Thanks for the quick follow up!

Cheers,
Stefan

> 
> Bootstrapped/regtested on x86_64-linux and i686-linux so far.
> I don't have a convenient way to test it on the trunk on other
> architectures ATM, so I've just started testing a backport of the patchset to 
> 10
> on {x86_64,i686,powerpc64le,s390x,armv7hl,aarch64}-linux (though, don't
> intend to actually commit the backport).
> 
> 2020-10-02  Jakub Jelinek  
> 
>   * opth-gen.awk: For variables referenced in Mask and InverseMask,
>   don't use the explicit_mask bitmask array, but add separate
>   explicit_mask_* members with the same types as the variables.
>   * optc-save-gen.awk: Save, restore, compare and hash the separate
>   explicit_mask_* members.
> 
> --- gcc/opth-gen.awk.jj   2020-09-14 09:04:35.866854351 +0200
> +++ gcc/opth-gen.awk  2020-10-01 21:52:30.855122749 +0200
> @@ -209,6 +209,7 @@ n_target_int = 0;
>  n_target_enum = 0;
>  n_target_other = 0;
>  n_target_explicit = n_extra_target_vars;
> +n_target_explicit_mask = 0;
>  
>  for (i = 0; i < n_target_save; i++) {
>   if (target_save_decl[i] ~ "^((un)?signed +)?int +[_" alnum "]+$")
> @@ -240,6 +241,12 @@ if (have_save) {
>   var_save_seen[name]++;
>   n_target_explicit++;
>   otype = var_type_struct(flags[i])
> +
> + if (opt_args("Mask", flags[i]) != "" \
> + || opt_args("InverseMask", flags[i]))
> +			var_target_explicit_mask[n_target_explicit_mask++] \
> +				= otype "explicit_mask_" name;
> +
>   if (otype ~ "^((un)?signed +)?int *$")
> 			var_target_int[n_target_int++] = otype "x_" name;
>  
> @@ -259,6 +266,8 @@ if (have_save) {
>  } else {
>   var_target_int[n_target_int++] = "int x_target_flags";
>   n_target_explicit++;
> + var_target_explicit_mask[n_target_explicit_mask++] \
> + = "int explicit_mask_target_flags";
>  }
>  
>  for (i = 0; i < n_target_other; i++) {
> @@ -281,8 +290,12 @@ for (i = 0; i < n_target_char; i++) {
>   print "  " var_target_char[i] ";";
>  }
>  
> -print "  /* " n_target_explicit " members */";
> -print "  unsigned HOST_WIDE_INT explicit_mask[" int ((n_target_explicit + 63) / 64) "];";
> +print "  /* " n_target_explicit - n_target_explicit_mask " members */";
> +print "  unsigned HOST_WIDE_INT explicit_mask[" int ((n_target_explicit - n_target_explicit_mask + 63) / 64) "];";
> +
> +for (i = 0; i < n_target_explicit_mask; i++) {

Re: [PATCH] options: Save and restore opts_set for Optimization and Target options

2020-09-30 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Wed, Sep 30, 2020 at 01:39:11PM +0200, Jakub Jelinek wrote:
> On Wed, Sep 30, 2020 at 01:21:44PM +0200, Stefan Schulze Frielinghaus wrote:
> > I think the problem boils down to the fact that on S/390 we distinguish
> > between four states of a flag: explicitly set to yes/no and implicitly
> > set to yes/no.  If set explicitly, the option wins.  For example, the
> > options `-march=z10 -mhtm` should enable the hardware transactional
> > memory option although z10 does not have one.  In the past, whether a
> > flag was set explicitly was encoded into opts_set->x_target_flags ...
> > for each flag individually, e.g. TARGET_OPT_HTM_P (opts_set->x_target_flags) was
> 
> Oops, seems I've missed that set_option has special treatment for
> CLVC_BIT_CLEAR/CLVC_BIT_SET.
> Which means I'll need to change the generic handling, so that for
> global_options_set elements mentioned in CLVC_BIT_* options are treated
> differently, instead of using the accumulated bitmasks they'll need to use
> their specific bitmask variables during the option saving/restoring.
> Is it ok if I defer it for tomorrow? Need to prepare for OpenMP meeting now.

Sure, no problem at all.  In that case I stop to investigate further and
wait for you.

Cheers,
Stefan


Re: [PATCH] options: Save and restore opts_set for Optimization and Target options

2020-09-30 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Wed, Sep 30, 2020 at 11:32:55AM +0200, Jakub Jelinek wrote:
> On Mon, Sep 28, 2020 at 09:50:00PM +0200, Stefan Schulze Frielinghaus via 
> Gcc-patches wrote:
> > This patch breaks quite a few test cases (target-attribute/tattr-*) on
> > IBM Z.  Having a look at function cl_target_option_restore reveals that
> > some members of opts_set are reduced to 1 or 0 depending on whether a
> > member was set before or not, e.g. for target_flags we have
> 
> I've tried to reproduce the tattr FAILs reported in
> https://gcc.gnu.org/pipermail/gcc-testresults/2020-September/608760.html
> in a cross-compiler (with
> #define HAVE_AS_MACHINE_MACHINEMODE 1
> ), but couldn't, neither the ICEs nor the scan-assembler failures.
> Anyway, could you do a side-by-side debugging of one of those failures
> before/after my change and see what behaves differently?

I think the problem boils down to the fact that on S/390 we distinguish
between four states of a flag: explicitly set to yes/no and implicitly
set to yes/no.  If set explicitly, the option wins.  For example, the
options `-march=z10 -mhtm` should enable the hardware transactional
memory option although z10 does not have one.  In the past, whether a
flag was set explicitly was encoded into opts_set->x_target_flags ...
for each flag individually, e.g. TARGET_OPT_HTM_P
(opts_set->x_target_flags) was used.  This has changed with the
mentioned patch: after a call to the generated function
cl_target_option_restore, opts_set only encodes whether any flag of
x_target_flags was set, but not which individual one, since we have:
opts_set->x_target_flags = (mask & 1) != 0;
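
To make the explicit-versus-implicit distinction concrete, here is a
hand-written analogy in plain C (not GCC internals; all names made up):

struct target_opts { unsigned int htm : 1; };

/* opts_set.htm == 1 means -mhtm/-mno-htm was given explicitly and thus
   wins, even for -march=z10; otherwise the -march default applies.  */
static unsigned int
effective_htm (struct target_opts opts, struct target_opts opts_set,
	       unsigned int march_default)
{
  return opts_set.htm ? opts.htm : march_default;
}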

Compiling the following program

#pragma GCC target ("arch=z10")
void fn_pragma_0 (void) { }

with options `-march=z13 -mzarch -mhtm -mdebug` produces different flags
for 4ac7b669580 (the commit prior to your patch) and ba948b37768 (your patch).

This is my current understanding of the option handling.  I will try to
come up with a trace where these things hopefully become clearer.

Cheers,
Stefan


Re: [PATCH] options: Save and restore opts_set for Optimization and Target options

2020-09-28 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Sun, Sep 13, 2020 at 10:29:22AM +0200, Jakub Jelinek via Gcc-patches wrote:
> On Fri, Sep 11, 2020 at 11:29:52AM +0200, Jakub Jelinek via Gcc-patches wrote:
> > On Fri, Sep 11, 2020 at 09:46:37AM +0200, Christophe Lyon via Gcc-patches 
> > wrote:
> > > I'm seeing an ICE with this new test on most of my arm configurations,
> > > for instance:
> > > --target arm-none-linux-gnueabi --with-cpu cortex-a9
> > > /aci-gcc-fsf/builds/gcc-fsf-gccsrc/obj-arm-none-linux-gnueabi/gcc3/gcc/xgcc
> > > -B/aci-gcc-fsf/builds/gcc-fsf-gccsrc/obj-ar
> > > m-none-linux-gnueabi/gcc3/gcc/ c_lto_pr96939_0.o c_lto_pr96939_1.o
> > > -fdiagnostics-plain-output -flto -O2 -o
> > > gcc-target-arm-lto-pr96939-01.exe
> > 
> > Seems a latent issue.
> > Neither cl_optimization_{save,restore} nor cl_target_option_{save,restore}
> > (nor any of the target hooks they call) saves or restores any opts_set
> > values, so I think opts_set can be trusted only during option processing (if
> > at all), but not later.
> > So, short term a fix would be IMHO just stop using opts_set altogether in
> > arm_configure_build_target, it doesn't make much sense to me, it should test
> > if those strings are non-NULL instead, or at least do that when it is
> > invoked from arm_option_restore (e.g. could be done by calling it with
> > opts instead of &global_options_set).
> > Longer term, the question is if cl_optimization_{save,restore} and
> > cl_target_option_{save,restore} shouldn't be changed not to only
> > save/restore the options, but also save the opts_set flags.
> > It could be done e.g. by adding a bool array or set of bool members
> > to struct cl_optimization and struct cl_target_option , or even more compact
> > by using bitmasks, pack each 64 adjacent option flags into a UHWI element
> > of an array.
> 
> So, I've tried under debugger how it behaves and seems global_options_set
> is really an or of whether an option has been ever seen as explicit, either
> on the command line or in any of the option pragmas or optimize/target
> attributes seen so far, so it isn't something that can be relied on.
> 
> The following patch implements the saving/restoring of the opts_set bits
> (though only for the options/variables saved by the generic options-save.c
> code, for the target specific stuff that isn't handled by the generic code
> the opts_set argument is now passed to the hook and the backends can choose
> e.g. to use a TargetSave variable to save the flags either individually or
> together in some bitmask (or ignore it if they never need opts_set for the
> options). 
> 
> Bootstrapped/regtested on x86_64-linux, i686-linux, armv7hl-linux-gnueabi,
> aarch64-linux, powerpc64le-linux and lto bootstrapped on x86_64-linux, ok
> for trunk?

This patch breaks quite a few test cases (target-attribute/tattr-*) on
IBM Z.  Having a look at function cl_target_option_restore reveals that
some members of opts_set are reduced to 1 or 0 depending on whether a
member was set before or not, e.g. for target_flags we have

opts_set->x_target_flags = (mask & 1) != 0;

whereas previously those members were not touched by
cl_target_option_restore.

My intuition of this whole option evaluation is still pretty vague.
Basically opts_set is a set of options enabled via command line and/or
via pragmas/attributes whereas opts is the set of options which are
implied by opts_set.

What puzzles me right now is that in cl_target_option_save we save into
ptr only options from opts but none from opts_set, whereas in
cl_target_option_restore we override some members of opts_set.  Thus it
is unclear to me how a backend should restore opts_set then.
I'm probably missing something.  Any hints on how to restore opts_set
and especially target_flags?

Cheers,
Stefan


Re: [PATCH] IBM Z: Try to make use of load-and-test instructions

2020-09-22 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Mon, Sep 21, 2020 at 06:51:00PM +0200, Andreas Krebbel wrote:
> On 18.09.20 13:10, Stefan Schulze Frielinghaus wrote:
> > This patch enables a peephole2 optimization which transforms a load of
> > constant zero into a temporary register, followed by a comparison of
> > that register against a floating-point register of interest, into a
> > single load-and-test instruction.  However, the optimization is applied
> > only if both registers are dead afterwards and only for (in)equality
> > tests.  This is relaxed in case of fast math.
> > 
> > This is a follow up to PR88856.
> > 
> > Bootstrapped and regtested on IBM Z.
> > 
> > gcc/ChangeLog:
> > 
> > * config/s390/s390.md ("*cmp_ccs_0", "*cmp_ccz_0",
> > "*cmp_ccs_0_fastmath"): Basically change "*cmp_ccs_0" into
> > "*cmp_ccz_0" and for fast math add "*cmp_ccs_0_fastmath".
> > 
> > gcc/testsuite/ChangeLog:
> > 
> > * gcc.target/s390/load-and-test-fp-1.c: Change test to include all
> > possible combinations of dead/live registers and comparisons (equality,
> > relational).
> > * gcc.target/s390/load-and-test-fp-2.c: Same as load-and-test-fp-1.c
> > but for fast math.
> > * gcc.target/s390/load-and-test-fp.h: New test included by
> > load-and-test-fp-{1,2}.c.
> 
> Ok for mainline. Please see below for some comments.

Pushed with the mentioned changes in commit 1a84651d164.

Thanks for the review!

Cheers,
Stefan

> 
> Thanks!
> 
> Andreas
> 
> > ---
> >  gcc/config/s390/s390.md   | 54 +++
> >  .../gcc.target/s390/load-and-test-fp-1.c  | 19 +++
> >  .../gcc.target/s390/load-and-test-fp-2.c  | 17 ++
> >  .../gcc.target/s390/load-and-test-fp.h| 12 +
> >  4 files changed, 67 insertions(+), 35 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/s390/load-and-test-fp.h
> > 
> > diff --git a/gcc/config/s390/s390.md b/gcc/config/s390/s390.md
> > index 4c3e5400a2b..e591aa7c324 100644
> > --- a/gcc/config/s390/s390.md
> > +++ b/gcc/config/s390/s390.md
> > @@ -1391,23 +1391,55 @@
> >  ; (TF|DF|SF|TD|DD|SD) instructions
> >  
> >  
> > -; FIXME: load and test instructions turn SNaN into QNaN what is not
> > -; acceptable if the target will be used afterwards.  On the other hand
> > -; they are quite convenient for implementing comparisons with 0.0. So
> > -; try to enable them via splitter/peephole if the value isn't needed 
> > anymore.
> > -; See testcases: load-and-test-fp-1.c and load-and-test-fp-2.c
> > +; load and test instructions turn a signaling NaN into a quiet NaN.  Thus 
> > they
> > +; may only be used if the target register is dead afterwards or if fast 
> > math
> > +; is enabled.  The former is done via a peephole optimization.  Note, load 
> > and
> > +; test instructions may only be used for (in)equality comparisons because
> > +; relational comparisons must treat a quiet NaN like a signaling NaN which 
> > is
> > +; not the case for load and test instructions.  For fast math insn
> > +; "cmp_ccs_0_fastmath" applies.
> > +; See testcases load-and-test-fp-{1,2}.c
> > +
> > +(define_peephole2
> > +  [(set (match_operand:FP 0 "register_operand")
> > +   (match_operand:FP 1 "const0_operand"))
> > +   (set (reg:CCZ CC_REGNUM)
> > +   (compare:CCZ (match_operand:FP 2 "register_operand")
> > +(match_operand:FP 3 "register_operand")))]
> > +  "TARGET_HARD_FLOAT
> > +   && FP_REG_P (operands[2])
> > +   && REGNO (operands[0]) == REGNO (operands[3])
> > +   && peep2_reg_dead_p (2, operands[0])
> > +   && peep2_reg_dead_p (2, operands[2])"
> > +  [(parallel
> > +[(set (reg:CCZ CC_REGNUM)
> > + (match_op_dup 4 [(match_dup 2) (match_dup 1)]))
> > + (clobber (match_dup 2))])]
> > +  "operands[4] = gen_rtx_COMPARE (CCZmode, operands[2], operands[1]);")
> 
> Couldn't this be written as:
> 
>  [(parallel
> [(set (reg:CCZ CC_REGNUM)
> (compare:CCZ (match_dup 2) (match_dup 1)))
>  (clobber (match_dup 2))])])
> 
> >  
> >  ; ltxbr, ltdbr, ltebr, ltxtr, ltdtr
> > -(define_insn "*cmp_ccs_0"
> > -  [(set (reg CC_REGNUM)
> > -   (compare (match_operand:FP 0 "register_operand"  "f")
> > -(match_operand:FP 1 "const0_operand""")))
> > -   (clobber (match_operand:FP  2 "register_operand" "=0"))]
> > -  "s390_match_ccmode(insn, CCSmode) && TARGET_HARD_FLOAT"
> > +(define_insn "*cmp_ccz_0"
> > +  [(set (reg:CCZ CC_REGNUM)
> > +   (compare:CCZ (match_operand:FP 0 "register_operand" "f")
> > +(match_operand:FP 1 "const0_operand")))
> > +   (clobber (match_operand:FP 2 "register_operand" "=0"))]
> > +  "TARGET_HARD_FLOAT"
> >"ltr\t%0,%0"
> > [(set_attr "op_type" "RRE")
> >  (set_attr "type"  "fsimp")])
> >  
> > +(define_insn "*cmp_ccs_0_fastmath"
> > +  [(set (reg CC_REGNUM)
> > +   (compare (match_operand:FP 0 "register_operand" "f")
> > +(match_operand:FP 1 "const0_operand")))]
> > +  "s390_match_ccmode (insn, CCSmode)
> > +   && TARGET_HARD_FLOAT
> > +   && 

[PATCH] IBM Z: Try to make use of load-and-test instructions

2020-09-18 Thread Stefan Schulze Frielinghaus via Gcc-patches
This patch enables a peephole2 optimization which transforms a load of
constant zero into a temporary register, followed by a comparison of
that register against a floating-point register of interest, into a
single load-and-test instruction.  However, the optimization is applied
only if both registers are dead afterwards and only for (in)equality
tests.  This is relaxed in case of fast math.

This is a follow up to PR88856.
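
For illustration (my own example, not from the patch), this is the kind
of comparison the peephole targets; with -O3 -mzarch the load of 0.0
plus the compare can be fused into a single ltdbr once x is dead:

int
is_zero (double x)
{
  /* Equality test against 0.0; no use of x afterwards.  */
  return x == 0.0;
}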

Bootstrapped and regtested on IBM Z.

gcc/ChangeLog:

* config/s390/s390.md ("*cmp_ccs_0", "*cmp_ccz_0",
"*cmp_ccs_0_fastmath"): Basically change "*cmp_ccs_0" into
"*cmp_ccz_0" and for fast math add "*cmp_ccs_0_fastmath".

gcc/testsuite/ChangeLog:

* gcc.target/s390/load-and-test-fp-1.c: Change test to include all
possible combinations of dead/live registers and comparisons (equality,
relational).
* gcc.target/s390/load-and-test-fp-2.c: Same as load-and-test-fp-1.c
but for fast math.
* gcc.target/s390/load-and-test-fp.h: New test included by
load-and-test-fp-{1,2}.c.
---
 gcc/config/s390/s390.md   | 54 +++
 .../gcc.target/s390/load-and-test-fp-1.c  | 19 +++
 .../gcc.target/s390/load-and-test-fp-2.c  | 17 ++
 .../gcc.target/s390/load-and-test-fp.h| 12 +
 4 files changed, 67 insertions(+), 35 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/s390/load-and-test-fp.h

diff --git a/gcc/config/s390/s390.md b/gcc/config/s390/s390.md
index 4c3e5400a2b..e591aa7c324 100644
--- a/gcc/config/s390/s390.md
+++ b/gcc/config/s390/s390.md
@@ -1391,23 +1391,55 @@
 ; (TF|DF|SF|TD|DD|SD) instructions
 
 
-; FIXME: load and test instructions turn SNaN into QNaN what is not
-; acceptable if the target will be used afterwards.  On the other hand
-; they are quite convenient for implementing comparisons with 0.0. So
-; try to enable them via splitter/peephole if the value isn't needed anymore.
-; See testcases: load-and-test-fp-1.c and load-and-test-fp-2.c
+; load and test instructions turn a signaling NaN into a quiet NaN.  Thus they
+; may only be used if the target register is dead afterwards or if fast math
+; is enabled.  The former is done via a peephole optimization.  Note, load and
+; test instructions may only be used for (in)equality comparisons because
+; relational comparisons must treat a quiet NaN like a signaling NaN which is
+; not the case for load and test instructions.  For fast math insn
+; "cmp_ccs_0_fastmath" applies.
+; See testcases load-and-test-fp-{1,2}.c
+
+(define_peephole2
+  [(set (match_operand:FP 0 "register_operand")
+   (match_operand:FP 1 "const0_operand"))
+   (set (reg:CCZ CC_REGNUM)
+   (compare:CCZ (match_operand:FP 2 "register_operand")
+(match_operand:FP 3 "register_operand")))]
+  "TARGET_HARD_FLOAT
+   && FP_REG_P (operands[2])
+   && REGNO (operands[0]) == REGNO (operands[3])
+   && peep2_reg_dead_p (2, operands[0])
+   && peep2_reg_dead_p (2, operands[2])"
+  [(parallel
+[(set (reg:CCZ CC_REGNUM)
+ (match_op_dup 4 [(match_dup 2) (match_dup 1)]))
+ (clobber (match_dup 2))])]
+  "operands[4] = gen_rtx_COMPARE (CCZmode, operands[2], operands[1]);")
 
 ; ltxbr, ltdbr, ltebr, ltxtr, ltdtr
-(define_insn "*cmp_ccs_0"
-  [(set (reg CC_REGNUM)
-   (compare (match_operand:FP 0 "register_operand"  "f")
-(match_operand:FP 1 "const0_operand""")))
-   (clobber (match_operand:FP  2 "register_operand" "=0"))]
-  "s390_match_ccmode(insn, CCSmode) && TARGET_HARD_FLOAT"
+(define_insn "*cmp_ccz_0"
+  [(set (reg:CCZ CC_REGNUM)
+   (compare:CCZ (match_operand:FP 0 "register_operand" "f")
+(match_operand:FP 1 "const0_operand")))
+   (clobber (match_operand:FP 2 "register_operand" "=0"))]
+  "TARGET_HARD_FLOAT"
   "ltr\t%0,%0"
[(set_attr "op_type" "RRE")
 (set_attr "type"  "fsimp")])
 
+(define_insn "*cmp_ccs_0_fastmath"
+  [(set (reg CC_REGNUM)
+   (compare (match_operand:FP 0 "register_operand" "f")
+(match_operand:FP 1 "const0_operand")))]
+  "s390_match_ccmode (insn, CCSmode)
+   && TARGET_HARD_FLOAT
+   && !flag_trapping_math
+   && !flag_signaling_nans"
+  "ltr\t%0,%0"
+  [(set_attr "op_type" "RRE")
+   (set_attr "type" "fsimp")])
+
 ; VX: TFmode in FPR pairs: use cxbr instead of wfcxb
 ; cxtr, cdtr, cxbr, cdbr, cebr, cdb, ceb, wfcsb, wfcdb
 (define_insn "*cmp_ccs"
diff --git a/gcc/testsuite/gcc.target/s390/load-and-test-fp-1.c b/gcc/testsuite/gcc.target/s390/load-and-test-fp-1.c
index 2a7e88c0f1b..ebb8a88c574 100644
--- a/gcc/testsuite/gcc.target/s390/load-and-test-fp-1.c
+++ b/gcc/testsuite/gcc.target/s390/load-and-test-fp-1.c
@@ -1,17 +1,12 @@
 /* { dg-do compile } */
 /* { dg-options "-O3 -mzarch" } */
 
-/* a is used after the comparison.  We cannot use load and test here
-   since it would turn SNaNs into QNaNs.  */
+/* Use load-and-test instructions if compared for (in)equality and if variable
+   `a` is 

[PATCH] S/390: Do not turn maybe-uninitialized warnings into errors

2020-09-15 Thread Stefan Schulze Frielinghaus via Gcc-patches
Over the last couple of months quite a few warnings about uninitialized
variables were raised while building GCC.  A reason why these warnings
show up on S/390 only is the aggressive inlining settings used there.
Some of these warnings (2c832ffedf0, b776bdca932, 2786c0221b6,
1657178f59b) could be fixed or, in case of a false positive, silenced by
initializing the corresponding variable.  Since the latter keeps
recurring, and since such warnings are turned into errors while
bootstrapping, bootstrapping fails on S/390 consistently.  Therefore,
for the moment, do not turn those warnings into errors.
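
For context, a hand-written example of this warning class (illustration
only, not one of the actual GCC cases):

int
f (int c)
{
  int x;
  if (c)
    x = 1;
  /* x is only read on paths where it was assigned, but the analysis,
     especially across heavy inlining, may not be able to prove it.  */
  return c ? x : 0;
}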

config/ChangeLog:

* warnings.m4: Do not turn maybe-uninitialized warnings into errors
on S/390.

fixincludes/ChangeLog:

* configure: Regenerate.

gcc/ChangeLog:

* configure: Regenerate.

libcc1/ChangeLog:

* configure: Regenerate.

libcpp/ChangeLog:

* configure: Regenerate.

libdecnumber/ChangeLog:

* configure: Regenerate.
---
 config/warnings.m4 | 20 ++--
 fixincludes/configure  |  8 +++-
 gcc/configure  | 12 +---
 libcc1/configure   |  8 +++-
 libcpp/configure   |  8 +++-
 libdecnumber/configure |  8 +++-
 6 files changed, 51 insertions(+), 13 deletions(-)

diff --git a/config/warnings.m4 b/config/warnings.m4
index ce007f9b73e..d977bfb20af 100644
--- a/config/warnings.m4
+++ b/config/warnings.m4
@@ -101,8 +101,10 @@ AC_ARG_ENABLE(werror-always,
 AS_HELP_STRING([--enable-werror-always],
   [enable -Werror despite compiler version]),
 [], [enable_werror_always=no])
-AS_IF([test $enable_werror_always = yes],
-  [acx_Var="$acx_Var${acx_Var:+ }-Werror"])
+AS_IF([test $enable_werror_always = yes], [dnl
+  acx_Var="$acx_Var${acx_Var:+ }-Werror"
+  AS_CASE([$host], [s390*-*-*],
+  [acx_Var="$acx_Var -Wno-error=maybe-uninitialized"])])
  m4_if($1, [manual],,
  [AS_VAR_PUSHDEF([acx_GCCvers], [acx_cv_prog_cc_gcc_$1_or_newer])dnl
   AC_CACHE_CHECK([whether $CC is GCC >=$1], acx_GCCvers,
@@ -116,7 +118,9 @@ AS_IF([test $enable_werror_always = yes],
[AS_VAR_SET(acx_GCCvers, yes)],
[AS_VAR_SET(acx_GCCvers, no)])])
  AS_IF([test AS_VAR_GET(acx_GCCvers) = yes],
-   [acx_Var="$acx_Var${acx_Var:+ }-Werror"])
+   [acx_Var="$acx_Var${acx_Var:+ }-Werror"
+AS_CASE([$host], [s390*-*-*],
+[acx_Var="$acx_Var -Wno-error=maybe-uninitialized"])])
   AS_VAR_POPDEF([acx_GCCvers])])
 m4_popdef([acx_Var])dnl
 AC_LANG_POP(C)
@@ -205,8 +209,10 @@ AC_ARG_ENABLE(werror-always,
 AS_HELP_STRING([--enable-werror-always],
   [enable -Werror despite compiler version]),
 [], [enable_werror_always=no])
-AS_IF([test $enable_werror_always = yes],
-  [acx_Var="$acx_Var${acx_Var:+ }-Werror"])
+AS_IF([test $enable_werror_always = yes], [dnl
+  acx_Var="$acx_Var${acx_Var:+ }-Werror"
+  AS_CASE([$host], [s390*-*-*],
+  [strict_warn="$strict_warn -Wno-error=maybe-uninitialized"])])
  m4_if($1, [manual],,
  [AS_VAR_PUSHDEF([acx_GXXvers], [acx_cv_prog_cxx_gxx_$1_or_newer])dnl
   AC_CACHE_CHECK([whether $CXX is G++ >=$1], acx_GXXvers,
@@ -220,7 +226,9 @@ AS_IF([test $enable_werror_always = yes],
[AS_VAR_SET(acx_GXXvers, yes)],
[AS_VAR_SET(acx_GXXvers, no)])])
  AS_IF([test AS_VAR_GET(acx_GXXvers) = yes],
-   [acx_Var="$acx_Var${acx_Var:+ }-Werror"])
+   [acx_Var="$acx_Var${acx_Var:+ }-Werror"
+AS_CASE([$host], [s390*-*-*],
+[acx_Var="$acx_Var -Wno-error=maybe-uninitialized"])])
   AS_VAR_POPDEF([acx_GXXvers])])
 m4_popdef([acx_Var])dnl
 AC_LANG_POP(C++)
diff --git a/fixincludes/configure b/fixincludes/configure
index 6e2d67b655b..e0d679cc18e 100755
--- a/fixincludes/configure
+++ b/fixincludes/configure
@@ -4753,7 +4753,13 @@ else
 fi
 
 if test $enable_werror_always = yes; then :
-  WERROR="$WERROR${WERROR:+ }-Werror"
+WERROR="$WERROR${WERROR:+ }-Werror"
+  case $host in #(
+  s390*-*-*) :
+WERROR="$WERROR -Wno-error=maybe-uninitialized" ;; #(
+  *) :
+ ;;
+esac
 fi
 
 ac_ext=c
diff --git a/gcc/configure b/gcc/configure
index 0a09777dd42..ea03581537a 100755
--- a/gcc/configure
+++ b/gcc/configure
@@ -7064,7 +7064,13 @@ else
 fi
 
 if test $enable_werror_always = yes; then :
-  strict_warn="$strict_warn${strict_warn:+ }-Werror"
+strict_warn="$strict_warn${strict_warn:+ }-Werror"
+  case $host in #(
+  s390*-*-*) :
+strict_warn="$strict_warn -Wno-error=maybe-uninitialized" ;; #(
+  *) :
+ ;;
+esac
 fi
 
 ac_ext=cpp
@@ -19013,7 +19019,7 @@ else
   lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
   lt_status=$lt_dlunknown
   cat > conftest.$ac_ext <<_LT_EOF
-#line 19016 "configure"
+#line 19022 "configure"
 #include "confdefs.h"
 
 #if HAVE_DLFCN_H
@@ -19119,7 +19125,7 @@ else
   lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
   lt_status=$lt_dlunknown
   cat > conftest.$ac_ext <<_LT_EOF
-#line 19122 "configure"
+#line 19128 "configure"
 #include "confdefs.h"
 
 #if HAVE_DLFCN_H
diff --git 

Re: [PATCH] [RFC] vect: Fix infinite loop while determining peeling amount

2020-07-29 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Wed, Jul 29, 2020 at 09:11:12AM +0200, Richard Biener wrote:
> On Tue, Jul 28, 2020 at 5:36 PM Stefan Schulze Frielinghaus
>  wrote:
> >
> > On Tue, Jul 28, 2020 at 08:55:57AM +0200, Richard Biener wrote:
> > > On Mon, Jul 27, 2020 at 4:20 PM Stefan Schulze Frielinghaus
> > >  wrote:
> > > >
> > > > On Mon, Jul 27, 2020 at 12:29:11PM +0200, Richard Biener wrote:
> > > > > On Mon, Jul 27, 2020 at 11:45 AM Richard Sandiford
> > > > >  wrote:
> > > > > >
> > > > > > Richard Biener  writes:
> > > > > > > On Mon, Jul 27, 2020 at 11:09 AM Richard Sandiford
> > > > > > >  wrote:
> > > > > > >>
> > > > > > >> Richard Biener via Gcc-patches  writes:
> > > > > > >> > On Wed, Jul 22, 2020 at 5:18 PM Stefan Schulze Frielinghaus via
> > > > > > >> > Gcc-patches  wrote:
> > > > > > >> >>
> > > > > > >> >> This is a follow up to commit 5c9669a0e6c respectively 
> > > > > > >> >> discussion
> > > > > > >> >> https://gcc.gnu.org/pipermail/gcc-patches/2020-June/549132.html
> > > > > > >> >>
> > > > > > >> >> In case that an alignment constraint is less than the size of 
> > > > > > >> >> a
> > > > > > >> >> corresponding scalar type, ensure that we advance at least by 
> > > > > > >> >> one
> > > > > > >> >> iteration.  For example, on s390x we have for a long double 
> > > > > > >> >> an alignment
> > > > > > >> >> constraint of 8 bytes whereas the size is 16 bytes.  
> > > > > > >> >> Therefore,
> > > > > > >> >> TARGET_ALIGN / DR_SIZE equals zero resulting in an infinite 
> > > > > > >> >> loop which
> > > > > > >> >> can be reproduced by the following MWE:
> > > > > > >> >
> > > > > > >> > But we guard this case with vector_alignment_reachable_p, so 
> > > > > > >> > we shouldn't
> > > > > > >> > have ended up here and the patch looks bogus.
> > > > > > >>
> > > > > > >> The above sounds like it ought to count as reachable alignment 
> > > > > > >> though.
> > > > > > >> If a type requires a lower alignment than its size, then that's 
> > > > > > >> even
> > > > > > >> more easily reachable than a type that requires the same 
> > > > > > >> alignment as
> > > > > > >> the size.  I guess at one extreme, a target alignment of 1 is 
> > > > > > >> always
> > > > > > >> reachable.
> > > > > > >
> > > > > > > Well, if the element alignment is 8 but its size is 16 then when 
> > > > > > > presumably
> > > > > > > the desired vector alignment is a multiple of 16 we can never 
> > > > > > > reach it.
> > > > > > > Isn't this the case here?
> > > > > >
> > > > > > If the desired vector alignment (TARGET_ALIGN) is a multiple of 16 
> > > > > > then
> > > > > > TARGET_ALIGN / DR_SIZE will be nonzero and the problem the patch is
> > > > > > fixing wouldn't occur.  I agree that we might never be able to reach
> > > > > > that alignment if the pointer starts out misaligned by 8 bytes.
> > > > > >
> > > > > > But I think that's why it makes sense for the target to only ask
> > > > > > for 8-byte alignment for vectors too, if it can cope with it.  
> > > > > > 8-byte
> > > > > > alignment should always be achievable if the scalars are 
> > > > > > ABI-aligned.
> > > > > > And if the target does ask for only 8-byte alignment, TARGET_ALIGN /
> > > > > > DR_SIZE would be zero and the loop would never progress, which is 
> > > > > > the
> > > > > > problem that the patch is fixing.
> > > > > >
> > > > > > It would even make sense for the target to ask for 1-byte alignment,
> > > > > > if the target

Re: [PATCH] [RFC] vect: Fix infinite loop while determining peeling amount

2020-07-28 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Tue, Jul 28, 2020 at 08:55:57AM +0200, Richard Biener wrote:
> On Mon, Jul 27, 2020 at 4:20 PM Stefan Schulze Frielinghaus
>  wrote:
> >
> > On Mon, Jul 27, 2020 at 12:29:11PM +0200, Richard Biener wrote:
> > > On Mon, Jul 27, 2020 at 11:45 AM Richard Sandiford
> > >  wrote:
> > > >
> > > > Richard Biener  writes:
> > > > > On Mon, Jul 27, 2020 at 11:09 AM Richard Sandiford
> > > > >  wrote:
> > > > >>
> > > > >> Richard Biener via Gcc-patches  writes:
> > > > >> > On Wed, Jul 22, 2020 at 5:18 PM Stefan Schulze Frielinghaus via
> > > > >> > Gcc-patches  wrote:
> > > > >> >>
> > > > >> >> This is a follow up to commit 5c9669a0e6c respectively discussion
> > > > >> >> https://gcc.gnu.org/pipermail/gcc-patches/2020-June/549132.html
> > > > >> >>
> > > > >> >> In case that an alignment constraint is less than the size of a
> > > > >> >> corresponding scalar type, ensure that we advance at least by one
> > > > >> >> iteration.  For example, on s390x we have for a long double an 
> > > > >> >> alignment
> > > > >> >> constraint of 8 bytes whereas the size is 16 bytes.  Therefore,
> > > > >> >> TARGET_ALIGN / DR_SIZE equals zero resulting in an infinite loop 
> > > > >> >> which
> > > > >> >> can be reproduced by the following MWE:
> > > > >> >
> > > > >> > But we guard this case with vector_alignment_reachable_p, so we 
> > > > >> > shouldn't
> > > > >> > have ended up here and the patch looks bogus.
> > > > >>
> > > > >> The above sounds like it ought to count as reachable alignment 
> > > > >> though.
> > > > >> If a type requires a lower alignment than its size, then that's even
> > > > >> more easily reachable than a type that requires the same alignment as
> > > > >> the size.  I guess at one extreme, a target alignment of 1 is always
> > > > >> reachable.
> > > > >
> > > > > Well, if the element alignment is 8 but its size is 16 then when 
> > > > > presumably
> > > > > the desired vector alignment is a multiple of 16 we can never reach 
> > > > > it.
> > > > > Isn't this the case here?
> > > >
> > > > If the desired vector alignment (TARGET_ALIGN) is a multiple of 16 then
> > > > TARGET_ALIGN / DR_SIZE will be nonzero and the problem the patch is
> > > > fixing wouldn't occur.  I agree that we might never be able to reach
> > > > that alignment if the pointer starts out misaligned by 8 bytes.
> > > >
> > > > But I think that's why it makes sense for the target to only ask
> > > > for 8-byte alignment for vectors too, if it can cope with it.  8-byte
> > > > alignment should always be achievable if the scalars are ABI-aligned.
> > > > And if the target does ask for only 8-byte alignment, TARGET_ALIGN /
> > > > DR_SIZE would be zero and the loop would never progress, which is the
> > > > problem that the patch is fixing.
> > > >
> > > > It would even make sense for the target to ask for 1-byte alignment,
> > > > if the target doesn't care about alignment at all.
> > >
> > > Hmm, OK.  Guess I still think we should detect this somewhere upward
> > > and avoid this peeling compute at all.  Somehow.
> >
> > I've been playing around with another solution which works for me by
> > changing vector_alignment_reachable_p to return also false if the
> > alignment requirements are already satisfied, i.e., by adding:
> >
> > if (known_alignment_for_access_p (dr_info) && aligned_access_p (dr_info))
> >   return false;
> 
> That sounds wrong, instead ...

Can you elaborate on that?  A similar test exists for predicate
vector_alignment_reachable_p where the second conjunct is the same but
negated in order to test for the case where a misalignment is known:
https://gcc.gnu.org/git?p=gcc.git;a=blob;f=gcc/tree-vect-data-refs.c;h=e35a215e042478d11d6545f1f829d816d0c3620f;hb=refs/heads/master#l1263
Therefore, I'm wondering why the non-negated case should be wrong.

> > Though, I'm not entirely sure whether this makes it better or not.
> > Strictly speaking if the alignment was reachable before peeling, then
> > reaching alignment with peeling is also possible but probably not what
> > was intended.  So I guess returning false in this case is sensible.  Any
> > comments?
> 
> ... why is the DR considered for peeling at all?  If it is already
> aligned there's
> no point to do that.

Isn't the whole point of vector_alignment_reachable_p to check DRs in
order to decide whether peeling should be done or not?  At least this is
my intuition and the reason why I was suggesting to return false in case
it is aligned.
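
For instance (my own example), in a loop like the following every data
reference is known-aligned from the start, so returning false would
skip the pointless peeling computation:

double a[64] __attribute__ ((aligned (16)));
double b[64] __attribute__ ((aligned (16)));

void
fun (void)
{
  /* Both accesses are 16-byte aligned already; peeling cannot
     improve anything here.  */
  for (int i = 0; i < 64; i++)
    a[i] = b[i] + 1.0;
}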

Cheers,
Stefan

> If we want to align another DR then the loop you fix
> should run on that DRs align/size, no?
> 
> Richard.
> 
> > Thanks,
> > Stefan
> >
> > >
> > > Richard.
> > >
> > > > Thanks,
> > > > Richard


Re: [PATCH] [RFC] vect: Fix infinite loop while determining peeling amount

2020-07-27 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Mon, Jul 27, 2020 at 12:29:11PM +0200, Richard Biener wrote:
> On Mon, Jul 27, 2020 at 11:45 AM Richard Sandiford
>  wrote:
> >
> > Richard Biener  writes:
> > > On Mon, Jul 27, 2020 at 11:09 AM Richard Sandiford
> > >  wrote:
> > >>
> > >> Richard Biener via Gcc-patches  writes:
> > >> > On Wed, Jul 22, 2020 at 5:18 PM Stefan Schulze Frielinghaus via
> > >> > Gcc-patches  wrote:
> > >> >>
> > >> >> This is a follow up to commit 5c9669a0e6c respectively discussion
> > >> >> https://gcc.gnu.org/pipermail/gcc-patches/2020-June/549132.html
> > >> >>
> > >> >> In case that an alignment constraint is less than the size of a
> > >> >> corresponding scalar type, ensure that we advance at least by one
> > >> >> iteration.  For example, on s390x we have for a long double an 
> > >> >> alignment
> > >> >> constraint of 8 bytes whereas the size is 16 bytes.  Therefore,
> > >> >> TARGET_ALIGN / DR_SIZE equals zero resulting in an infinite loop which
> > >> >> can be reproduced by the following MWE:
> > >> >
> > >> > But we guard this case with vector_alignment_reachable_p, so we 
> > >> > shouldn't
> > >> > have ended up here and the patch looks bogus.
> > >>
> > >> The above sounds like it ought to count as reachable alignment though.
> > >> If a type requires a lower alignment than its size, then that's even
> > >> more easily reachable than a type that requires the same alignment as
> > >> the size.  I guess at one extreme, a target alignment of 1 is always
> > >> reachable.
> > >
> > > Well, if the element alignment is 8 but its size is 16 then when 
> > > presumably
> > > the desired vector alignment is a multiple of 16 we can never reach it.
> > > Isn't this the case here?
> >
> > If the desired vector alignment (TARGET_ALIGN) is a multiple of 16 then
> > TARGET_ALIGN / DR_SIZE will be nonzero and the problem the patch is
> > fixing wouldn't occur.  I agree that we might never be able to reach
> > that alignment if the pointer starts out misaligned by 8 bytes.
> >
> > But I think that's why it makes sense for the target to only ask
> > for 8-byte alignment for vectors too, if it can cope with it.  8-byte
> > alignment should always be achievable if the scalars are ABI-aligned.
> > And if the target does ask for only 8-byte alignment, TARGET_ALIGN /
> > DR_SIZE would be zero and the loop would never progress, which is the
> > problem that the patch is fixing.
> >
> > It would even make sense for the target to ask for 1-byte alignment,
> > if the target doesn't care about alignment at all.
> 
> Hmm, OK.  Guess I still think we should detect this somewhere upward
> and avoid this peeling compute at all.  Somehow.

I've been playing around with another solution which works for me by
changing vector_alignment_reachable_p to also return false if the
alignment requirements are already satisfied, i.e., by adding:

if (known_alignment_for_access_p (dr_info) && aligned_access_p (dr_info))
  return false;

Though, I'm not entirely sure whether this makes it better or not.
Strictly speaking, if the alignment was reachable before peeling, then
reaching alignment with peeling is also possible, but that is probably
not what was intended.  So I guess returning false in this case is
sensible.  Any comments?

Thanks,
Stefan

> 
> Richard.
> 
> > Thanks,
> > Richard


[PATCH] [RFC] vect: Fix infinite loop while determining peeling amount

2020-07-22 Thread Stefan Schulze Frielinghaus via Gcc-patches
This is a follow up to commit 5c9669a0e6c respectively discussion
https://gcc.gnu.org/pipermail/gcc-patches/2020-June/549132.html

If an alignment constraint is less than the size of the corresponding
scalar type, ensure that we advance by at least one iteration.  For
example, on s390x a long double has an alignment constraint of 8 bytes
whereas its size is 16 bytes.  Therefore, TARGET_ALIGN / DR_SIZE equals
zero, resulting in an infinite loop which can be reproduced by the
following MWE:

extern long double *a;
extern double *b;
void fun(void) {
  for (int i = 0; i < 42; i++)
a[i] = b[i];
}

Increasing the number of peelings by at least one in each iteration
fixes the issue for me.  Any comments?
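
To spell out the arithmetic, a small self-contained illustration
(values as in the long double case above):

#include <stdio.h>

int
main (void)
{
  unsigned int target_align = 8, dr_size = 16, npeel_tmp = 0;

  npeel_tmp += target_align / dr_size;		/* integer division: += 0 */
  printf ("without MAX: %u\n", npeel_tmp);	/* 0 -> loop never advances */

  npeel_tmp += target_align / dr_size > 1 ? target_align / dr_size : 1;
  printf ("with MAX:    %u\n", npeel_tmp);	/* 1 -> guaranteed progress */
  return 0;
}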

Bootstrapped and regtested on s390x.

gcc/ChangeLog:

* tree-vect-data-refs.c (vect_enhance_data_refs_alignment):
Ensure that loop variable npeel_tmp advances in each iteration.
---
 gcc/tree-vect-data-refs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index e35a215e042..a78ae61d1b0 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -1779,7 +1779,7 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
 {
	  vect_peeling_hash_insert (&peeling_htab, loop_vinfo,
				    dr_info, npeel_tmp);
- npeel_tmp += target_align / dr_size;
+ npeel_tmp += MAX (1, target_align / dr_size);
 }
 
  one_misalignment_known = true;
-- 
2.25.3



[PATCH 2/2] S/390: Emit vector alignment hints for z13 if AS accepts them

2020-07-15 Thread Stefan Schulze Frielinghaus via Gcc-patches
gcc/ChangeLog:

* config.in: Regenerate.
* config/s390/s390.c (print_operand): Emit vector alignment hints
for target z13, if AS accepts them.  For other targets the logic
stays the same.
* config/s390/s390.h (TARGET_VECTOR_LOADSTORE_ALIGNMENT_HINTS): Define
macro.
* configure: Regenerate.
* configure.ac: Check HAVE_AS_VECTOR_LOADSTORE_ALIGNMENT_HINTS_ON_Z13.

gcc/testsuite/ChangeLog:

* gcc.target/s390/vector/align-1.c: Change target architecture
to z13.
* gcc.target/s390/vector/align-2.c: Change target architecture
to z13.

(cherry picked from commit 929fd91ba975eebf9e57f7f092041271dcaf0c34)
(squashed with commit 87cb9423add08743d8bb3368f0af61ddc9572837)
---
 gcc/config.in |  7 +
 gcc/config/s390/s390.c|  4 +--
 gcc/config/s390/s390.h|  7 +
 gcc/configure | 31 +++
 gcc/configure.ac  |  5 +++
 .../gcc.target/s390/vector/align-1.c  |  2 +-
 .../gcc.target/s390/vector/align-2.c  |  2 +-
 7 files changed, 53 insertions(+), 5 deletions(-)

diff --git a/gcc/config.in b/gcc/config.in
index 4924b8a0c32..051e6afb097 100644
--- a/gcc/config.in
+++ b/gcc/config.in
@@ -724,6 +724,13 @@
 #endif
 
 
+/* Define if your assembler supports vl/vst/vlm/vstm with an optional
+   alignment hint argument on z13. */
+#ifndef USED_FOR_TARGET
+#undef HAVE_AS_VECTOR_LOADSTORE_ALIGNMENT_HINTS_ON_Z13
+#endif
+
+
 /* Define if your assembler supports VSX instructions. */
 #ifndef USED_FOR_TARGET
 #undef HAVE_AS_VSX
diff --git a/gcc/config/s390/s390.c b/gcc/config/s390/s390.c
index 5aff2084e1b..9057154be07 100644
--- a/gcc/config/s390/s390.c
+++ b/gcc/config/s390/s390.c
@@ -7737,15 +7737,13 @@ print_operand (FILE *file, rtx x, int code)
   switch (code)
 {
 case 'A':
-#ifdef HAVE_AS_VECTOR_LOADSTORE_ALIGNMENT_HINTS
-  if (TARGET_ARCH12 && MEM_P (x))
+  if (TARGET_VECTOR_LOADSTORE_ALIGNMENT_HINTS && MEM_P (x))
{
  if (MEM_ALIGN (x) >= 128)
fprintf (file, ",4");
  else if (MEM_ALIGN (x) == 64)
fprintf (file, ",3");
}
-#endif
   return;
 case 'C':
   fprintf (file, s390_branch_condition_mnemonic (x, FALSE));
diff --git a/gcc/config/s390/s390.h b/gcc/config/s390/s390.h
index 71a12b8c92e..c5307755aa1 100644
--- a/gcc/config/s390/s390.h
+++ b/gcc/config/s390/s390.h
@@ -154,6 +154,13 @@ enum processor_flags
(TARGET_VX && TARGET_CPU_VXE)
 #define TARGET_VXE_P(opts) \
(TARGET_VX_P (opts) && TARGET_CPU_VXE_P (opts))
+#if defined(HAVE_AS_VECTOR_LOADSTORE_ALIGNMENT_HINTS_ON_Z13)
+#define TARGET_VECTOR_LOADSTORE_ALIGNMENT_HINTS TARGET_Z13
+#elif defined(HAVE_AS_VECTOR_LOADSTORE_ALIGNMENT_HINTS)
+#define TARGET_VECTOR_LOADSTORE_ALIGNMENT_HINTS TARGET_ARCH12
+#else
+#define TARGET_VECTOR_LOADSTORE_ALIGNMENT_HINTS 0
+#endif
 
 #ifdef HAVE_AS_MACHINE_MACHINEMODE
 #define S390_USE_TARGET_ATTRIBUTE 1
diff --git a/gcc/configure b/gcc/configure
index 4dd81d24241..aa37763d6d4 100755
--- a/gcc/configure
+++ b/gcc/configure
@@ -27786,6 +27786,37 @@ if test $gcc_cv_as_s390_vector_loadstore_alignment_hints = yes; then
 
 $as_echo "#define HAVE_AS_VECTOR_LOADSTORE_ALIGNMENT_HINTS 1" >>confdefs.h
 
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking assembler for vector load/store alignment hints on z13" >&5
+$as_echo_n "checking assembler for vector load/store alignment hints on z13... " >&6; }
+if ${gcc_cv_as_s390_vector_loadstore_alignment_hints_on_z13+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  gcc_cv_as_s390_vector_loadstore_alignment_hints_on_z13=no
+  if test x$gcc_cv_as != x; then
+    $as_echo ' vl %v24,0(%r15),3 ' > conftest.s
+    if { ac_try='$gcc_cv_as $gcc_cv_as_flags -mzarch -march=z13 -o conftest.o conftest.s >&5'
+  { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_try\""; } >&5
+  (eval $ac_try) 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; }; }
+then
+   gcc_cv_as_s390_vector_loadstore_alignment_hints_on_z13=yes
+else
+  echo "configure: failed program was" >&5
+  cat conftest.s >&5
+fi
+rm -f conftest.o conftest.s
+  fi
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $gcc_cv_as_s390_vector_loadstore_alignment_hints_on_z13" >&5
+$as_echo "$gcc_cv_as_s390_vector_loadstore_alignment_hints_on_z13" >&6; }
+if test $gcc_cv_as_s390_vector_loadstore_alignment_hints_on_z13 = yes; then
+
+$as_echo "#define HAVE_AS_VECTOR_LOADSTORE_ALIGNMENT_HINTS_ON_Z13 1" >>confdefs.h
+
 fi
 
 
diff --git a/gcc/configure.ac b/gcc/configure.ac
index 6173a1c4f23..a3211db36c0 100644
--- a/gcc/configure.ac
+++ b/gcc/configure.ac
@@ -4883,6 +4883,11 @@ pointers into PC-relative form.])
   [vl %v24,0(%r15),3 ],,
   
