[Bug tree-optimization/90106] builtin sqrt() ignoring libm's sqrt call result

2019-05-20 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90106

--- Comment #19 from JunMa  ---
We can skip the target by adding
/* { dg-skip-if "need hardfp abi" { *-*-* } { "-mfloat-abi=soft" } { "" } } */
to the testcase.

[Bug tree-optimization/90106] builtin sqrt() ignoring libm's sqrt call result

2019-05-20 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90106

--- Comment #17 from JunMa  ---
(In reply to Christophe Lyon from comment #16)
> That's what I did... (use -fdump-tree-cdce-details).
> 
> The assembler code is:
> .arm
> .fpu softvfp
> .type   foo, %function
> foo:
> @ args = 0, pretend = 0, frame = 0
> @ frame_needed = 0, uses_anonymous_args = 0
> @ link register save eliminated.
> b   sqrtf
> 
> which is a tail call ('b' is a jump instruction on arm, not a call).

Hmm... I think the testcase works with -mfpu=vfp -mfloat-abi=hard.

[Bug tree-optimization/90106] builtin sqrt() ignoring libm's sqrt call result

2019-05-20 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90106

--- Comment #15 from JunMa  ---
(In reply to Christophe Lyon from comment #14)
> Sure, here is the contents of cdce3.c.105t.cdce:
> 
> ;; Function foo (foo, funcdef_no=0, decl_uid=4197, cgraph_uid=1,
> symbol_order=0)
> 
> foo (float x)
> {
>   float _4;
> 
>[local count: 1073741824]:
>   _4 = sqrtf (x_2(D));
>   return _4;
> 
> }

The contents are the same as cdce3.c.104t.stdarg. Would you please dump it with
-fdump-tree-cdce-details?

Then we can see (on x86_64):
   Found conditional dead call: _4 = sqrtf (x_2(D));

   cdce3.c:9: note: function call is shrink-wrapped into error conditions.

It seems that GCC thinks ARM doesn't support a direct internal function for
sqrtf (maybe the vsqrt instruction?).

[Bug tree-optimization/90106] builtin sqrt() ignoring libm's sqrt call result

2019-05-20 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90106

--- Comment #13 from JunMa  ---
(In reply to Christophe Lyon from comment #12)
> This new test fails on arm:
> FAIL: gcc.dg/cdce3.c scan-tree-dump cdce "cdce3.c:9: [^\n\r]* function call
> is shrink-wrapped into error conditions."

I don't have an ARM environment. Would you please attach the dump file of the
cdce pass?

[Bug middle-end/90514] Issue about enum type in gcc tree

2019-05-19 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90514

--- Comment #4 from JunMa  ---
(In reply to Andrew Pinski from comment #3)
> (In reply to JunMa from comment #2) 
> > I had got confused by the comments in vrp pass. the condition
> >   if ((kind != ENUM1) && (kind != ENUM2))
> > is not always false, and cannot be folded to if (0). 
> > Also the code deals with pr23046 is out of data, and should be removed.
> 
> That was C++ code rather than C code and C++ which has different rules than
> C (at least with -fstrict-enums).  The code is not out of date for gimple
> types.  Since the types in Gimple can have more well defined behaviors which
> are not exposed via C or C++ front-ends.  NOTE there could be an Ada code
> which hits the same failure (I don't know Ada that well but I do know the
> Ada front-end supports more features of the GCC middle-end than either of
> the C or C++ front-ends).

Since gimple folding has changed so much, we can easily support folding this in
match.pd rather than checking it here.

[Bug testsuite/90517] [10 regression] test case gcc.dg/cdce1.c fails (unresolved) starting with r271281

2019-05-17 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90517

JunMa  changed:

   What|Removed |Added

 CC||JunMa at linux dot alibaba.com

--- Comment #2 from JunMa  ---
(In reply to Jakub Jelinek from comment #1)
> See http://gcc.gnu.org/ml/gcc-patches/2019-05/msg01024.html
> Waiting for review on that.

LGTM, and thanks for fixing the testcases.

[Bug middle-end/90514] Issue about enum type in gcc tree

2019-05-17 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90514

--- Comment #2 from JunMa  ---
(In reply to Andrew Pinski from comment #1)
> Are you saying the precision should be 1?  If so then no, that would be
> invalid as in C, enum have the full range of the underlying type and is well
> defined to have values of 3 or higher in the enum variable.

Thanks for the explanation.

I was confused by the comments in the vrp pass. The condition
  if ((kind != ENUM1) && (kind != ENUM2))
is not always false, and cannot be folded to if (0).
Also, the code that deals with pr23046 is out of date and should be removed.
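
A minimal C sketch of why the condition cannot be folded to false: an enum
object in C can hold any value of its underlying integer type, not only the
listed enumerators. The type and enumerator names are taken from pr23046.c;
the helper name is_neither is hypothetical and exists only for illustration.

```c
enum eumtype { ENUM1, ENUM2 };

/* Hypothetical helper mirroring the condition from pr23046.c.
   Returns 1 when kind is neither of the two named enumerators.  */
int is_neither(enum eumtype kind)
{
    return (kind != ENUM1) && (kind != ENUM2);
}
```

Calling is_neither((enum eumtype)3) is valid C and takes the branch, so a
value range of [ENUM1, ENUM2] would be too narrow for this type.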

[Bug middle-end/90514] New: Issue about enum type in gcc tree

2019-05-16 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90514

Bug ID: 90514
   Summary: Issue about enum type in gcc tree
   Product: gcc
   Version: tree-ssa
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: JunMa at linux dot alibaba.com
  Target Milestone: ---

For case pr23046.c:

enum eumtype { ENUM1, ENUM2 };
void g(const enum eumtype kind );
void f(long i);
void g(const enum eumtype kind)
  {
if ((kind != ENUM1) && (kind != ENUM2))
  f(kind);
  }

 command: gcc -O2  test.c
 and when I dumped kind, I found:

 
unit-size 
align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
0x70a93738 precision:32 min  max 
values 
value 
chain 
value >> context
>
visited var 
def_stmt GIMPLE_NOP
version:3>

It looks weird to me, since I think that the min/max/precision of TREE_TYPE(kind)
are swapped with those of TREE_TYPE(TREE_TYPE(kind)).

This causes the vrp pass to get wrong range info. Also, the code that checks the
enum type is out of date.

[Bug tree-optimization/90437] Overflow detection too late for VRP

2019-05-16 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90437

JunMa  changed:

   What|Removed |Added

 CC||JunMa at linux dot alibaba.com

--- Comment #2 from JunMa  ---
(In reply to Richard Biener from comment #1)
> VRP obviously only sees a + b in [0, 20] and [0, 20] < [0, 10] as unknown.

We do have the pattern x+y < y in match.pd, but it only works with
TYPE_OVERFLOW_UNDEFINED.

I'm not sure whether we can use range info in match.pd.
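
A small C sketch of the situation, under the assumption that the operands are
unsigned and bounded by 10 (the function names are hypothetical and this is not
the testcase from the PR). Unsigned arithmetic wraps, so the
TYPE_OVERFLOW_UNDEFINED pattern does not apply, and only range information can
show that the check is dead.

```c
/* Wrapping overflow check for unsigned addition: the sum is smaller than
   an operand exactly when the addition wrapped around.  */
int add_wraps(unsigned a, unsigned b)
{
    return a + b < b;
}

/* Hypothetical caller: both operands are known to lie in [0, 10], so the
   sum lies in [0, 20] and the check can never be true.  Proving that
   needs the ranges of a and b, not just the syntactic x + y < y shape.  */
int bounded_add_wraps(unsigned a, unsigned b)
{
    if (a > 10 || b > 10)
        return 0;
    return add_wraps(a, b);
}
```

Ideally bounded_add_wraps would fold to "return 0", which is what the PR asks
for.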

[Bug tree-optimization/90387] [9 Regression] __builtin_constant_p and -Warray-bounds warnings

2019-05-10 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90387

--- Comment #4 from JunMa  ---
LGTM

[Bug tree-optimization/90387] [9 Regression] __builtin_constant_p and -Warray-bounds warnings

2019-05-09 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90387

JunMa  changed:

   What|Removed |Added

 CC||JunMa at linux dot alibaba.com

--- Comment #2 from JunMa  ---
VRP tries to fold builtin_constant_p when its argument is a function parameter.
builtin_constant_p should be removed as a dead statement in this case regardless
of "#if 1" or "#if 0", since p_len is a function parameter.

With "#if 1", the vrp pass inserts an ASSERT_EXPR to infer the value range of
p_len. This changes the argument of builtin_constant_p from a function parameter
to the result of the ASSERT_EXPR, which breaks the rule.

[Bug tree-optimization/90106] builtin sqrt() ignoring libm's sqrt call result

2019-04-24 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90106

--- Comment #9 from JunMa  ---
(In reply to JunMa from comment #7)
> yes, the transformation in CDEC prevent the tail call optimization. let's
> check the return stmt in CDEC pass.

Sorry for the confusing comment.

As discussed above, the cdce pass looks for calls to built-in functions
that set errno and whose result is used. It tries to transform these calls into
conditionally executed calls guarded by a simple range check on the arguments;
the check detects most cases in which errno does not need to be set. The
transform looks like:

y = sqrt (x);
 ==>
y = IFN_SQRT (x);
if (__builtin_isless (x, 0))
sqrt (x);

However, when the call is in tail position, this transformation breaks the
tail-call optimization, since the conditional call does not produce the return
value. This is what this PR tries to explain and fix.

Alexander gives two suggestions:
first:
y = IFN_SQRT (x);
if (__builtin_isless (x, 0))
y = sqrt (x);

second(LLVM's approach):

if (__builtin_isless (x, 0))
y = sqrt (x);
else
y = IFN_SQRT (x);


So what I want to do here is to look for tail calls and transform them as in
the first suggestion.
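
The three shapes above can be modelled in plain C as a rough sketch. Everything
here is a hypothetical stand-in: ifn_sqrt plays the role of IFN_SQRT (a value,
no errno), lib_sqrt plays the role of libm's sqrt (same value, plus errno on a
domain error), and sqrt_core is a Newton iteration used only so that the sketch
does not depend on libm. __builtin_isless is the GCC/Clang builtin used in the
pseudocode above.

```c
#include <errno.h>

int libm_calls;   /* counts calls to the library stand-in */

/* Hypothetical numeric core shared by both stand-ins.  */
static float sqrt_core(float x)
{
    if (!(x >= 0.0f)) {           /* negative or NaN input */
        float z = x - x;          /* 0 for finite x, NaN otherwise */
        return z / z;             /* NaN */
    }
    if (x == 0.0f || x + x == x)  /* 0 and +inf are their own roots */
        return x;
    float r = x >= 1.0f ? x : 1.0f;
    for (int i = 0; i < 100; i++)
        r = 0.5f * (r + x / r);
    return r;
}

/* Stand-in for IFN_SQRT: produces the value, never touches errno.  */
static float ifn_sqrt(float x) { return sqrt_core(x); }

/* Stand-in for libm's sqrt: same value, sets errno on a domain error.  */
static float lib_sqrt(float x)
{
    libm_calls++;
    if (__builtin_isless(x, 0.0f))
        errno = EDOM;
    return sqrt_core(x);
}

/* Current CDCE shape: the library call's result is dropped.  */
float sqrt_cdce_current(float x)
{
    float y = ifn_sqrt(x);
    if (__builtin_isless(x, 0.0f))
        lib_sqrt(x);
    return y;
}

/* First suggestion: reuse the library result on the error path.  */
float sqrt_cdce_reuse(float x)
{
    float y = ifn_sqrt(x);
    if (__builtin_isless(x, 0.0f))
        y = lib_sqrt(x);
    return y;
}

/* Second suggestion (LLVM's approach): exactly one of the two runs.  */
float sqrt_cdce_llvm(float x)
{
    float y;
    if (__builtin_isless(x, 0.0f))
        y = lib_sqrt(x);
    else
        y = ifn_sqrt(x);
    return y;
}
```

All three return the same value and set errno in the same cases. Only in the
last two is the library call's result the function's return value, which is
what allows the call to become a tail call.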

I did some hacks locally, but then I found that GCC generated even worse code
in the 'y = IFN_SQRT' part:

f:
pxor  %xmm1, %xmm1
movaps %xmm0, %xmm2
ucomiss %xmm0, %xmm1
sqrtss %xmm2, %xmm2
ja   .L4
movaps %xmm2, %xmm0
ret
.L4:
jmp  sqrtf

Then I used LLVM's approach regardless of whether the call is in tail position,
and it gives:

f:
  pxor  %xmm1, %xmm1
  ucomiss %xmm0, %xmm1
  ja   .L4
  sqrtss %xmm0, %xmm0
  ret
.L4:
  jmp  sqrtf 

Also, in comment 6, I did some tests of LLVM's approach.

Sorry again for the confusing comment.

[Bug tree-optimization/90106] builtin sqrt() ignoring libm's sqrt call result

2019-04-24 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90106

--- Comment #8 from JunMa  ---
(In reply to Alexander Monakov from comment #6)
> Reopening and confirming, GCC's code looks less efficient than possible for
> no good reason.
> 
> CDCE does
> 
> y = sqrt (x);
>  ==>
> y = IFN_SQRT (x);
> if (__builtin_isless (x, 0))
> sqrt (x);
> 
> but it could do
> 
> y = IFN_SQRT (x);
> if (__builtin_isless (x, 0))
> y = sqrt (x);
> 
> (note two assignments to y)
> 

What is the difference between this and LLVM's approach?

> or to mimic LLVM's approach:
> 
> if (__builtin_isless (x, 0))
> y = sqrt (x);
> else
> y = IFN_SQRT (x);

I have finished a patch which does the same as LLVM in the cdce pass, and tested
it with the case below:

  #include <math.h>
  int main () {
    float x = 1.0;
    float y = 0;
    for (int i=0; i<1; i++) {
      y += sqrtf (x+i);
    }
    return y;
  }

And I got, for x86-64 with -O2:

  # original asm of IFN_SQRT part
.L4:
  pxor  %xmm0, %xmm0
  cvtsi2ssl  %ebx, %xmm0
  addss  %xmm3, %xmm0
  ucomiss %xmm0, %xmm4
  movaps %xmm0, %xmm2
  sqrtss %xmm2, %xmm2
  ja  .L7

and perf stat:
 1,423,652,277  cycles                    #    2.180 GHz                      (83.31%)
 1,121,862,980  stalled-cycles-frontend   #   78.80% frontend cycles idle     (83.31%)
   634,957,413  stalled-cycles-backend    #   44.60% backend cycles idle      (66.62%)
 1,102,109,423  instructions              #    0.77  insn per cycle
                                          #    1.02  stalled cycles per insn  (83.31%)
   200,400,940  branches                  #  306.873 M/sec                    (83.44%)
         7,734  branch-misses             #    0.00% of all branches          (83.44%)



  # transformed asm
.L4:
  pxor  %xmm0, %xmm0
  cvtsi2ssl  %ebx, %xmm0
  addss  %xmm3, %xmm0
  ucomiss %xmm0, %xmm2
  ja   .L8
  sqrtss %xmm0, %xmm0

and perf stat:
 1,418,560,722  cycles                    #    2.180 GHz                      (83.25%)
 1,116,732,674  stalled-cycles-frontend   #   78.72% frontend cycles idle     (83.25%)
   674,837,417  stalled-cycles-backend    #   47.57% backend cycles idle      (66.81%)
 1,003,067,037  instructions              #    0.71  insn per cycle
                                          #    1.11  stalled cycles per insn  (83.41%)
   200,619,151  branches                  #  308.272 M/sec                    (83.40%)
         5,637  branch-misses             #    0.00% of all branches          (83.28%)


The transformed case executes about 9% fewer instructions and takes slightly
fewer cycles, which looks good to me. However, one thing I noticed is that the
original case has fewer 'stalled-cycles-backend', since its code has better ILP.

I'm not sure which approach is better.

Environment:
gcc version:  gcc trunk@270488 
OS: centos7.2
HW: Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz

[Bug c/89774] Add flag to force single precision

2019-04-22 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89774

JunMa  changed:

   What|Removed |Added

 CC||JunMa at linux dot alibaba.com

--- Comment #10 from JunMa  ---
(In reply to Segher Boessenkool from comment #9)
> We currently only do it for trivial cases, as the example in comment 6 shows
> as well.  This is done during expand, which is the wrong place for it.
> 
> PR90070 is asking for better optimisation of this: do the operation in single
> precision, and use single-precision constants, if this does not change the
> result (or there is some -ffast-math option).
> 
> PR22326 is also closely related.  I don't think we can close any of these PRs
> as a dup of another, they are all asking for slightly different things :-)

Clang can do this optimization in its instcombine pass. See this case:

  float f4( float x ) {double t = x + 2.0; return  t; }
  float f5( float x ) {return  x + 2.0;  }

compiled with -O2 -march=native, GCC gives:

f4:
vcvtss2sd%xmm0, %xmm0, %xmm0
vaddsd .LC1(%rip), %xmm0, %xmm0
vcvtsd2ss%xmm0, %xmm0, %xmm0
ret

f5:
vaddss .LC3(%rip), %xmm0, %xmm0
ret

while clang always emits the vaddss instruction.
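
A sketch of why the narrowing looks safe in this particular case, as far as I
understand it: 2.0 is exactly representable as a float, and double carries more
than twice the precision of float, so rounding the double sum to float should
give the same value as a single float addition. f4 is copied from the case
above; f4_single is a hypothetical hand-written single-precision version.

```c
/* f4 from the case above: add in double, then narrow to float.  */
float f4(float x)
{
    double t = x + 2.0;
    return t;
}

/* Hypothetical single-precision rewrite of f4.  */
float f4_single(float x)
{
    return x + 2.0f;
}
```

Both functions are expected to return identical values, which is the condition
under which GCC could emit vaddss for f4 as well.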

[Bug tree-optimization/90106] builtin sqrt() ignoring libm's sqrt call result

2019-04-17 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90106

JunMa  changed:

   What|Removed |Added

 CC||JunMa at linux dot alibaba.com

--- Comment #7 from JunMa  ---
Yes, the transformation in CDCE prevents the tail call optimization. Let's check
the return stmt in the CDCE pass.

[Bug middle-end/89922] Loop on fixed size array is not unrolled and poorly optimized at -O2

2019-04-10 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89922

JunMa  changed:

   What|Removed |Added

 CC||JunMa at linux dot alibaba.com

--- Comment #5 from JunMa  ---
The testcase in https://godbolt.org/z/iKi0pb is well optimized by gcc 6.5 at
-O3, but not by gcc 7 and later.
I have checked the gimple dumped by the optimized pass, and it is the same in both.
The difference is introduced by the rtl_cse1 pass.

[Bug middle-end/89977] missing -Wstringop-overflow with an out-of-bounds int128_t range

2019-04-08 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89977

--- Comment #5 from JunMa  ---
(In reply to Martin Sebor from comment #4)
> You're right that the conversion from int128_t to unsigned long can result
> in truncation, so the range of the result is that of unsigned long.  Yet I
> suspect that relying on it is more likely unintentional and a bug.  The
> question in my mind is whether narrowing int128_t conversions should be
> diagnosed just in these contexts (i.e., -Wstringop-overflow) or in others as
> well.

On the GCC side we have no way of knowing whether these truncations are
intentional, so maybe we need a new option such as -Wstringop-truncation for this.

[Bug middle-end/89977] missing -Wstringop-overflow with an out-of-bounds int128_t range

2019-04-08 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89977

--- Comment #3 from JunMa  ---
(In reply to JunMa from comment #2)
> After a bit more thinking, the behavior of gcc trunk is right. the range of
> n_3 in truncation from int128 to long unsigned int equal to the range of
> long unsigned int. for example: if n_3 = 0x1, then _1 is 0 which is
> less than 7.
> 
> so this is not a bug.

sorry, when n_3 = 0x1000  , _1 is 0.

[Bug middle-end/89977] missing -Wstringop-overflow with an out-of-bounds int128_t range

2019-04-08 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89977

--- Comment #2 from JunMa  ---
After a bit more thinking, the behavior of gcc trunk is right. After the
truncation of n_3 from int128 to long unsigned int, the range of the result
equals the full range of long unsigned int. For example: if n_3 = 0x1, then _1
is 0 which is less than 7.

So this is not a bug.
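
A small C sketch of the truncation argument. The values are chosen here for
illustration and are not taken from the report, the function names are
hypothetical, and __int128 is a GCC/Clang extension that is only available on
some targets.

```c
#include <limits.h>

/* Narrowing conversion: the bits above unsigned long's width are dropped,
   which is well defined for a conversion to an unsigned type.  */
unsigned long narrow_to_ulong(__int128 n)
{
    return (unsigned long)n;
}

/* Hypothetical out-of-range input: only the bit just above the width of
   unsigned long is set.  */
__int128 just_above_ulong(void)
{
    return (__int128)1 << (sizeof(unsigned long) * CHAR_BIT);
}
```

The input is far larger than 7, yet narrow_to_ulong(just_above_ulong()) is 0,
so nothing can be concluded about the truncated value from the lower bound of
the source range alone.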

[Bug middle-end/89977] missing -Wstringop-overflow with an out-of-bounds int128_t range

2019-04-08 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89977

JunMa  changed:

   What|Removed |Added

 CC||JunMa at linux dot alibaba.com

--- Comment #1 from JunMa  ---
In function f, the conversion in the statement _1 = (long unsigned int) n_3 is
an extension, while in function g the same statement is a truncation.
For integer truncation, GCC computes the range of the target only if the size
of the source range is less than what the precision of the target type can
represent.

I think this can be relaxed when the target type of the truncation is unsigned.

[Bug middle-end/89934] [9 Regression] ICE on a call with fewer arguments to strncpy declared without prototype

2019-04-03 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89934

JunMa  changed:

   What|Removed |Added

 CC||JunMa at linux dot alibaba.com

--- Comment #5 from JunMa  ---
similar issue in pr89911

[Bug middle-end/89911] [9 Regression] ICE in get_attr_nonstring_decl, at calls.c:1502

2019-04-01 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89911

JunMa  changed:

   What|Removed |Added

 CC||JunMa at linux dot alibaba.com

--- Comment #1 from JunMa  ---
diff --git a/gcc/calls.c b/gcc/calls.c
index 63c1bc5..d940ec8 100644
--- a/gcc/calls.c
+++ b/gcc/calls.c
@@ -1556,6 +1556,8 @@ maybe_warn_nonstring_arg (tree fndecl, tree exp)
 return;

   unsigned nargs = call_expr_nargs (exp);
+  if (nargs == 0)
+    return;

   /* The bound argument to a bounded string function like strncpy.  */
   tree bound = NULL_TREE;


This patch fixes it.

[Bug ipa/89341] [7/8/9 Regression] ICE in get, at cgraph.h:1332

2019-03-28 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89341

--- Comment #12 from JunMa  ---
(In reply to Jan Hubicka from comment #11)
> Removing the alias check seems correct to me.  The same body alias patch was
> long and needed special casing those aliases on quite few places. I am not
> at all sure why I added this one, but it definitly silences the diagnostics
> completely that is wrong.

We cannot simply remove the alias check here, since the definition and alias
fields of the target node are set to true in cgraph_node::create_alias. Consider:

static void __attribute__((weakref("bar"))) foo1(void); 
static void __attribute__((weakref("foo1"))) foo2(void);
void bar(); 

If the alias check is removed, GCC gives a warning at foo2.

I have sent the patch to the mailing list, see
https://gcc.gnu.org/ml/gcc-patches/2019-03/msg01249.html; please have a look.

[Bug tree-optimization/89809] movzwl is not utilized when uint16_t is loaded with bit-shifts (while memcpy does)

2019-03-26 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89809

--- Comment #3 from JunMa  ---
The statement generated by the front end has an issue. In the 004t.original
dump file:

return <retval> = (uint16_t) ((signed short) *p | (signed short) ((int) *(p + 1) << 8));

However, the return statement should be:

return <retval> = (uint16_t) (((int)(uint16_t) *p) | ((int)(uint16_t) *(p + 1) << 8));

Then GCC will optimize it.
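
For reference, a C sketch of the two forms. The source of pr89809.cpp is not
shown here, so load16_shift is only an assumed shape of the testcase, and both
function names are hypothetical.

```c
#include <stdint.h>

/* Assumed shape of the testcase: assemble a little-endian uint16_t from
   two bytes using a shift and an or.  */
uint16_t load16_shift(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* The same computation with the casts spelled out as in the suggested
   return statement: both operands are widened to int before the or.  */
uint16_t load16_explicit(const unsigned char *p)
{
    return (uint16_t)(((int)(uint16_t)p[0]) | ((int)(uint16_t)p[1] << 8));
}
```

Both functions compute the same value; the difference that matters for the
optimizer is only which intermediate types appear in the IL.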

[Bug tree-optimization/89809] movzwl is not utilized when uint16_t is loaded with bit-shifts (while memcpy does)

2019-03-26 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89809

JunMa  changed:

   What|Removed |Added

 CC||JunMa at linux dot alibaba.com

--- Comment #2 from JunMa  ---
g++ pr89809.cpp -O3 -fdump-tree-store-merging: 

foo (const unsigned char * p)
{
  unsigned char _1;
  signed short _2;
  unsigned char _3;
  int _4;
  int _5;
  signed short _6;
  signed short _7;
  uint16_t _10;

  <bb 2> [local count: 1073741824]:
  _1 = *p_9(D);
  _2 = (signed short) _1;
  _3 = MEM[(const unsigned char *)p_9(D) + 1B];
  _4 = (int) _3;
  _5 = _4 << 8;
  _6 = (signed short) _5;
  _7 = _2 | _6;
  _10 = (uint16_t) _7;
  return _10;

}


It looks like GCC generates too many type conversions, and this prevents the
optimization.

[Bug ipa/89341] [7/8/9 Regression] ICE in get, at cgraph.h:1332

2019-03-24 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89341

JunMa  changed:

   What|Removed |Added

 CC||JunMa at linux dot alibaba.com

--- Comment #10 from JunMa  ---
I saw the same issue with the alias attribute. GCC should error out when a
weakref or alias attribute is attached to a definition. I'll send a patch and
test cases later.

[Bug tree-optimization/89772] memchr for a character not in constant nul-padded string not folded

2019-03-20 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89772

JunMa  changed:

   What|Removed |Added

 CC||JunMa at linux dot alibaba.com

--- Comment #2 from JunMa  ---
Agreed. I think we should get the trailing nuls from c_getstr() for
arrays when folding the memchr/memcmp/bcmp builtins.
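
A minimal C sketch of the kind of call in question, assuming a constant array
that is larger than its initializer. The array and function names are
hypothetical and not taken from the PR.

```c
#include <string.h>

/* Hypothetical constant array: "abc" followed by five padding nuls.  */
static const char padded[8] = "abc";

/* Search the whole array, padding included, for a character that is
   not present.  The result is known at compile time to be null.  */
const void *find_absent(void)
{
    return memchr(padded, 'x', sizeof padded);
}

/* Offset of the first nul, which sits right after the three string
   characters.  */
int first_nul_offset(void)
{
    const char *q = memchr(padded, '\0', sizeof padded);
    return (int)(q - padded);
}
```

Folding find_absent to a null pointer requires seeing all eight bytes of the
array, including the trailing nuls, rather than only the string "abc".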