[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion

2018-01-25 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|7.3 |7.4

--- Comment #11 from Richard Biener  ---
GCC 7.3 is being released, adjusting target milestone.

[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion

2017-12-19 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

--- Comment #10 from Richard Biener  ---
(In reply to Jeffrey A. Law from comment #9)
> So AFAICT there are two issues that need to be addressed: PRE and split-paths.
> 
> First up is PRE.  Compile the sample code from c#5/c#6 with -O3
> -fno-split-paths
> 
> 
> Prior to PRE we have:
> 
>   if (_16 != 0)
> goto ; [50.00%]
>   else
> goto ; [50.00%]
> 
>[local count: 531502203]:
> 
>[local count: 1063004407]:
>   # iftmp.0_19 = PHI <2567483615(3), 0(4)>
>   _17 = _15 ^ iftmp.0_19;
> 
> That's actually reasonably good.  While it's not a conditional move in the
> gimple, it's in a form that will be easy for the RTL optimizers to handle and
> turn into a suitable cmov on x86_64 if we just leave it alone.
> 
> 
> PRE (correctly) identifies that it can reduce the number of expression
> evaluations on the path traversing bb3->bb5 by hoisting the XOR with the
> non-zero constant into BB4 resulting in:
> 
>  if (_16 != 0)
> goto ; [50.00%]
>   else
> goto ; [50.00%]
> 
>[local count: 531502203]:
>   _52 = _15 ^ 2567483615;
> 
>[local count: 1063004407]:
>   # iftmp.0_19 = PHI <2567483615(4), 0(3)>
>   # prephitmp_53 = PHI <_52(4), _15(3)>
> 
> That's correct, but far from ideal.
> 
> 
> So the second issue is split-paths.  There are actually two problems to deal
> with in split-paths.
> 
> 
> 
> 
> As it stands today this is what we see in split-paths (as a result of the
> PRE de-optimization):
> 
> 
>   
>   [ ... ]
>   if (_20 != 0)
> goto ; [50.00%]
>   else
> goto ; [50.00%]
> 
>[local count: 531502203]:
>   _18 = _25 ^ 2567483615;
> 
>[local count: 1063004407]:
>   # prephitmp_49 = PHI <_25(3), _18(4)>
>   _2 = (void *) ivtmp.8_30;
>   MEM[base: _2, offset: 0B] = prephitmp_49;
>   ivtmp.8_29 = ivtmp.8_30 + 8;
>   if (ivtmp.8_29 != _6)
> goto ; [98.99%]
>   else
> goto ; [1.01%]
> 
> split-paths should try not to muck it up further.  Note that we can probably
> identify this half-diamond pretty easily.  bb3 dominates bb4.  bb4 has a
> single statement that feeds a PHI in bb5.  That's a very likely
> if-conversion candidate so split-paths ought to leave it alone.
> 
> If we were to fix PRE then split-paths would be presented with something
> like this:
> 
>  
>  [  ... ]
>  if (_47 != 0)
> goto ; [50.00%]
>   else
> goto ; [50.00%]
> 
>[local count: 531502203]:
> 
>[local count: 1063004407]:
>   # iftmp.0_48 = PHI <2567483615(3), 0(4)>
>   _49 = _18 ^ iftmp.0_48;
> 
> ISTM that when either of the blocks in question (bb3, bb4) has *no*
> statements and a single pred that is the other block, then split-paths
> definitely should leave it alone as well.
> 
> So, to summarize.
> 
> 1. PRE mucks things up a bit.
> 2. split-paths makes it worse
> 
> I've got a prototype patch that implements the two improvements to keep
> split-paths from making things worse.  That will improve things, but to
> really do a good job we'll have to either do something about PRE or have a
> pass after PRE undo PRE's deoptimization.

I don't think it's per se a PRE "deoptimization"; it's simply what PRE
is supposed to do.  The result should be quite optimal, given that we
should be able to coalesce _52 and _15 and thus require no edge copies
into bb 5.

 if (_16 != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 531502203]:
  _52 = _15 ^ 2567483615;

   [local count: 1063004407]:
  # prephitmp_53 = PHI <_52(4), _15(3)>

with a "reasonable" CPU we'd end up with something like

   test r1, p1  // _16 != 0 into predicate reg p1
   xor r2, 2567483615, p1  // xor predicated with p1

now of course path splitting ruins things here (I never ever liked that pass;
its very low-level transform is more suitable for RTL).  And if we'd
get conditional execution modeled "properly" in GIMPLE we could if-convert
there, preventing path-splitting from mucking up things here...

[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion

2017-12-19 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

Jeffrey A. Law  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org

--- Comment #9 from Jeffrey A. Law  ---
So AFAICT there are two issues that need to be addressed: PRE and split-paths.

First up is PRE.  Compile the sample code from c#5/c#6 with -O3
-fno-split-paths


Prior to PRE we have:

  if (_16 != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 531502203]:

   [local count: 1063004407]:
  # iftmp.0_19 = PHI <2567483615(3), 0(4)>
  _17 = _15 ^ iftmp.0_19;

That's actually reasonably good.  While it's not a conditional move in the
gimple, it's in a form that will be easy for the RTL optimizers to handle and
turn into a suitable cmov on x86_64 if we just leave it alone.


PRE (correctly) identifies that it can reduce the number of expression
evaluations on the path traversing bb3->bb5 by hoisting the XOR with the
non-zero constant into BB4 resulting in:

 if (_16 != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 531502203]:
  _52 = _15 ^ 2567483615;

   [local count: 1063004407]:
  # iftmp.0_19 = PHI <2567483615(4), 0(3)>
  # prephitmp_53 = PHI <_52(4), _15(3)>

That's correct, but far from ideal.


So the second issue is split-paths.  There are actually two problems to deal
with in split-paths.




As it stands today this is what we see in split-paths (as a result of the PRE
de-optimization):


  
  [ ... ]
  if (_20 != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 531502203]:
  _18 = _25 ^ 2567483615;

   [local count: 1063004407]:
  # prephitmp_49 = PHI <_25(3), _18(4)>
  _2 = (void *) ivtmp.8_30;
  MEM[base: _2, offset: 0B] = prephitmp_49;
  ivtmp.8_29 = ivtmp.8_30 + 8;
  if (ivtmp.8_29 != _6)
goto ; [98.99%]
  else
goto ; [1.01%]

split-paths should try not to muck it up further.  Note that we can probably
identify this half-diamond pretty easily.  bb3 dominates bb4.  bb4 has a single
statement that feeds a PHI in bb5.  That's a very likely if-conversion
candidate so split-paths ought to leave it alone.

If we were to fix PRE then split-paths would be presented with something like
this:

 
 [  ... ]
 if (_47 != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 531502203]:

   [local count: 1063004407]:
  # iftmp.0_48 = PHI <2567483615(3), 0(4)>
  _49 = _18 ^ iftmp.0_48;

ISTM that when either of the blocks in question (bb3, bb4) has *no* statements
and a single pred that is the other block, then split-paths definitely should
leave it alone as well.

So, to summarize.

1. PRE mucks things up a bit.
2. split-paths makes it worse

I've got a prototype patch that implements the two improvements to keep
split-paths from making things worse.  That will improve things, but to really
do a good job we'll have to either do something about PRE or have a pass after
PRE undo PRE's deoptimization.

[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion

2017-12-14 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

Jeffrey A. Law  changed:

   What|Removed |Added

 CC||law at redhat dot com

--- Comment #8 from Jeffrey A. Law  ---
I'd look more at split-paths rather than DOM in this case.

Prior to split-paths we have:

  [ ... ]
  _20 = y_33 & 1;
  if (_20 != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 531502203]:
  _18 = _25 ^ 2567483615;

   [local count: 1063004407]:
  # prephitmp_49 = PHI <_25(3), _18(4)>
  _2 = (void *) ivtmp.8_30;
  [ ... ]

Split-paths comes along and decides it'd like to split bb5, resulting in:

  _20 = y_33 & 1;
  if (_20 != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 531502203]:
  _18 = _25 ^ 2567483615;

   [local count: 531502203]:
  # prephitmp_43 = PHI <_18(4)>
  _47 = (void *) ivtmp.8_30;
  MEM[base: _47, offset: 0B] = prephitmp_43;
  ivtmp.8_42 = ivtmp.8_30 + 8;
  if (ivtmp.8_42 != _6)
goto ; [98.99%]
  else
goto ; [1.01%]

   [local count: 531502204]:
  # prephitmp_49 = PHI <_25(3)>
  _2 = (void *) ivtmp.8_30;
  MEM[base: _2, offset: 0B] = prephitmp_49;
  ivtmp.8_29 = ivtmp.8_30 + 8;
  if (ivtmp.8_29 != _6)
goto ; [98.99%]
  else
goto ; [1.01%]

I don't see any simplifications that happen as a result of that transformation,
and we're not even going to be able to simplify the branching structure.  So
ISTM that split-paths is where we ought to concentrate.  It's been problematic
in this space before -- it's got a number of hacks already to try to discourage
path splitting in cases where doing so is going to inhibit if-conversion.

[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion

2017-12-14 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org,
   ||law at gcc dot gnu.org

--- Comment #7 from Jakub Jelinek  ---
This changed (looking at the #c6 testcase) with r242550, and the change was
intentional: tree if-conversion is usually harmful except when it enables
successful vectorization, and only rarely useful otherwise -- but this is one
of those cases.  You can get back the previous behavior with -O3
-ftree-loop-if-convert.

The reason why the RTL CE stuff doesn't do anything here is that we end up with
multiple statements in the if block, so instead of:
  if (_20 != 0)
goto ;
  else
goto ;

  :
  _18 = _25 ^ 2567483615;

  :
  # _200 = PHI <_25(4), _18(4)>
  MEM[base: _3, offset: 0B] = _200;
  ivtmp.11_29 = ivtmp.11_30 + 8;
  if (_6 == ivtmp.11_29)
goto ;
  else
goto ;

we get:
  if (_20 != 0)
goto ;
  else
goto ;

  :
  _18 = _25 ^ 2567483615;
  MEM[base: _3, offset: 0B] = _18;
  ivtmp.11_42 = ivtmp.11_30 + 8;
  if (_6 == ivtmp.11_42)
goto ;
  else
goto ;

  :
  MEM[base: _3, offset: 0B] = _25;
  ivtmp.11_29 = ivtmp.11_30 + 8;
  if (_6 == ivtmp.11_29)
goto ;
  else
goto ;

This is created by dom; I wonder what the benefit is in this case.  Even if we
don't improve it in ifcvt.c, if we can make the bb 5 with just the xor fall
through into bb 6, i.e. have the conditional branch just jump over the (single)
insn, then that looks more beneficial than duplicating more stmts.  Jeff?
Though it seems that multiple passes are keen on doing this kind of stuff, so
in order to avoid that I have to use:
-O3 -fno-tree-dominator-opts -fno-tree-vrp -fno-split-paths
With that we get:
andl$1, %edx
je  .L2
xorq%r8, %rax
.L2:
movq%rax, (%rdi)
addq$8, %rdi
cmpq%rdi, %rsi
jne .L3
which is IMHO better, but still not the cmov.

The conditional block contains in that case:
(insn 20 19 21 5 (set (reg:DI 105)
(const_int 2567483615 [0x9908b0df])) 85 {*movdi_internal}
 (nil))
(insn 21 20 22 5 (parallel [
(set (reg:DI 93 [ _25 ])
(xor:DI (reg:DI 93 [ _25 ])
(reg:DI 105)))
(clobber (reg:CC 17 flags))
]) 443 {*xordi_1}
 (expr_list:REG_DEAD (reg:DI 105)
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil))))
which is too much for ifcvt.
If I change the #c6 testcase to:
void foo(unsigned long *M)
{
  for (unsigned long k = 0; k < 227; ++k)
{
  unsigned long y =
((M[k] & 0x8000) | (M[k + 1] & 0x7fff));
  M[k] = (M[k + 397] ^ (y >> 1) ^ ((y & 1) ? 567483615 : 0));
}
}
so that the immediate fits into an x86_64 signed 32-bit immediate, then we have
just:
(insn 20 19 21 5 (parallel [
(set (reg:DI 93 [ _25 ])
(xor:DI (reg:DI 93 [ _25 ])
(const_int 567483615 [0x21d31cdf])))
(clobber (reg:CC 17 flags))
]) 443 {*xordi_1}
 (expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)))
in the conditional block and ifcvt.c can deal with that and we get:
movq%rax, %rsi
xorq$567483615, %rsi
andl$1, %edx
cmovne  %rsi, %rax
(of course disabling the jump threading and path splitting is still needed for
this).  So, if we can do something about those, perhaps we could extend ifcvt
so that it could deal with a set of a pseudo to a constant needed for the
following instruction too and take it into account in the costs.

[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion

2017-08-16 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|7.2 |7.3

[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion

2017-08-14 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|7.2 |7.3

--- Comment #7 from Richard Biener  ---
GCC 7.2 is being released, adjusting target milestone.


[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion

2017-05-03 Thread krister.walfridsson at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

--- Comment #5 from krister.walfridsson at gmail dot com ---
I have extracted a smaller test case.  The loops are generated from

  typedef mersenne_twister_engine<
  uint_fast32_t,
  32, 624, 397, 31,
  0x9908b0dfUL, 11,
  0xUL, 7,
  0x9d2c5680UL, 15,
  0xefc6UL, 18, 1812433253UL> mt19937;

and the expansion of the template ends up with loops like

void foo(unsigned long *M)
{
  for (unsigned long k = 0; k < 227; ++k)
{
  unsigned long y =
((M[k] & 0x8000) | (M[k + 1] & 0x7fff));
  M[k] = (M[k + 397] ^ (y >> 1) ^ ((y & 1) ? 2567483615 : 0));
}
}

which generates the dump described in the bug report.

--- Comment #6 from krister.walfridsson at gmail dot com ---
I have extracted a smaller test case.  The loops are generated from

  typedef mersenne_twister_engine<
  uint_fast32_t,
  32, 624, 397, 31,
  0x9908b0dfUL, 11,
  0xUL, 7,
  0x9d2c5680UL, 15,
  0xefc6UL, 18, 1812433253UL> mt19937;

and the expansion of the template ends up with loops like

void foo(unsigned long *M)
{
  for (unsigned long k = 0; k < 227; ++k)
{
  unsigned long y =
((M[k] & 0x8000) | (M[k + 1] & 0x7fff));
  M[k] = (M[k + 397] ^ (y >> 1) ^ ((y & 1) ? 2567483615 : 0));
}
}

which generates the dump described in the bug report.

[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion

2017-04-26 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

Richard Biener  changed:

   What|Removed |Added

 Depends on||80491

--- Comment #4 from Richard Biener  ---
Ok, so then this looks related to (or a dup of) PR80491.  I've also seen a
similar report that involves us no longer doing a specific inlining (similar in
using ).


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80491
[Bug 80491] [6/7/8 Regression] Compiler regression for long-add case.

[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion

2017-04-26 Thread krister.walfridsson at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

--- Comment #3 from krister.walfridsson at gmail dot com ---
You can see the issue in the generated code with

  int foo(std::mt19937 &gen)
  {
std::uniform_int_distribution dist(0,99);
return dist(gen);
  }

too. I.e. it is not just an artifact of the uninteresting use in the
benchmarking loop.

[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion

2017-04-26 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P2
   Target Milestone|--- |7.2

--- Comment #2 from Richard Biener  ---
Looks like again (from just the quoted dumps) path splitting removes the RTL
if-conversion opportunity here.  Of course that's only because the testcase
wraps the computation in a loop that does nothing interesting.

[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion

2017-04-26 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

Marc Glisse  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2017-04-26
Summary|Performance regression from |[7/8 Regression]
   |missing if-conversion   |Performance regression from
   ||missing if-conversion
 Ever confirmed|0   |1

--- Comment #1 from Marc Glisse  ---
clang generates much slower code (and even worse with libc++).
With a combination of -march=native and -fprofile-{generate,use}, I can get
faster code from gcc-7 than from gcc-6, while the .optimized dump still has
branches and not cond_expr.