[Bug tree-optimization/68557] Missed x86 peephole optimization for multiplying by a bool

2021-08-18 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68557

Andrew Pinski  changed:

           What|Removed |Added
             CC|        |gabravier at gmail dot com

--- Comment #4 from Andrew Pinski  ---
*** Bug 97743 has been marked as a duplicate of this bug. ***

[Bug tree-optimization/68557] Missed x86 peephole optimization for multiplying by a bool

2021-06-04 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68557

Andrew Pinski  changed:

           What|Removed                       |Added
         Status|NEW                           |ASSIGNED
       Assignee|unassigned at gcc dot gnu.org |pinskia at gcc dot gnu.org

--- Comment #3 from Andrew Pinski  ---
Mine.

[Bug tree-optimization/68557] Missed x86 peephole optimization for multiplying by a bool

2016-08-23 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68557

Andrew Pinski  changed:

           What|Removed |Added
             CC|        |pinskia at gcc dot gnu.org
       Severity|normal  |enhancement

[Bug tree-optimization/68557] Missed x86 peephole optimization for multiplying by a bool

2016-02-03 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68557

Peter Cordes  changed:

           What|Removed |Added
             CC|        |peter at cordes dot ca

--- Comment #2 from Peter Cordes  ---
Besides code-size, uop-cache size is a factor for Intel CPUs.  imul is only a
single uop, while neg/and is 2 uops.  Total number of instructions is a factor
for other CPUs, too, but only locally.  (Saving uop-cache space can mean
speedups for *other* code that doesn't get evicted).


If the operation isn't part of a long dependency chain, imul is a better choice
on almost all CPUs.  Let OOO execution sort it out.

When latency matters some, we have to weigh the tradeoff of code-size / more
insns and uops vs. slightly (or much) higher latency.

Agner Fog's instruction tables indicate that 32bit imul is probably ok for
tune=generic, but 64bit imul should maybe only be used with -mtune=intel (but
absolutely not with tune=atom.  Maybe not with tune=silvermont either, but it
does have modest OOO capabilities to hide the latency.  It's not as wide, so
saving insns maybe matters more?).  I'm not sure if tune=intel is supposed to
put much weight on pre-Silvermont Atom.

From Agner Fog's spreadsheet, updated 2016-Jan09:

                 uops/m-ops  latency  recip-throughput  execution pipe/port
Intel: SnB-family (Sandybridge through Skylake)
imul r32,r32:    1           3        1                 p1
imul r64,r64:    1           3        1                 p1

AMD: bdver1-3
imul r32,r32:    1           4        2                 EX1
imul r64,r64:    1           6        4                 EX1

Intel: Silvermont
imul r32,r32:    1           3        1                 IP0
imul r64,r64:    1           5        2                 IP0

AMD: bobcat/jaguar
imul r32,r32:    1           3        1                 I0
imul r64,r64:    1           6        4                 I0

old HW

Intel: Nehalem
imul r32,r32:    1           3        1                 p1
imul r64,r64:    1           3        1                 p0

Intel: Merom/Penryn (Core2)
imul r32,r32:    1           3        1                 p1
imul r64,r64:    1           5        2                 p0  (same as FP mul, maybe borrows its wider multiplier?)

Intel: Atom
imul r32,r32:    1           5        2                 Alu0,Mul
imul r64,r64:    6           13       11                Alu0,Mul

AMD: K8/K10
imul r32,r32:    1           3        1                 ALU0
imul r64,r64:    1           4        2                 ALU0_1 (uses units 0 and 1)

VIA: Nano3000
imul r32,r32:    1           2        1                 I2
imul r64,r64:    1           5        2                 MA


If gcc keeps track of execution port pressure at all, it should also avoid imul
when surrounding code is multiply-heavy (or doing other stuff that also
contends for the same resources as imul).  I didn't check on neg/and, but I
assume every microarchitecture can run them on any port with one cycle latency
each.
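
To make the tradeoff above concrete, here is a minimal sketch in C of the
decision being described: prefer the single-uop imul unless the operation is
on a latency-critical dependency chain and neg/and is strictly faster there.
Everything in it is hypothetical (struct seq_cost, prefer_imul and the
on_critical_path flag are illustration only, not GCC interfaces), and it
ignores port pressure entirely.

```c
#include <stdbool.h>

/* Cost of one instruction sequence, in the units of the table above. */
struct seq_cost {
    int uops;     /* uops issued for the whole sequence */
    int latency;  /* cycles from inputs ready to result ready */
};

/* Hypothetical heuristic, not how GCC decides anything:
   - on a latency-critical chain, take neg/and if it is strictly faster;
   - otherwise take imul only if it does not cost more uops. */
bool prefer_imul(struct seq_cost imul, struct seq_cost neg_and,
                 bool on_critical_path)
{
    if (on_critical_path && neg_and.latency < imul.latency)
        return false;
    return imul.uops <= neg_and.uops;
}
```

With the SnB-family numbers (imul: 1 uop, 3 cycles) and the assumed cost of
neg/and (2 uops, 1 cycle each, so 2 cycles), this picks imul off the critical
path and neg/and on it.  With the Atom 64-bit numbers (6 uops, 13 cycles) it
never picks imul.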

getting off topic here:

tune=generic should account for popularity of CPUs, right?  So I hope it won't
sacrifice much speed for SnB-family in order to avoid something that's slow on
Pentium4.  (e.g. P4 doesn't like inc/dec, but all other CPUs rename the
carry flag separately to avoid the false dep.  Not a great example, because
that only saves a couple code bytes.  shrd isn't a good example either,
because it's slow even on AMD Bulldozer.)

Is there a tune=no_glass_jaws that *will* give up speed (or code size) for
common CPUs in order to avoid things that are *really* bad on some rare
microarchitectures (especially old ones)?  Or maybe a tune=desktop that doesn't
care what's slow on Atom/Jaguar?  People distributing binaries that probably
won't be used on Atom/Silvermont netbooks might use that.

Anyway, I think it would be neat to have the option of making a binary that
is quite good on SnB and has no major problems on recent AMD, where I don't
care if it has the occasional slow instruction on Atom or K8.  Or alternatively
to have a binary that doesn't suck badly anywhere.

[Bug tree-optimization/68557] Missed x86 peephole optimization for multiplying by a bool

2015-11-26 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68557

Richard Biener  changed:

            What|Removed     |Added
        Keywords|            |missed-optimization
          Target|            |x86_64-*-*, i?86-*-*
          Status|UNCONFIRMED |NEW
Last reconfirmed|            |2015-11-26
  Ever confirmed|0           |1

--- Comment #1 from Richard Biener  ---
it's larger though (so not -Os).  If 'b' is already available as condition code
then a conditional move from zero would also work.

So I wonder how to represent this on the GIMPLE level.

  _2 = COND_EXPR ;

is a possibility and of course

  _2 = COND_EXPR ;
  _5 = x_4 & _2;

or

  _3 = (int) b_2;
  _4 = -_3;  // only if bool is unsigned
  _5 = x_6 & _4;

the latter relies on the bool having only the LSB set.
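
For reference, a minimal C sketch of the source-level forms being discussed:
the multiply by a bool, a select (what a conditional move from zero would
implement), and the negate-and-mask form.  The function names are made up for
illustration, and which instructions GCC actually emits for each one depends
on the version and the tuning flags.  The mask form assumes two's-complement
int, so that -1 is all-ones.

```c
#include <stdbool.h>
#include <stddef.h>

/* The form from the bug title: multiply by the bool (0 or 1). */
int mul_by_bool(int x, bool b) { return x * b; }

/* Select form: what a conditional move from zero would compute. */
int select_by_bool(int x, bool b) { return b ? x : 0; }

/* Mask form: -(int)b is 0 or all-ones because a bool converts to
   exactly 0 or 1, i.e. only the LSB can be set. */
int mask_by_bool(int x, bool b) { return x & -(int)b; }

/* Returns 1 if all three forms agree on a few sample inputs. */
int forms_agree(void)
{
    const int samples[] = { 0, 1, -1, 42, -12345, 2147483647 };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        for (int v = 0; v <= 1; v++) {
            bool b = (v != 0);
            int m = mul_by_bool(samples[i], b);
            if (m != select_by_bool(samples[i], b) ||
                m != mask_by_bool(samples[i], b))
                return 0;
        }
    }
    return 1;
}
```

Compiling this with something like gcc -O2 -S and comparing the three
functions is a quick way to see which sequence a given GCC picks.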