[Bug target/93039] Fails to use SSE bitwise ops for float-as-int manipulations

2021-11-27 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039

Andrew Pinski  changed:

           What    |Removed     |Added
----------------------------------------------------------------
             Status|NEW         |RESOLVED
         Resolution|---         |DUPLICATE

--- Comment #6 from Andrew Pinski  ---
Dup of bug 64897.

*** This bug has been marked as a duplicate of bug 64897 ***

[Bug target/93039] Fails to use SSE bitwise ops for float-as-int manipulations

2020-01-09 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039

--- Comment #5 from Alexander Monakov  ---
Ah, in that sense. The extra load is problematic in cold code, where it's likely
a TLB miss. For hot code the load does not depend on any previous computation
and so does not lengthen dependency chains, which makes it fine from a latency
point of view. From a throughput point of view there's a tradeoff: one extra
load per chain may be ok, but if every other instruction in a chain needs a
different load, that's probably excessive. So it needs to be costed somehow.

That said, sufficiently simple constants can be synthesized with SSE in place,
without loading them from memory; for example, the constant from the opening
example:

  pcmpeqd %xmm1, %xmm1  // xmm1 = ~0
  pslld   $31, %xmm1    // xmm1 <<= 31

(again, if we need to synthesize just one constant per chain, that's preferable;
if we need many, the extra work would need to be costed against the latency
improvement of keeping the chain on SSE)
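
The same trick covers the 0x7fffffff mask that appears in the RTL in comment #2.
A sketch (an illustration only, not something the compiler emits today):

  pcmpeqd %xmm1, %xmm1  // xmm1 = ~0 (all lanes all-ones)
  psrld   $1, %xmm1     // xmm1 = 0x7fffffff in every lane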

[Bug target/93039] Fails to use SSE bitwise ops for float-as-int manipulations

2020-01-08 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039

--- Comment #4 from rguenther at suse dot de  ---
On January 8, 2020 4:34:40 PM GMT+01:00, "amonakov at gcc dot gnu.org" wrote:
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039
>
>--- Comment #3 from Alexander Monakov  ---
>> The question is for which CPUs is it actually faster to use SSE?
>
>In the context of chains where the source and the destination need to be
>SSE registers, pretty much all CPUs? Inter-unit moves typically have some
>latency, e.g. recent AMD (since Zen) and Intel (Skylake) have latency 3
>for sse<->gpr moves (surprisingly though four generations prior to
>Skylake had latency 1). Older AMDs with shared fpu had even worse
>latencies. At the same time SSE integer ops have comparable latencies and
>throughput to gpr ones, so generally moving a chain to SSE ops isn't
>making it slower. Plus it helps with register pressure.
>
>When either the source or the destination of a chain is bound to a
>general register or memory, it's ok to continue doing it on general regs.

But we need an extra load for the constant operand with an SSE op.
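
For concreteness, a hand-written sketch of that SSE form for the AND from
comment #2 (the .LC0 label and constant-pool layout here are assumptions, not
actual GCC output):

  andps   .LC0(%rip), %xmm0   // one bitwise op, but the mask is loaded from memory
  ...
.LC0:
  .long   0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff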

[Bug target/93039] Fails to use SSE bitwise ops for float-as-int manipulations

2020-01-08 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039

--- Comment #3 from Alexander Monakov  ---
> The question is for which CPUs is it actually faster to use SSE?

In the context of chains where the source and the destination need to be SSE
registers, pretty much all CPUs? Inter-unit moves typically have some latency,
e.g. recent AMD (since Zen) and Intel (Skylake) have latency 3 for sse<->gpr
moves (surprisingly though four generations prior to Skylake had latency 1).
Older AMDs with shared fpu had even worse latencies. At the same time SSE
integer ops have comparable latencies and throughput to gpr ones, so generally
moving a chain to SSE ops isn't making it slower. Plus it helps with register
pressure.

When either the source or the destination of a chain is bound to a general
register or memory, it's ok to continue doing it on general regs.
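
To put rough numbers on those inter-unit moves: for a float already in %xmm0,
doing the AND from comment #2 on general registers costs two cross-unit moves
(a sketch; the ~3-cycle figures are the Zen/Skylake latencies quoted above):

  movd    %xmm0, %eax          // sse -> gpr move, ~3 cycles of latency
  andl    $0x7fffffff, %eax    // integer AND on the gpr side
  movd    %eax, %xmm0          // gpr -> sse move, ~3 more cycles

Keeping the chain on SSE replaces both moves with a single bitwise op such as
andps, at the cost of loading the mask from memory (comment #4 above).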

[Bug target/93039] Fails to use SSE bitwise ops for float-as-int manipulations

2020-01-08 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039

Richard Biener  changed:

           What    |Removed     |Added
----------------------------------------------------------------
             Status|UNCONFIRMED |NEW
   Last reconfirmed|            |2020-01-08
                 CC|            |rguenth at gcc dot gnu.org,
                   |            |uros at gcc dot gnu.org
     Ever confirmed|0           |1

--- Comment #2 from Richard Biener  ---
STV doesn't recognize

(insn 7 6 11 2 (parallel [
            (set (subreg:SI (reg:SF 84 [  ]) 0)
                (and:SI (subreg:SI (reg:SF 88) 0)
                    (const_int 2147483647 [0x7fffffff])))
            (clobber (reg:CC 17 flags))
        ]) "t.c":5:13 444 {*andsi_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (expr_list:REG_DEAD (reg:SF 88)
            (nil))))

it has

  if (!REG_P (XEXP (src, 0))
      && !MEM_P (XEXP (src, 0))
      && !CONST_INT_P (XEXP (src, 0))
      /* Check for andnot case.  */
      && (GET_CODE (src) != AND
          || GET_CODE (XEXP (src, 0)) != NOT
          || !REG_P (XEXP (XEXP (src, 0), 0))))
    return false;

and thus doesn't allow punning subregs.  OTOH I wonder why the above
isn't matched by a SImode SSE op ... (yeah, well, we don't have that).

If I "fix" STV with

Index: gcc/config/i386/i386-features.c
===================================================================
--- gcc/config/i386/i386-features.c (revision 280006)
+++ gcc/config/i386/i386-features.c (working copy)
@@ -1365,7 +1365,7 @@ general_scalar_to_vector_candidate_p (rt
   || GET_MODE (dst) != mode)
 return false;

-  if (!REG_P (dst) && !MEM_P (dst))
+  if (!REG_P (dst) && !SUBREG_P (dst) && !MEM_P (dst))
 return false;

   switch (GET_CODE (src))
@@ -1422,6 +1422,7 @@ general_scalar_to_vector_candidate_p (rt
 }

   if (!REG_P (XEXP (src, 0))
+  && !SUBREG_P (XEXP (src, 0))
   && !MEM_P (XEXP (src, 0))
   && !CONST_INT_P (XEXP (src, 0))
   /* Check for andnot case.  */

I see

Building chain #1...
  Adding insn 7 to chain #1
  r84 use in insn 11 isn't convertible
  Mark r84 def in insn 7 as requiring both modes in chain #1
  r88 def in insn 14 isn't convertible
  Mark r88 def in insn 14 as requiring both modes in chain #1
Collected chain #1...
  insns: 7
  defs to convert: r84, r88
Computing gain for chain #1...
  Instruction gain -6 for 7: {r84:SF#0=r88:SF#0&0x7fffffff;clobber flags:CC;}
  REG_UNUSED flags:CC
  REG_DEAD r88:SF
  Instruction conversion gain: -6
  Registers conversion cost: 12
  Total gain: -18
Chain #1 conversion is not profitable

so besides STV not handling the subregs correctly for costing, the costing of
the actual instruction is negative as well (likely because of the cost of
loading the constant).  STV also doesn't compute any "gain" for the case where
an existing conversion becomes unnecessary.

The question is for which CPUs is it actually faster to use SSE?

[Bug target/93039] Fails to use SSE bitwise ops for float-as-int manipulations

2019-12-21 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039

Marc Glisse  changed:

           What    |Removed     |Added
----------------------------------------------------------------
             Target|            |x86_64-*-*

--- Comment #1 from Marc Glisse  ---
This looks related to Bug 54716 (which was restricted to vectors).