[Bug target/93039] Fails to use SSE bitwise ops for float-as-int manipulations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |DUPLICATE

--- Comment #6 from Andrew Pinski ---
Dup of bug 64897.

*** This bug has been marked as a duplicate of bug 64897 ***
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039

--- Comment #5 from Alexander Monakov ---
Ah, in that sense. The extra load is problematic in cold code, where it is
likely a TLB miss.

For hot code: the load does not depend on any previous computations, so it does
not lengthen dependency chains; it is therefore fine from a latency point of
view. From a throughput point of view there is a tradeoff: one extra load per
chain may be acceptable, but if every other instruction in a chain needs a
different load, that is probably excessive. So it needs to be costed somehow.

That said, sufficiently simple constants can be synthesized with SSE
in-register without loading them from memory, for example the constant in the
opening example:

    pcmpeqd %xmm1, %xmm1    // xmm1 = ~0
    pslld   $31, %xmm1      // xmm1 <<= 31

(Again: if we need to synthesize just one constant per chain, that is
preferable; if we need many, the extra work would need to be costed against the
latency improvement of keeping the chain on SSE.)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039

--- Comment #4 from rguenther at suse dot de ---
On January 8, 2020 4:34:40 PM GMT+01:00, "amonakov at gcc dot gnu.org" wrote:
>--- Comment #3 from Alexander Monakov ---
>> The question is for which CPUs is it actually faster to use SSE?
>
>In the context of chains where the source and the destination need to be
>SSE registers, pretty much all CPUs? Inter-unit moves typically have some
>latency, e.g. recent AMD (since Zen) and Intel (Skylake) have latency 3
>for sse<->gpr moves (surprisingly though four generations prior to
>Skylake had latency 1). Older AMDs with shared fpu had even worse
>latencies. At the same time SSE integer ops have comparable latencies and
>throughput to gpr ones, so generally moving a chain to SSE ops isn't
>making it slower. Plus it helps with register pressure.
>
>When either the source or the destination of a chain is bound to a
>general register or memory, it's ok to continue doing it on general regs.

But we need an extra load for the constant operand with an SSE op.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039

--- Comment #3 from Alexander Monakov ---
> The question is for which CPUs is it actually faster to use SSE?

In the context of chains where the source and the destination need to be SSE
registers, pretty much all CPUs? Inter-unit moves typically have some latency:
recent AMD (since Zen) and Intel (Skylake) have latency 3 for sse<->gpr moves
(surprisingly, the four generations prior to Skylake had latency 1), and older
AMDs with a shared fpu had even worse latencies. At the same time, SSE integer
ops have latencies and throughput comparable to gpr ones, so moving a chain to
SSE ops generally does not make it slower. Plus it helps with register
pressure.

When either the source or the destination of a chain is bound to a general
register or memory, it's ok to continue doing it on general regs.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2020-01-08
                 CC|                            |rguenth at gcc dot gnu.org,
                   |                            |uros at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #2 from Richard Biener ---
STV doesn't recognize

(insn 7 6 11 2 (parallel [
            (set (subreg:SI (reg:SF 84 [ ]) 0)
                (and:SI (subreg:SI (reg:SF 88) 0)
                    (const_int 2147483647 [0x7fffffff])))
            (clobber (reg:CC 17 flags))
        ]) "t.c":5:13 444 {*andsi_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (expr_list:REG_DEAD (reg:SF 88)
            (nil))))

It has

      if (!REG_P (XEXP (src, 0))
	  && !MEM_P (XEXP (src, 0))
	  && !CONST_INT_P (XEXP (src, 0))
	  /* Check for andnot case.  */
	  && (GET_CODE (src) != AND
	      || GET_CODE (XEXP (src, 0)) != NOT
	      || !REG_P (XEXP (XEXP (src, 0), 0))))
	return false;

and thus doesn't allow punning subregs. OTOH I wonder why the above isn't
matched by a SImode SSE op ... (yeah, well, we don't have that). If I "fix"
STV with

Index: gcc/config/i386/i386-features.c
===================================================================
--- gcc/config/i386/i386-features.c	(revision 280006)
+++ gcc/config/i386/i386-features.c	(working copy)
@@ -1365,7 +1365,7 @@ general_scalar_to_vector_candidate_p (rt
 	  || GET_MODE (dst) != mode)
 	return false;
 
-      if (!REG_P (dst) && !MEM_P (dst))
+      if (!REG_P (dst) && !SUBREG_P (dst) && !MEM_P (dst))
 	return false;
 
       switch (GET_CODE (src))
@@ -1422,6 +1422,7 @@ general_scalar_to_vector_candidate_p (rt
 	}
 
       if (!REG_P (XEXP (src, 0))
+	  && !SUBREG_P (XEXP (src, 0))
 	  && !MEM_P (XEXP (src, 0))
 	  && !CONST_INT_P (XEXP (src, 0))
 	  /* Check for andnot case.  */

I see

Building chain #1...
  Adding insn 7 to chain #1
  r84 use in insn 11 isn't convertible
  Mark r84 def in insn 7 as requiring both modes in chain #1
  r88 def in insn 14 isn't convertible
  Mark r88 def in insn 14 as requiring both modes in chain #1
Collected chain #1...
  insns: 7
  defs to convert: r84, r88
Computing gain for chain #1...
  Instruction gain -6 for 7: {r84:SF#0=r88:SF#0&0x7fffffff;clobber flags:CC;}
      REG_UNUSED flags:CC
      REG_DEAD r88:SF
  Instruction conversion gain: -6
  Registers conversion cost: 12
  Total gain: -18
Chain #1 conversion is not profitable

So besides STV not handling the subregs correctly for costing, the cost
computed for the actual instruction is negative as well (likely because of the
cost of loading the constant). STV also doesn't compute "gain" when an existing
conversion becomes unnecessary.

The question is for which CPUs is it actually faster to use SSE?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039

Marc Glisse changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*

--- Comment #1 from Marc Glisse ---
This looks related to Bug 54716 (which was restricted to vectors).