[Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations

2023-07-12 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

--- Comment #10 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:13c556d6ae84be3ee2bc245a56eafa58221de86a

commit r14-2447-g13c556d6ae84be3ee2bc245a56eafa58221de86a
Author: liuhongt 
Date:   Thu Jun 29 14:25:28 2023 +0800

Break false dependence for vpternlog by inserting vpxor or setting
constraint of input operand to '0'

False dependency happens when destination is only updated by
pternlog. There is no false dependency when destination is also used
in source. So either a pxor should be inserted, or input operand
should be set with constraint '0'.

gcc/ChangeLog:

PR target/110438
PR target/110202
* config/i386/predicates.md
(int_float_vector_all_ones_operand): New predicate.
* config/i386/sse.md (*vmov_constm1_pternlog_false_dep): New
define_insn.
(*_cvtmask2_pternlog_false_dep):
Ditto.
(*_cvtmask2_pternlog_false_dep):
Ditto.
(*_cvtmask2): Adjust to
define_insn_and_split to avoid false dependence.
(*_cvtmask2): Ditto.
(one_cmpl2): Adjust constraint
of operands 1 to '0' to avoid false dependence.
(*andnot3): Ditto.
(iornot3): Ditto.
(*3): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr110438.c: New test.
* gcc.target/i386/pr100711-6.c: Adjust testcase.

[Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations

2023-06-27 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

--- Comment #9 from Alexander Monakov  ---
(In reply to Hongtao.liu from comment #8)
> 
> For this one, we can load *a into %zmm0 to avoid false_dependence.
> 
> vmovdqau ZMMWORD PTR [rdi], zmm0
> vpternlogq  zmm0, zmm0, zmm0, 85

Yes, since ternlog with memory operand needs two fused-domain uops on Intel
CPUs, breaking out the load would be more efficient for both negate1 and
negate2.

[Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations

2023-06-27 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

--- Comment #8 from Hongtao.liu  ---
(In reply to Alexander Monakov from comment #7)
> Note that vpxor serves as a dependency-breaking instruction (see PR 110438).
> So in negate1 we do the right thing for the wrong reasons, and in negate2 we
> can cause a substantial stall if the previous computation of xmm0 has a
> non-trivial dependency chain.

For this one, we can load *a into %zmm0 to avoid false_dependence.

vmovdqau ZMMWORD PTR [rdi], zmm0
vpternlogq  zmm0, zmm0, zmm0, 85

[Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations

2023-06-27 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

--- Comment #7 from Alexander Monakov  ---
Note that vpxor serves as a dependency-breaking instruction (see PR 110438). So
in negate1 we do the right thing for the wrong reasons, and in negate2 we can
cause a substantial stall if the previous computation of xmm0 has a non-trivial
dependency chain.

[Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations

2023-06-12 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

Alexander Monakov  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #6 from Alexander Monakov  ---
(In reply to Jakub Jelinek from comment #3)
> And I must say I don't immediately see easy rules how to find out from the
> immediate value which set is which, so unless we find some easy rule for
> that, we'd need to hardcode the mapping between the 256 values to a bitmask
> which inputs are actually used.

Well, that's really easy. The immediate is just an eight-entry look-up table
from any possible input bit triple to the output bit. The leftmost operand
corresponds to the most significant bit in the triple, so to check if the
operation vpternlog(A, B, C, I) is invariant w.r.t. A you check if the two
nibbles of I are equal. Here we have 0x55, equal nibbles, and the operation is
invariant w.r.t. A.

Similarly, to check if it's invariant w.r.t B we check if two-bit groups in I
come in pairs, or in code: (I & 0x33) == ((I >> 2) & 0x33). For 0x55 both sides
evaluate to 0x11, so again, invariant w.r.t B.

Finally, checking invariance w.r.t. C is (I & 0x55) == ((I >> 1) & 0x55). For
0x55 the two sides are 0x55 and 0x00, so the result does depend on C.
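
The three checks above can be collected into one small helper. This is only a
sketch: the name ternlog_used_operands and the layout of the returned bitmask
(bit 2 = A, bit 1 = B, bit 0 = C) are made up for illustration and are not
anything GCC provides.

```c
#include <stdint.h>

/* Hypothetical helper, not a GCC function: returns a bitmask of the
   operands that vpternlog(A, B, C, imm) really depends on.
   Bit 2 = A, bit 1 = B, bit 0 = C (layout chosen for this sketch).  */
static unsigned
ternlog_used_operands (uint8_t imm)
{
  unsigned used = 0;

  /* A selects between the high and the low nibble of the table.  */
  if ((imm & 0x0f) != (imm >> 4))
    used |= 4;
  /* B selects between adjacent two-bit groups.  */
  if ((imm & 0x33) != ((imm >> 2) & 0x33))
    used |= 2;
  /* C selects between adjacent single bits.  */
  if ((imm & 0x55) != ((imm >> 1) & 0x55))
    used |= 1;

  return used;
}
```

With this, 0x55 yields 1 (only C is used), 0x00 and 0xff yield 0 (constants),
0xf0 (a plain copy of A) yields 4, and 0xca (A ? B : C) yields 7.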

[Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations

2023-06-12 Thread fabio at cannizzo dot net via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

--- Comment #5 from Fabio Cannizzo  ---
> Well, there is nothing magic on exactly 0x55 immediate, there are 256
> possible immediates, most of them use all of A, B, C, some of them use just
> A, B, others just B, C, others just A, C, others just A, others just B,
> others just C, others none of them.

Indeed I meant 0x55 just as an example.

[Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations

2023-06-10 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

Andrew Pinski  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2023-06-10
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #4 from Andrew Pinski  ---
(In reply to Jakub Jelinek from comment #3)
> Well, there is nothing magic on exactly 0x55 immediate, there are 256
> possible immediates, most of them use all of A, B, C, some of them use just
> A, B, others just B, C, others just A, C, others just A, others just B,
> others just C, others none of them.
> And I must say I don't immediately see easy rules how to find out from the
> immediate value which set is which, so unless we find some easy rule for
> that, we'd need to hardcode the mapping between the 256 values to a bitmask
> which inputs are actually used.
> And then the question is how to represent that in RTL to make it clear that
> some operands are mentioned but their value isn't really used.

In the case of 0x55, one idea would be to split (or expand) it into the form
already used to represent ~.

That is:
(insn:TI 6 3 12 2 (set (reg:V8DI 20 xmm0 [85])
(xor:V8DI (mem:V8DI (reg/v/f:DI 5 di [orig:84 a ] [84]) [0 *a_3(D)+0
S64 A512])
(const_vector:V8DI [
(const_int -1 [0xffffffffffffffff]) repeated x8
]))) "/app/example.cpp":21:14 6764 {*one_cmplv8di2}
 (expr_list:REG_DEAD (reg/v/f:DI 5 di [orig:84 a ] [84])
(nil)))
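
To see why immediate 0x55 amounts to a bitwise NOT of the third operand, here
is a scalar model of the instruction for a single 64-bit lane. It is a sketch
for illustration only: ternlog64 is a made-up name, and the model assumes the
documented table lookup where each result bit is bit number
(a << 2 | b << 1 | c) of the immediate.

```c
#include <stdint.h>

/* Hypothetical scalar model of one 64-bit lane of vpternlogq.
   For every bit position, the bits of a, b and c form a 3-bit index
   into the 8-entry truth table held in imm.  */
static uint64_t
ternlog64 (uint64_t a, uint64_t b, uint64_t c, uint8_t imm)
{
  uint64_t result = 0;

  for (int bit = 0; bit < 64; bit++)
    {
      unsigned idx = (unsigned) ((((a >> bit) & 1) << 2)
                                 | (((b >> bit) & 1) << 1)
                                 | ((c >> bit) & 1));
      result |= (uint64_t) ((imm >> idx) & 1) << bit;
    }

  return result;
}
```

For any a and b, ternlog64 (a, b, c, 0x55) equals ~c, which is the same
operation as the xor with all-ones in the one_cmpl RTL above.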

[Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations

2023-06-10 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

Jakub Jelinek  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hjl.tools at gmail dot com,
                   |                            |jakub at gcc dot gnu.org

--- Comment #3 from Jakub Jelinek  ---
Well, there is nothing magic on exactly 0x55 immediate, there are 256 possible
immediates, most of them use all of A, B, C, some of them use just A, B, others
just B, C, others just A, C, others just A, others just B, others just C,
others none of them.
And I must say I don't immediately see easy rules how to find out from the
immediate value which set is which, so unless we find some easy rule for that,
we'd need to hardcode the mapping between the 256 values to a bitmask which
inputs are actually used.
And then the question is how to represent that in RTL to make it clear that
some operands are mentioned but their value isn't really used.
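
For a sense of scale, the split described above can be tabulated by brute
force rather than hardcoded. The sketch below is an illustration under stated
assumptions: depends_on and count_by_inputs_used are made-up names, and the
immediate is treated as a truth table indexed by (A << 2 | B << 1 | C).

```c
/* Hypothetical helper for this sketch: does flipping the input selected
   by mask (4 = A, 2 = B, 1 = C) ever change the output of table imm?  */
static int
depends_on (unsigned imm, unsigned mask)
{
  for (unsigned idx = 0; idx < 8; idx++)
    if (((imm >> idx) & 1) != ((imm >> (idx ^ mask)) & 1))
      return 1;
  return 0;
}

/* Fill counts[n] with the number of immediates using exactly n inputs.  */
static void
count_by_inputs_used (unsigned counts[4])
{
  counts[0] = counts[1] = counts[2] = counts[3] = 0;
  for (unsigned imm = 0; imm < 256; imm++)
    counts[depends_on (imm, 4) + depends_on (imm, 2)
           + depends_on (imm, 1)]++;
}
```

This gives 2 immediates that use no input, 6 that use exactly one, 30 that use
exactly two, and 218 that use all three.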

[Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations

2023-06-10 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

Andrew Pinski  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|target                      |rtl-optimization
           Severity|normal                      |enhancement

--- Comment #2 from Andrew Pinski  ---
Note you get a warning in your negate1 case


: In function '__m512i negate1(const __m512i*)':
:7:36: warning: 'res' is used uninitialized [-Wuninitialized]
7 | res = _mm512_ternarylogic_epi64(res, res, *a, 0x55);
  |   ~^~~~
:6:13: note: 'res' was declared here
6 | __m512i res;
  | ^~~


But even doing this:
#include <immintrin.h>

__m512i negate1(const __m512i *a)
{
    __m512i res = _mm512_undefined_si512 ();
    res = _mm512_ternarylogic_epi64(res, res, *a, 0x55);
    return res;
}


Will cause an extra zeroing.