https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #37 from Richard Biener ---
So my analysis was partly wrong and the vpinsrq isn't an issue for the
benchmark but only the spilling is.
Note that the other idea of disparaging vector CTORs more like with
diff --git
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #36 from Richard Biener ---
(In reply to Richard Biener from comment #35)
> (In reply to Richard Biener from comment #33)
> > Created attachment 50308 [details]
> > patch
> >
> > I am testing the following.
>
> It FAILs
>
> FAIL:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #35 from Richard Biener ---
(In reply to Richard Biener from comment #33)
> Created attachment 50308 [details]
> patch
>
> I am testing the following.
It FAILs
FAIL: gcc.target/i386/avx512dq-concatv2di-1.c scan-assembler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #34 from Uroš Bizjak ---
(In reply to rguent...@suse.de from comment #32)
> what about reload_completed? We really only want to do this after RA.
No need for it, this is peephole2 pass that *always* runs after reload.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #33 from Richard Biener ---
Created attachment 50308
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50308&action=edit
patch
I am testing the following.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #32 from rguenther at suse dot de ---
On Fri, 5 Mar 2021, ubizjak at gmail dot com wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
>
> --- Comment #31 from Uroš Bizjak ---
> (In reply to Richard Biener from comment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #31 from Uroš Bizjak ---
(In reply to Richard Biener from comment #29)
> The simplified variant below works but IMHO matches cases we do not
> want to transform. I can't find any example on how to achieve that
> though.
I think
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #30 from Jakub Jelinek ---
(In reply to Richard Biener from comment #29)
> I suppose the reason is that there's two unrelated insns between the
> xmm0 = cx:DI and the vec_concat. Which would hint that we somehow
> need to not match
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #29 from Richard Biener ---
(In reply to Uroš Bizjak from comment #27)
> (In reply to Richard Biener from comment #26)
> > but that doesn't seem to match for some unknown reason.
>
> Try this:
>
> (define_peephole2
>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #28 from Uroš Bizjak ---
(In reply to Uroš Bizjak from comment #27)
> (In reply to Richard Biener from comment #26)
> > but that doesn't seem to match for some unknown reason.
> Try this:
The latency problem with the original
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #27 from Uroš Bizjak ---
(In reply to Richard Biener from comment #26)
> but that doesn't seem to match for some unknown reason.
Try this:
(define_peephole2
[(match_scratch:DI 5 "Yv")
(set (match_operand:DI 0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #26 from Richard Biener ---
(In reply to rguent...@suse.de from comment #25)
> On Fri, 5 Mar 2021, ubizjak at gmail dot com wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
> >
> > --- Comment #24 from Uroš Bizjak
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #25 from rguenther at suse dot de ---
On Fri, 5 Mar 2021, ubizjak at gmail dot com wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
>
> --- Comment #24 from Uroš Bizjak ---
> (In reply to Richard Biener from comment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #24 from Uroš Bizjak ---
(In reply to Richard Biener from comment #22)
> That works to avoid the vpinsrq. I guess the case of a mem operand
> behaves similar to a gpr (plus the load uop), at least I don't have any
> contrary
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #23 from Richard Biener ---
Created attachment 50300
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50300&action=edit
preprocessed source of the important Botan TU
This is the full preprocessed source of the TU. When compiled with
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #22 from Richard Biener ---
(In reply to Uroš Bizjak from comment #21)
> (In reply to Uroš Bizjak from comment #20)
> > (In reply to Richard Biener from comment #18)
> > > Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #21 from Uroš Bizjak ---
(In reply to Uroš Bizjak from comment #20)
> (In reply to Richard Biener from comment #18)
> > Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 6 vs. 3. Not
> > sure if we should somehow do this
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #20 from Uroš Bizjak ---
(In reply to Richard Biener from comment #18)
> Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 6 vs. 3. Not
> sure if we should somehow do this late somehow (peephole or splitter) since
> it
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #19 from Richard Biener ---
So to recover performance we need both, avoiding the latency on the vector plus
avoiding the spilling. This variant is fast:
.L56:
.cfi_restore_state
vmovdqu (%rsi), %xmm4
movq
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #18 from Richard Biener ---
There's another thing - we end up with
vmovq %rax, %xmm3
vpinsrq $1, %rdx, %xmm3, %xmm0
but that has way worse latency than the alternative you'd get w/o SSE 4.1:
vmovq %rax,
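For readers who want to reproduce this kind of code generation, the construct under discussion is a two-lane vector built from two 64-bit scalars. The following is only a minimal sketch, assuming the GCC/Clang vector extension; the names `v2di` and `make_v2di` are illustrative and do not come from the bug report. Which instruction sequence the compiler picks for the constructor depends on the target and options.

```c
#include <stdint.h>

/* Two 64-bit lanes, declared with the GCC/Clang vector extension.
   The typedef name is illustrative, not taken from the bug report. */
typedef uint64_t v2di __attribute__((vector_size (16)));

/* Build a vector from two scalars that live in general-purpose
   registers.  How this constructor is materialized (an insert into
   an existing vector, or two moves plus an unpack) is decided by the
   compiler for the selected target. */
v2di
make_v2di (uint64_t lo, uint64_t hi)
{
  v2di v = { lo, hi };
  return v;
}
```

Compiling such a function with `-O2 -S`, with and without SSE 4.1 enabled, is one way to compare the emitted sequences.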
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Richard Biener changed:
What|Removed |Added
CC||vmakarov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #16 from Jakub Jelinek ---
Created attachment 50142
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50142&action=edit
gcc11-pr98856.patch
Full patch.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #15 from Jakub Jelinek ---
The needed permutations for this boil down to
typedef int V __attribute__((vector_size (16)));
typedef int W __attribute__((vector_size (32)));
#ifdef __clang__
V f1 (V x) { return __builtin_shufflevector
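The snippet above is cut off, so the actual permutations are not visible here. As a hedged illustration of the same pattern (one source, two compiler-specific shuffle builtins), the sketch below uses a hypothetical lane reversal; `reverse_lanes` and the index order are assumptions made for the example, not the permutations from the comment.

```c
typedef int V __attribute__((vector_size (16)));

/* Hypothetical permutation chosen for illustration only:
   reverse the four 32-bit lanes of x. */
V
reverse_lanes (V x)
{
#ifdef __clang__
  /* Clang: indices are passed as constant arguments. */
  return __builtin_shufflevector (x, x, 3, 2, 1, 0);
#else
  /* GCC: indices are passed as an integer vector mask. */
  return __builtin_shuffle (x, (V) { 3, 2, 1, 0 });
#endif
}
```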
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #14 from Jakub Jelinek ---
WIP that implements that. Except that we need some permutation expansion
improvements, both for the SSE2 V4SImode permutation cases and for AVX2
V8SImode permutation cases.
--- gcc/config/i386/sse.md.jj
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #13 from Jakub Jelinek ---
Looking at what other compilers emit for this, ICC seems to be completely
broken, it emits logical right shifts instead of arithmetic right shift, and
LLVM trunk emits for >> 63 what this patch emits, for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #12 from Jakub Jelinek ---
V4DImode arithmetic right shifts would be (untested):
--- gcc/config/i386/sse.md.jj 2021-02-05 14:32:44.175463716 +0100
+++ gcc/config/i386/sse.md 2021-02-05 15:24:37.942026401 +0100
@@ -12458,7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Jakub Jelinek changed:
What|Removed |Added
CC||uros at gcc dot gnu.org
--- Comment #11
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #10 from Richard Biener ---
(In reply to Jakub Jelinek from comment #9)
> For arithmetic >> (element_precision - 1) one can just use
> {,v}pxor + {,v}pcmpgtq, as in instead of return vec >> 63; do return vec < 0;
> (in C++-ish way),
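The equivalence quoted above can be checked with a small sketch using the GCC/Clang vector extension. The function names are illustrative, and the shift variant assumes the compiler implements `>>` on signed values as an arithmetic shift (GCC documents this behavior). The comparison uses an explicit zero vector rather than a scalar `0` to stay within what both compilers accept.

```c
#include <stdint.h>

typedef int64_t v2di __attribute__((vector_size (16)));

/* Arithmetic shift by (element precision - 1): each lane becomes
   all-ones when the lane is negative and zero otherwise. */
v2di
sign_mask_shift (v2di v)
{
  return v >> 63;
}

/* The same mask obtained by comparing each lane against zero.
   A vector comparison yields -1 for true lanes and 0 for false ones. */
v2di
sign_mask_compare (v2di v)
{
  v2di zero = { 0, 0 };
  return (v2di) (v < zero);
}
```

Both functions return the same mask for any input, which is why the comparison form can stand in for the missing V2DI arithmetic right shift in this special case.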
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Jakub Jelinek changed:
What|Removed |Added
CC||jakub at gcc dot gnu.org
--- Comment #9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #8 from Richard Biener ---
exploring more options I noticed there's no arithmetic vector V2DI right shift,
so vectorizing
uint64_t carry = (uint64_t)(((int64_t)W[1]) >> 63) & (uint64_t)135;
W[1] = (W[1] << 1) ^
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #7 from Richard Biener ---
OK, and the spill is likely because we expand as
(insn 7 6 0 (set (reg:TI 84 [ _9 ])
(mem:TI (reg/v/f:DI 93 [ in ]) [0 MEM <__int128 unsigned> [(char *
{ref-all})in_8(D)]+0 S16 A8])) -1
(nil))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #6 from Richard Biener ---
The following testcase reproduces the assembly:
typedef __UINT64_TYPE__ uint64_t;
void poly_double_le2 (unsigned char *out, const unsigned char *in)
{
uint64_t W[2];
__builtin_memcpy (&W, in, 16);
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #5 from Richard Biener ---
Looks like STLF issues. There's a ls_stlf counter, with SLP vectorization
disabled I see
34.39% 1417 botanlibbotan-2.so.17 [.]
Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul,
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #4 from Richard Biener ---
Slow:
Samples: 4K of event 'cycles:u', Event count (approx.): 4565667242
Overhead Samples Command Shared Object Symbol
30.88% 1252 botan
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #3 from Martin Liška ---
(In reply to Richard Biener from comment #2)
> The cxx bench Botan doesn't know --cxxflags, what Botan version are you
> looking at?
I used this fixed version:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #2 from Richard Biener ---
The cxx bench Botan doesn't know --cxxflags, what Botan version are you
looking at?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Richard Biener changed:
What|Removed |Added
Status|NEW |ASSIGNED
Target Milestone|---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Martin Liška changed:
What|Removed |Added
Known to fail||11.0
Last reconfirmed|
39 matches