[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-08 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 Alexander Monakov changed: CC: added amonakov at gcc dot gnu.org ---

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #37 from Richard Biener --- So my analysis was partly wrong and the vpinsrq isn't an issue for the benchmark but only the spilling is. Note that the other idea of disparaging vector CTORs more like with diff --git

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #36 from Richard Biener --- (In reply to Richard Biener from comment #35) > (In reply to Richard Biener from comment #33) > > Created attachment 50308 [details] > > patch > > > > I am testing the following. > > It FAILs > > FAIL:

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #35 from Richard Biener --- (In reply to Richard Biener from comment #33) > Created attachment 50308 [details] > patch > > I am testing the following. It FAILs FAIL: gcc.target/i386/avx512dq-concatv2di-1.c scan-assembler

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-05 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #34 from Uroš Bizjak --- (In reply to rguent...@suse.de from comment #32) > what about reload_completed? We really only want to do this after RA. No need for it, this is peephole2 pass that *always* runs after reload.

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #33 from Richard Biener --- Created attachment 50308 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50308&action=edit patch I am testing the following.

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-05 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #32 from rguenther at suse dot de --- On Fri, 5 Mar 2021, ubizjak at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 > > --- Comment #31 from Uroš Bizjak --- > (In reply to Richard Biener from comment

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-05 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #31 from Uroš Bizjak --- (In reply to Richard Biener from comment #29) > The simplified variant below works but IMHO matches cases we do not > want to transform. I can't find any example on how to achieve that > though. I think

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-05 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #30 from Jakub Jelinek --- (In reply to Richard Biener from comment #29) > I suppose the reason is that there's two unrelated insns between the > xmm0 = cx:DI and the vec_concat. Which would hint that we somehow > need to not match

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #29 from Richard Biener --- (In reply to Uroš Bizjak from comment #27) > (In reply to Richard Biener from comment #26) > > but that doesn't seem to match for some unknown reason. > > Try this: > > (define_peephole2 >

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-05 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #28 from Uroš Bizjak --- (In reply to Uroš Bizjak from comment #27) > (In reply to Richard Biener from comment #26) > > but that doesn't seem to match for some unknown reason. > Try this: The latency problem with the original

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-05 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #27 from Uroš Bizjak --- (In reply to Richard Biener from comment #26) > but that doesn't seem to match for some unknown reason. Try this: (define_peephole2 [(match_scratch:DI 5 "Yv") (set (match_operand:DI 0

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #26 from Richard Biener --- (In reply to rguent...@suse.de from comment #25) > On Fri, 5 Mar 2021, ubizjak at gmail dot com wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 > > > > --- Comment #24 from Uroš Bizjak

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-05 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #25 from rguenther at suse dot de --- On Fri, 5 Mar 2021, ubizjak at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 > > --- Comment #24 from Uroš Bizjak --- > (In reply to Richard Biener from comment

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-05 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #24 from Uroš Bizjak --- (In reply to Richard Biener from comment #22) > That works to avoid the vpinsrq. I guess the case of a mem operand > behaves similar to a gpr (plus the load uop), at least I don't have any > contrary

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #23 from Richard Biener --- Created attachment 50300 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50300&action=edit preprocessed source of the important Botan TU This is the full preprocessed source of the TU. When compiled with

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #22 from Richard Biener --- (In reply to Uroš Bizjak from comment #21) > (In reply to Uroš Bizjak from comment #20) > > (In reply to Richard Biener from comment #18) > > > Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 6

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-04 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #21 from Uroš Bizjak --- (In reply to Uroš Bizjak from comment #20) > (In reply to Richard Biener from comment #18) > > Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 6 vs. 3. Not > > sure if we should somehow do this

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-04 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #20 from Uroš Bizjak --- (In reply to Richard Biener from comment #18) > Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 6 vs. 3. Not > sure if we should somehow do this late somehow (peephole or splitter) since > it

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #19 from Richard Biener --- So to recover performance we need both, avoiding the latency on the vector plus avoiding the spilling. This variant is fast: .L56: .cfi_restore_state vmovdqu (%rsi), %xmm4 movq

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #18 from Richard Biener --- There's another thing - we end up with vmovq %rax, %xmm3 vpinsrq $1, %rdx, %xmm3, %xmm0 but that has way worse latency than the alternative you'd get w/o SSE 4.1: vmovq %rax,

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 Richard Biener changed: CC: added vmakarov at gcc dot gnu.org

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-02-08 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #16 from Jakub Jelinek --- Created attachment 50142 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50142&action=edit gcc11-pr98856.patch Full patch.

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-02-05 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #15 from Jakub Jelinek --- The needed permutations for this boil down to typedef int V __attribute__((vector_size (16))); typedef int W __attribute__((vector_size (32))); #ifdef __clang__ V f1 (V x) { return __builtin_shufflevector

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-02-05 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #14 from Jakub Jelinek --- WIP that implements that. Except that we need some permutation expansion improvements, both for the SSE2 V4SImode permutation cases and for AVX2 V8SImode permutation cases. --- gcc/config/i386/sse.md.jj

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-02-05 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #13 from Jakub Jelinek --- Looking at what other compilers emit for this, ICC seems to be completely broken, it emits logical right shifts instead of arithmetic right shift, and LLVM trunk emits for >> 63 what this patch emits, for

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-02-05 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #12 from Jakub Jelinek --- V4DImode arithmetic right shifts would be (untested): --- gcc/config/i386/sse.md.jj 2021-02-05 14:32:44.175463716 +0100 +++ gcc/config/i386/sse.md 2021-02-05 15:24:37.942026401 +0100 @@ -12458,7

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-02-05 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 Jakub Jelinek changed: CC: added uros at gcc dot gnu.org --- Comment #11

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-02-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #10 from Richard Biener --- (In reply to Jakub Jelinek from comment #9) > For arithmetic >> (element_precision - 1) one can just use > {,v}pxor + {,v}pcmpgtq, as in instead of return vec >> 63; do return vec < 0; > (in C++-ish way),
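
The equivalence mentioned in comment #9 can be checked with a small sketch. It assumes the GCC/Clang vector extensions; the type name `v2di` and both function names are made up for illustration and do not come from the patch. An arithmetic right shift by 63 smears the sign bit across each 64-bit lane, which is the same all-ones/all-zeros mask that a lane-wise `< 0` comparison yields.

```c
#include <stdint.h>

/* GCC/Clang vector extension: two signed 64-bit lanes in 16 bytes.
   The name v2di is illustrative, not taken from the thread. */
typedef int64_t v2di __attribute__((vector_size(16)));

/* Arithmetic shift form: each lane becomes all-ones if it was
   negative and zero otherwise. */
v2di sign_mask_shift(v2di v)
{
    return v >> 63;
}

/* Comparison form: a vector comparison yields -1 (true) or 0 (false)
   in each lane, so this computes the same mask. */
v2di sign_mask_compare(v2di v)
{
    v2di zero = {0, 0};
    return (v2di)(v < zero);
}
```

Whether the compiler emits {,v}pxor + {,v}pcmpgtq for either form depends on the target flags and GCC version; the sketch only shows that the two expressions agree in value.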

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-02-05 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 Jakub Jelinek changed: CC: added jakub at gcc dot gnu.org --- Comment #9

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-02-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #8 from Richard Biener --- exploring more options I noticed there's no arithmetic vector V2DI right shift, so vectorizing uint64_t carry = (uint64_t)(((int64_t)W[1]) >> 63) & (uint64_t)135; W[1] = (W[1] << 1) ^
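
For context, the carry expression quoted in comment #8 is the reduction step of doubling a 128-bit value in GF(2^128), as used for the XTS tweak. The snippet above is cut off, so the sketch below is an assumption about the surrounding computation and not the Botan source: the function name `poly_double_128_le` and the two helpers are hypothetical, and the second and third assignments follow the usual little-endian doubling rule.

```c
#include <stdint.h>

/* Load a 64-bit little-endian word regardless of host byte order. */
static uint64_t load_le64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; --i)
        v = (v << 8) | p[i];
    return v;
}

/* Store a 64-bit word as little-endian bytes. */
static void store_le64(unsigned char *p, uint64_t v)
{
    for (int i = 0; i < 8; ++i) {
        p[i] = (unsigned char)(v & 0xff);
        v >>= 8;
    }
}

/* Hypothetical name: doubles a 128-bit little-endian value in GF(2^128). */
void poly_double_128_le(unsigned char *out, const unsigned char *in)
{
    uint64_t W[2] = { load_le64(in), load_le64(in + 8) };

    /* 135 (0x87) if the top bit of the high word is set, else 0.
       Relies on >> of a negative signed value being an arithmetic
       shift, which GCC guarantees (implementation-defined in ISO C). */
    uint64_t carry = (uint64_t)(((int64_t)W[1]) >> 63) & (uint64_t)135;

    /* Shift the 128-bit value left by one, moving the top bit of the
       low word into the high word, then fold in the reduction. */
    W[1] = (W[1] << 1) ^ (W[0] >> 63);
    W[0] = (W[0] << 1) ^ carry;

    store_le64(out, W[0]);
    store_le64(out + 8, W[1]);
}
```

The `>> 63` on the signed high word is the V2DI arithmetic right shift that the comment says has no direct vector instruction.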

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-01-28 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #7 from Richard Biener --- OK, and the spill is likely because we expand as (insn 7 6 0 (set (reg:TI 84 [ _9 ]) (mem:TI (reg/v/f:DI 93 [ in ]) [0 MEM <__int128 unsigned> [(char * {ref-all})in_8(D)]+0 S16 A8])) -1 (nil))

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-01-28 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #6 from Richard Biener --- The following testcase reproduces the assembly: typedef __UINT64_TYPE__ uint64_t; void poly_double_le2 (unsigned char *out, const unsigned char *in) { uint64_t W[2]; __builtin_memcpy (&W, in, 16);

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-01-28 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #5 from Richard Biener --- Looks like STLF issues. There's a ls_stlf counter, with SLP vectorization disabled I see 34.39% 1417 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul,

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-01-28 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #4 from Richard Biener --- Slow: Samples: 4K of event 'cycles:u', Event count (approx.): 4565667242 Overhead Samples Command Shared Object Symbol 30.88% 1252 botan

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-01-28 Thread marxin at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #3 from Martin Liška --- (In reply to Richard Biener from comment #2) > The cxx bench Botan doesn't know --cxxflags, what Botan version are you > looking at? I used this fixed version:

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-01-27 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 --- Comment #2 from Richard Biener --- The cxx bench Botan doesn't know --cxxflags, what Botan version are you looking at?

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-01-27 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 Richard Biener changed: What|Removed |Added Status|NEW |ASSIGNED Target Milestone|---

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-01-27 Thread marxin at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856 Martin Liška changed: What|Removed |Added Known to fail||11.0 Last reconfirmed|