On Wed, Jun 14, 2023 at 5:32 PM Jan Beulich <jbeul...@suse.com> wrote:
>
> On 14.06.2023 10:10, Hongtao Liu wrote:
> > On Wed, Jun 14, 2023 at 1:59 PM Jan Beulich via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >> There's no reason to constrain this to AVX512VL, as the wider operation
> >> is not usable for more narrow operands only when the possible memory
> > But this may require more resources (on the AMD znver4 processor a zmm
> > instruction will also be split into 2 uops, right?). And on some Intel
> > processors (SKX/CLX) there will be frequency reduction.
>
> I'm afraid I don't follow: Largely the same AVX512 code would be
> generated when passing -mavx512vl, so how can power/performance
> considerations matter here? All I'm doing here (and in a few more
Yes, for -march=*** it is ok since AVX512VL is included. What your patch
improves is -mavx512f -mno-avx512vl, but for specific option combinations
like -mavx512f -mprefer-vector-width=256 -mno-avx512vl, your patch will
produce zmm instructions, which is not expected.
> patches I'm still in the process of testing) is relax when AVX512
> insns can actually be used (reducing the copying between registers
> and/or the number of insns needed). My understanding on the Intel
> side is that it only matters whether AVX512 insns are used, not
No, vector length matters: ymm/xmm EVEX insns are ok to use, but zmm
insns will cause frequency reduction.
> what vector length they are. You may be right about znver4, though.
>
> Nevertheless I agree ...
>
> > If it needs to be done, it is better guarded with
> > !TARGET_PREFER_AVX256; at least when the micro-architecture is
> > AVX256_OPTIMAL, or when users explicitly use -mprefer-vector-width=256,
> > we don't want to produce any zmm instruction as a surprise. (Although
> > -mprefer-vector-width=256 is intended for the auto-vectorizer, the
> > backend codegen also uses it in such cases, i.e. *movsf_internal
> > alternative 5 uses zmm only if TARGET_AVX512F && !TARGET_PREFER_AVX256.)
>
> ...
> that respecting such overrides is probably desirable, so I'll
> adjust.
>
> Jan
>
> >> source is a non-broadcast one. This way even the scalar copysign<mode>3
> >> can benefit from the operation being a single-insn one (leaving aside
> >> moves which the compiler decides to insert for unclear reasons, and
> >> leaving aside the fact that bcst_mem_operand() is too restrictive for
> >> broadcast to be embedded right into VPTERNLOG*).
> >>
> >> Along with this also request value duplication in
> >> ix86_expand_copysign()'s call to ix86_build_signbit_mask(), eliminating
> >> excess space allocation in .rodata.*, filled with zeros which are never
> >> read.
> >>
> >> gcc/
> >>
> >> * config/i386/i386-expand.cc (ix86_expand_copysign): Request
> >> value duplication by ix86_build_signbit_mask() when AVX512F and
> >> not HFmode.
> >> * config/i386/sse.md (*<avx512>_vternlog<mode>_all): Convert to
> >> 2-alternative form. Adjust "mode" attribute. Add "enabled"
> >> attribute.
> >> (*<avx512>_vpternlog<mode>_1): Relax to just TARGET_AVX512F.
> >> (*<avx512>_vpternlog<mode>_2): Likewise.
> >> (*<avx512>_vpternlog<mode>_3): Likewise.
>
--
BR,
Hongtao