On Wed, Jun 14, 2023 at 5:32 PM Jan Beulich <jbeul...@suse.com> wrote:
>
> On 14.06.2023 10:10, Hongtao Liu wrote:
> > On Wed, Jun 14, 2023 at 1:59 PM Jan Beulich via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >> There's no reason to constrain this to AVX512VL, as the wider operation
> >> is not usable for more narrow operands only when the possible memory
> > But this may require more resources (on the AMD znver4 processor a zmm
> > instruction will also be split into 2 uops, right?), and on some Intel
> > processors (SKX/CLX) there will be a frequency reduction.
>
> I'm afraid I don't follow: Largely the same AVX512 code would be
> generated when passing -mavx512vl, so how can power/performance
> considerations matter here? All I'm doing here (and in a few more
Yes, for -march=*** it is ok since AVX512VL is included.
What your patch improves is -mavx512f -mno-avx512vl, but for specific
option combinations like -mavx512f -mprefer-vector-width=256
-mno-avx512vl, your patch will produce zmm instructions, which is not
expected.
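For example (just a made-up testcase to illustrate the point, not from
the testsuite):

  /* gcc -O2 -mavx512f -mprefer-vector-width=256 -mno-avx512vl */
  double
  foo (double x, double y)
  {
    return __builtin_copysign (x, y);
  }

Without AVX512VL the only EVEX form of vpternlog is the zmm one, so
with the patch the scalar copysign above may end up as a zmm
instruction even though the user asked to stay within 256-bit vectors.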
> patches I'm still in the process of testing) is relax when AVX512
> insns can actually be used (reducing the copying between registers
> and/or the number of insns needed). My understanding on the Intel
> side is that it only matters whether AVX512 insns are used, not
No, the vector length matters: ymm/xmm EVEX insns are ok to use, but zmm
insns will cause a frequency reduction.
> what vector length they are. You may be right about znver4, though.
>
> Nevertheless I agree ...
>
> > If it needs to be done, it is better guarded with
> > !TARGET_PREFER_AVX256: at least when the micro-architecture tuning has
> > AVX256_OPTIMAL, or when users explicitly use -mprefer-vector-width=256,
> > we don't want to produce any zmm instruction by surprise. (Although
> > -mprefer-vector-width=256 is meant for the auto-vectorizer, the backend
> > codegen also uses it in such cases, e.g. *movsf_internal alternative 5
> > uses zmm only when TARGET_AVX512F && !TARGET_PREFER_AVX256.)
>
> ... that respecting such overrides is probably desirable, so I'll
> adjust.
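Thanks. Something along the lines of what *movsf_internal does would
address my concern, e.g. (just a rough sketch of the idea, the exact
form of the insn condition is of course up to you):

  TARGET_AVX512F
  && (<MODE_SIZE> == 64 || TARGET_AVX512VL || !TARGET_PREFER_AVX256)

i.e. keep the 64-byte modes as they are, but only widen the narrower
modes to zmm when 512-bit vectors are not disfavoured.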
>
> Jan
>
> >> source is a non-broadcast one. This way even the scalar copysign<mode>3
> >> can benefit from the operation being a single-insn one (leaving aside
> >> moves which the compiler decides to insert for unclear reasons, and
> >> leaving aside the fact that bcst_mem_operand() is too restrictive for
> >> broadcast to be embedded right into VPTERNLOG*).
> >>
> >> Along with this also request value duplication in
> >> ix86_expand_copysign()'s call to ix86_build_signbit_mask(), eliminating
> >> excess space allocation in .rodata.*, filled with zeros which are never
> >> read.
> >>
> >> gcc/
> >>
> >>         * config/i386/i386-expand.cc (ix86_expand_copysign): Request
> >>         value duplication by ix86_build_signbit_mask() when AVX512F and
> >>         not HFmode.
> >>         * config/i386/sse.md (*<avx512>_vternlog<mode>_all): Convert to
> >>         2-alternative form. Adjust "mode" attribute. Add "enabled"
> >>         attribute.
> >>         (*<avx512>_vpternlog<mode>_1): Relax to just TARGET_AVX512F.
> >>         (*<avx512>_vpternlog<mode>_2): Likewise.
> >>         (*<avx512>_vpternlog<mode>_3): Likewise.
>


-- 
BR,
Hongtao
