On Tue, Jun 3, 2025 at 2:59 PM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> Extend the remove_redundant_vector pass to handle vector broadcasts from
> constant and variable scalars.  When broadcasting from constants and
> function arguments, we can place a single widest vector broadcast at
> entry of the nearest common dominator for basic blocks with all uses
> since constants and function arguments aren't changed.  For broadcast
> from variables with a single definition, the single definition is
> replaced with the widest broadcast.
>
> gcc/
>
> 	PR target/92080
> 	* config/i386/i386-expand.cc (ix86_expand_call): Set
> 	recursive_function to true for recursive call.
> 	* config/i386/i386-features.cc (ix86_place_single_vector_set):
> 	Add an argument for inner scalar, default to nullptr.  Set the
> 	source from inner scalar if not nullptr.
> 	(ix86_get_vector_load_mode): Add an argument for scalar mode and
> 	handle integer and float scalar modes.
> 	(replace_vector_const): Add an argument for scalar mode and pass
> 	it to ix86_get_vector_load_mode.
> 	(redundant_load_kind): New.
> 	(redundant_load): Likewise.
> 	(ix86_broadcast_inner): Likewise.
> 	(remove_redundant_vector_load): Also support const0_rtx and
> 	constm1_rtx broadcasts.  Handle vector broadcasts from constant
> 	and variable scalars.
> 	* config/i386/i386.h (machine_function): Add recursive_function.
>
> gcc/testsuite/
>
> 	* gcc.target/i386/keylocker-aesdecwide128kl.c: Updated to expect
> 	movdqa instead pxor.
> 	* gcc.target/i386/keylocker-aesdecwide256kl.c: Likewise.
> 	* gcc.target/i386/keylocker-aesencwide128kl.c: Likewise.
> 	* gcc.target/i386/keylocker-aesencwide256kl.c: Likewise.
> 	* gcc.target/i386/pr92080-4.c: New test.
> 	* gcc.target/i386/pr92080-5.c: Likewise.
> 	* gcc.target/i386/pr92080-6.c: Likewise.
> 	* gcc.target/i386/pr92080-7.c: Likewise.
> 	* gcc.target/i386/pr92080-8.c: Likewise.
> 	* gcc.target/i386/pr92080-9.c: Likewise.
> 	* gcc.target/i386/pr92080-10.c: Likewise.
> 	* gcc.target/i386/pr92080-11.c: Likewise.
> 	* gcc.target/i386/pr92080-12.c: Likewise.
> 	* gcc.target/i386/pr92080-13.c: Likewise.
> 	* gcc.target/i386/pr92080-14.c: Likewise.
> 	* gcc.target/i386/pr92080-15.c: Likewise.
> 	* gcc.target/i386/pr92080-16.c: Likewise.
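Just to make sure I read the intent right, the kind of source this should
now handle is something like the following.  This is my own illustration,
not taken from the new pr92080 tests; d128, d256 and foo are made-up names:

  #include <immintrin.h>

  extern __m128i d128;
  extern __m256i d256;

  void
  foo (short c)
  {
    /* Both stores broadcast the same scalar argument C; the pass should
       emit a single 256-bit broadcast of C at the nearest common
       dominator and reuse its low half for the 128-bit store.  */
    d128 = _mm_set1_epi16 (c);
    d256 = _mm256_set1_epi16 (c);
  }

Some comments on the patch below.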
>+  machine_mode mode = VOIDmode;
>+  fixed_size_mode candidate;
>+  FOR_EACH_MODE_IN_CLASS (mode, vklass)
>+    if (is_a<fixed_size_mode> (mode, &candidate)
>+        && GET_MODE_INNER (candidate) == scalar_mode
>+        && GET_MODE_SIZE (candidate) == size)
>+      return mode;
>+
>+  gcc_unreachable ();

Can we just use default_vectorize_related_mode to get the wanted mode, or
reuse the code in it?  (A rough sketch of what I mean is at the end of
this mail.)

/* The default implementation of TARGET_VECTORIZE_RELATED_MODE.  */

opt_machine_mode
default_vectorize_related_mode (machine_mode vector_mode,
                                scalar_mode element_mode,
                                poly_uint64 nunits)
{
  machine_mode result_mode;
  if ((maybe_ne (nunits, 0U)
       || multiple_p (GET_MODE_SIZE (vector_mode),
                      GET_MODE_SIZE (element_mode), &nunits))
      && mode_for_vector (element_mode, nunits).exists (&result_mode)
      && VECTOR_MODE_P (result_mode)
      && targetm.vector_mode_supported_p (result_mode))
    return result_mode;

  return opt_machine_mode ();
}

>+  else if (CONST_VECTOR_P (op))
>+    {
>+      rtx first = XVECEXP (op, 0, 0);
>+      for (int i = 1; i < nunits; ++i)
>+        {
>+          rtx tmp = XVECEXP (op, 0, i);
>+          /* Vector duplicate value.  */
>+          if (!rtx_equal_p (tmp, first))
>+            return nullptr;
>+        }
>+      if (!CONSTANT_P (first))
>+        return nullptr;

Is the !CONSTANT_P check really needed?  Do we have a case where a
CONST_VECTOR_P has a non-constant element?  Also, I assume the all_same
&& CONST_VECTOR_P case is already handled in ix86_expand_vector_init, so
do we really need to handle it here?

  /* If all values are identical, broadcast the value.  */
  if (all_same
      && ix86_expand_vector_init_duplicate (mmx_ok, mode, target,
                                            XVECEXP (vals, 0, 0)))
    return;

>+  /* Check if there is a matching redundant vector load.  */
>+  bool matched = false;
>+  FOR_EACH_VEC_ELT (loads, i, load)

This linear walk over all loads looks expensive.  Can we try a hash,
like cse_insn does?

>+    if (load->val
>+        && load->kind == kind
>+        && load->mode == scalar_mode
>+        /* Since CONST_INT load doesn't need memory, it must
>+           be in the same basic block if it is in a recursive
>+           call.  */

This part is a bit tricky.  It's used to avoid some regression, which I
guess just exposes some latent issue?  And I don't see any reason why
recursive_call_p needs to be excluded for CONST_INT loads in the same bb.

>+        && (!recursive_call_p
>+            || load->bb == bb
>+            || !(CONST_INT_P (load->val)
>+                 && load->kind == LOAD_VECTOR))
>+        && rtx_equal_p (load->val, val))
>+      {

>
> --
> H.J.

--
BR,
Hongtao
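P.S. Here is a rough, untested sketch of what I mean by reusing the
default_vectorize_related_mode logic instead of scanning the whole mode
class.  scalar_mode and size are the names from the quoted hunk; I have
not tried to compile this:

  /* Untested sketch: compute the number of units directly from the
     requested byte SIZE and the element mode, then ask mode_for_vector,
     mirroring default_vectorize_related_mode.  */
  poly_uint64 nunits;
  machine_mode result_mode;
  if (multiple_p (size, GET_MODE_SIZE (scalar_mode), &nunits)
      && mode_for_vector (scalar_mode, nunits).exists (&result_mode)
      && VECTOR_MODE_P (result_mode)
      && targetm.vector_mode_supported_p (result_mode))
    return result_mode;

  gcc_unreachable ();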