Hi Juzhe,

thanks, this looks pretty comprehensive already.
> +(define_expand "vec_perm<mode>"
> +  [(match_operand:V 0 "register_operand")
> +   (match_operand:V 1 "register_operand")
> +   (match_operand:V 2 "register_operand")
> +   (match_operand:<VINDEX> 3 "vector_perm_operand")]
> +  "TARGET_VECTOR && GET_MODE_NUNITS (<MODE>mode).is_constant ()"
> +  {
> +    riscv_vector::expand_vec_perm (operands[0], operands[1],
> +                                   operands[2], operands[3]);
> +    DONE;
> +  }
> +)

IMHO this would be the perfect use of expand_vec_perm (operands) instead
of the individual arguments (also the way aarch64 does it) but your call
in the end.

> +/* Return true if VEC is a constant in which every element is in the range
> +   [MINVAL, MAXVAL].  The elements do not need to have the same value.  */
> +
> +static bool
> +const_vec_all_in_range_p (rtx vec, HOST_WIDE_INT minval, HOST_WIDE_INT maxval)
> +{
> +  if (!CONST_VECTOR_P (vec)
> +      || GET_MODE_CLASS (GET_MODE (vec)) != MODE_VECTOR_INT)
> +    return false;
> +
> +  int nunits;
> +  if (!CONST_VECTOR_STEPPED_P (vec))
> +    nunits = const_vector_encoded_nelts (vec);
> +  else if (!CONST_VECTOR_NUNITS (vec).is_constant (&nunits))
> +    return false;
> +
> +  for (int i = 0; i < nunits; i++)
> +    {
> +      rtx vec_elem = CONST_VECTOR_ELT (vec, i);
> +      if (!CONST_INT_P (vec_elem)
> +          || !IN_RANGE (INTVAL (vec_elem), minval, maxval))
> +        return false;
> +    }
> +  return true;
> +}
> +
> +/* Return a const_int vector of VAL.  */
> +
> +static rtx
> +gen_const_vector_dup (machine_mode mode, HOST_WIDE_INT val)
> +{
> +  rtx c = gen_int_mode (val, GET_MODE_INNER (mode));
> +  return gen_const_vec_duplicate (mode, c);
> +}
> +

Both functions are taken from aarch64.  I would suggest adding comments
in that respect in case somebody wants to unify that in the middle-end
at some point (and wonders if they really do the same thing).

> +/* This function emits VLMAX vrgather instruction. Emit vrgather.vx/vi when sel
> +   is a const duplicate vector. Otherwise, emit vrgather.vv.
> +   */
> +static void
> +emit_vlmax_gather_insn (rtx target, rtx op, rtx sel)
> +{
> +  rtx elt;
> +  insn_code icode;
> +  machine_mode data_mode = GET_MODE (target);
> +  if (const_vec_duplicate_p (sel, &elt))
> +    {
> +      icode = code_for_pred_gather_scalar (data_mode);
> +      sel = elt;
> +    }
> +  else
> +    icode = code_for_pred_gather (data_mode);
> +  rtx ops[] = {target, op, sel};
> +  emit_vlmax_insn (icode, RVV_BINOP, ops);
> +}
> +
> +static void
> +emit_vlmax_masked_gather_mu_insn (rtx target, rtx op, rtx sel, rtx mask)
> +{
> +  rtx elt;
> +  insn_code icode;
> +  machine_mode data_mode = GET_MODE (target);
> +  if (const_vec_duplicate_p (sel, &elt))
> +    {
> +      icode = code_for_pred_gather_scalar (data_mode);
> +      sel = elt;
> +    }
> +  else
> +    icode = code_for_pred_gather (data_mode);
> +  rtx ops[] = {target, mask, target, op, sel};
> +  emit_vlmax_masked_mu_insn (icode, RVV_BINOP_MU, ops);
> +}
> +
> +/* Implement vec_perm<mode>.  */
> +
> +void
> +expand_vec_perm (rtx target, rtx op0, rtx op1, rtx sel)
> +{
> +  machine_mode data_mode = GET_MODE (target);
> +  machine_mode sel_mode = GET_MODE (sel);
> +  /* Enforced by the pattern condition.  */

This is supposed to refer to the const-ness of sel_mode?

> +  /* Note: vec_perm indices are supposed to wrap when they go beyond the
> +     size of the two value vectors, i.e. the upper bits of the indices
> +     are effectively ignored.  RVV vrgather instead produces 0 for any
> +     out-of-range indices, so we need to modulo all the vec_perm indices
> +     to ensure they are all in range.  */
> +  int nunits = GET_MODE_NUNITS (sel_mode).to_constant ();

Ah, I saw it in the aarch64 version.  It makes more sense when arranged
the way they did it, so I'd suggest using their comment/code order.
Where is the modulo actually happening, though?

> +  /* Check if the two values vectors are the same.
> +   */
> +  if (rtx_equal_p (op0, op1) || const_vec_duplicate_p (sel))
> +    {
> +      rtx max_sel = gen_const_vector_dup (sel_mode, nunits - 1);
> +      rtx sel_mod = expand_simple_binop (sel_mode, AND, sel, max_sel, NULL, 0,
> +                                         OPTAB_DIRECT);

Ah, here.  But why not also further down?  It gets clearer when reading
the whole function a second time, but a bit of a strategy comment up
front would help clear up misunderstandings.  Just thinking out loud as
I read it:

> +  rtx max_sel = gen_const_vector_dup (sel_mode, nunits);
> +
> +  /* Step 1: generate a mask that should select second vector.  */
> +  expand_vec_cmp (mask, GEU, sel, max_sel);

So we select everything >= nunits into the mask.

> +  /* Step2: gather a intermediate result for index < nunits,
> +     we don't need to care about the result of the element
> +     whose index >= nunits.  */
> +  emit_vlmax_gather_insn (target, op0, sel);

Then gather every element of op0 indexed by sel into target.  Elements
whose index is >= nunits are zeroed by vrgather.

> +  /* Step3: generate a new selector for second vector.  */
> +  rtx tmp = gen_reg_rtx (sel_mode);
> +  rtx ops[] = {tmp, sel, max_sel};
> +  emit_vlmax_insn (code_for_pred (MINUS, sel_mode), RVV_BINOP, ops);

Now we shift the range from [nunits, max_of_mode] to
[0, max_of_mode - nunits].

> +  /* Step4: merge the result of index >= nunits.  */
> +  emit_vlmax_masked_gather_mu_insn (target, op1, tmp, mask);
> +}

And now we gather those into the previously masked-out elements of
target.  Where do we actually ensure that indices >= 2 * nunits are
taken modulo and not zeroed?  If it happens somewhere, I didn't spot it,
and it certainly warrants a comment.

Are we actually testing the wraparound/modulo indices?  I didn't
immediately see it but might have overlooked something.  If so, please
add a comment in the test itself about what we expect to happen, or
scan for an "and" instruction or so.

Is builtin_shufflevector undefined for values >= 2 * nunits?  That
would explain it partially.

Regards
 Robin