On Tue, Sep 16, 2025 at 3:07 PM Robin Dapp <rdapp....@gmail.com> wrote: > > > I think this now conflicts a bit with what I just pushed (sorry). > > > >> && loop_vinfo) > >> { > >> + unsigned i, j; > >> + bool simple_perm_series = true; > >> + FOR_EACH_VEC_ELT (SLP_TREE_LOAD_PERMUTATION (slp_node), i, j) > >> + if (i != j) > >> + simple_perm_series = false; > > > > In particular I disallow all load permutes, since we are going to elide it. > > The load permutation in the x264 case is simple, {0, 1, 2, 3}, {4, 5, 6, 7}, > etc. If we're disallowing these as well here there is no point in having the > new function ;) > > So we're not allowing load permutations here because we would elide (=not > generate code for) them via > > /* ??? The following checks should really be part of > > get_load_store_type. */ > > if (SLP_TREE_LOAD_PERMUTATION (slp_node).exists () > even though perm_ok = true?
Yes. > Isn't this all single-element, single-lane anyway? Well, what you want to catch now isn't single-lane anymore. But I guess since we now check the permute before this we can rely on check for n_perms == 0 to catch the "no actual permutation required" case? > Can I re-allow simple permutations (as they are nops anyway)? Or rather add a > vect_transform_slp_perm_load in the mat_gather_scatter_p part of > vectorizable_load? So can you pass down n_perms and check that instead of your loop over the permutation? > The case I was trying to avoid (for now) when disabling load permutations > is > > for (int i = 0; i < 512; ++i) > { > x[2*i] = y[1023 - (2*i)]; > x[2*i+1] = y[1023 - (2*i+1)]; > } > > (load permutation {1, 0}). Yeah, given we do not apply permutations after gathering the vector (because, with gathers, there's never any load permutation ... until now). > > Re-doing this after the above is a bit ugly. Likewise having pun_vectype. > > It's also an opportunity to move the VMAT_STRIDED_SLP punning and > > alignment (re-)computation code here. OTOH this is complicated enough > > already. > > I can try but I'm not sure it will be any less complex. > > > I believe you now have to change > > > > /* ??? The following checks should really be part of > > get_load_store_type. */ > > if (SLP_TREE_LOAD_PERMUTATION (slp_node).exists () > > && !((memory_access_type == VMAT_ELEMENTWISE > > || mat_gather_scatter_p (memory_access_type)) > > && SLP_TREE_LANES (slp_node) == 1 > > && (!grouped_load > > || !DR_GROUP_NEXT_ELEMENT (first_stmt_info)))) > > { > > > > as otherwise you'll get slp_perm set but the permute will be silently > > elided. This is all a bit ugly already :/ > > So we're setting slp_perm but know that we don't act on it in > VMAT_ELEMENTWISE? So for VMAT_ELEMENTWISE we _do_ apply a permutation (not sure that works though, too many guard rails around VMAT_ELEMENTWISE). But when we do not run into the above body we have slp_perm = false and skip applying a permute. The above tries to catch the old strided load case and the VMAT_ELEMENTWISE we fall back to when we cannot materialize the permute. I don't like very much how this is spread over multiple places ... Ideally VMAT_ELEMENTWISE would support all permutes by means of directly honoring it during vector construction, so it shouldn't matter whether vect_transform_slp_perm_load would succeed ot fail. But that's not done yet. I think I'd want to split VMAT_ELEMENTWISE from VMAT_STRIDED_SLP for this, but until I'll try I don't know whether I'll like that. Given we now have gathers that can have a permutation - like the following: > > > > Now I wonder if/how we handle > > > > for (i = 0; i < n; ++i) > > { > > int j = offset[i]; > > sum += data[2*i] + data[2*i+1]; > > } could be such one, it probably makes sense to handle load-permutation materialization after the gather operation. OTOH I'd like to get rid of load-permutations, but ... > > aka a gather grouped load. Maybe there's an opportunity to handle > > the "punning" higher up, representing this as a single-element group > > in the first place. Hmm. > > > > Anyway, I think the general direction of the patch is OK. You'll have to > > figure what I just broke though and I'm still somewhat missing the > > "beauty" moment when thinking of how the VMAT_STRIDED_SLP > > > > /* ??? Modify local copies of alignment_support_scheme and > > misalignment, but this part of analysis should be done > > earlier and remembered, likewise the chosen load mode. */ > > > > parts resolve themselves into happiness with this ... ;) > > I'm not sure I'm capable of producing a beauty moment in the vectorizer ;) Heh, I wasn't expecting you to produce beauty - it's not that I don't fail there as well :/ > but I'll try how ugly it gets when consolidating the type punning. Yep, I was just thinking that as it's similar (and we've come from doing the strided load in that code path to the gather one), we might put up something re-usable. Richard. > > -- > Regards > Robin >