https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123225
--- Comment #6 from Victor Do Nascimento <victorldn at gcc dot gnu.org> ---
Thanks for the feedback, both in terms of code examples and observations regarding the prologue peeling expense. Also, sorry for the slow turnaround time.

After the holidays, I've been ramping up on the code for the loop costing. I figured the easiest way (though I've yet to convince myself it's the right way) to tweak which uncounted loops we accept for vectorization is to replicate what we do for loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP, where we check min_profitable_estimate against some constant, e.g. vect_vf_for_cost (loop_vinfo).

Even just reusing vect_vf_for_cost (loop_vinfo), as for VECT_COST_MODEL_VERY_CHEAP, as the cutoff in the uncounted loop criterion allows us to recover 86% of the increase in code-size for 523.xalancbmk_r and most of the performance degradation we observe on AArch64 (though admittedly the performance loss is considerably smaller for AArch64 than it is for x86_64).

I'll try other cutoff values (Richi mentioned the vector loop being less than 2x as expensive as a single scalar iteration, while I had thought of half of vect_vf_for_cost) and report back, though equally any feedback on my as yet rudimentary approach to the problem is most welcome.
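To make the criterion concrete, here is a minimal standalone sketch of the check described above. It is not the actual patch and does not use GCC's internal types: struct loop_sketch, uncounted_loop_ok_p and the two cutoff helpers are hypothetical names invented for illustration. The direction of the comparison (reject when min_profitable_estimate exceeds the cutoff) is assumed to mirror the existing VECT_COST_MODEL_VERY_CHEAP check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few loop_vinfo properties that matter
   here; the real vectorizer keeps these in its own data structures.  */
struct loop_sketch
{
  bool uncounted;               /* No computable iteration count.  */
  int min_profitable_estimate;  /* Iterations needed for the vector loop to win.  */
  int vf_for_cost;              /* Models vect_vf_for_cost (loop_vinfo).  */
};

/* Return true if the loop passes the proposed filter.  Counted loops are
   left alone; uncounted loops are accepted only if the profitability
   estimate does not exceed CUTOFF (assumed comparison direction).  */
static bool
uncounted_loop_ok_p (const struct loop_sketch *l, int cutoff)
{
  assert (cutoff > 0);
  if (!l->uncounted)
    return true;
  return l->min_profitable_estimate <= cutoff;
}

/* Cutoff used in the experiment above: the full vectorization factor.  */
static int
cutoff_full_vf (const struct loop_sketch *l)
{
  return l->vf_for_cost;
}

/* Stricter candidate: half the vectorization factor, clamped to at least 1.  */
static int
cutoff_half_vf (const struct loop_sketch *l)
{
  int half = l->vf_for_cost / 2;
  return half > 0 ? half : 1;
}
```

With a vectorization factor of 4, an uncounted loop whose estimate is 3 passes the full-VF cutoff but fails the half-VF one, which is the sort of difference the alternative cutoff values are meant to probe.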
