https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120383
Bug ID: 120383
Summary: Improving early break unrolled sequences with Adv. SIMD
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Blocks: 53947, 115130
Target Milestone: ---
Target: aarch64*

Today if we unroll an early break loop such as:

#define N 640
long long a[N] = {};
long long b[N] = {};

int f1 (long long c)
{
  for (int i = 0; i < N; i++)
    {
      if (a[i] == c)
        return 1;
    }
  return 0;
}

we generate an ORR reduction followed by a compression sequence when using
Adv. SIMD:

        ldp     q31, q30, [x1], 32
        cmeq    v31.2d, v31.2d, v27.2d
        cmeq    v30.2d, v30.2d, v27.2d
        orr     v31.16b, v31.16b, v30.16b
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x4, d31
        cbz     x4, .L2
        fmov    w1, s29

However the dependency chain here is long enough that it removes the vector
profitability.  This sequence can however be replaced by:

        ldp     q31, q30, [x0], 32
        cmeq    v31.2d, v31.2d, v29.2d
        cmeq    v30.2d, v30.2d, v29.2d
        addhn   v31.2s, v31.2d, v30.2d
        fmov    x2, d31
        cbz     x2, .L15

with ADDHN replacing the ORR reduction and the UMAXP compression.  When using
3 compare statements we can use nested ADDHNs, and with 4 we can create
2 ORRs + an ADDHN.

The AArch64 ADDHN (Add returning High Narrow) instruction adds two vectors of
the same element size, then narrows the result by keeping only the high half
of each element.

This is hard to do in RTL, as you wouldn't be able to match this long sequence
with combine, and the intermediate combinations aren't valid.  For instance,
the transformation is only valid when the vector mask values are 0 and -1, so
we would need to know that the values in the registers are vector mask values.
An RTL folder won't work either, as it won't let us get the nested variant.
Which leaves a GIMPLE folder or support in the vectorizer for this.

By far the simplest version is using the vectorizer, as it knows about mask
types (e.g. VECTOR_BOOLEAN_P (..)), it knows the precision of the mask type,
and it is the one generating the sequence, so it can choose how to do the
reductions.

However, for this to work we have to introduce an optab for ADDHN.  Open
coding the sequence doesn't work, as we have no way to describe that the
addition is done at a higher precision.  With the optab the final codegen can
generate a scalar cbranch rather than a vector one, and gets the result we
want.  This makes unrolling an early break loop much more profitable on
AArch64.

Having looked into the options and described the limitations above: are you
ok with a new optab for addhn, Richi?

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115130
[Bug 115130] [meta-bug] early break vectorization