I wrote:
> I experimented with a few different ideas such as adding restrict
> decoration to the pointers, and eventually found that what works
> is to write the loop termination condition as "i2 < limit"
> rather than "i2 <= limit".  It took me a long time to think of
> trying that, because it seemed ridiculously stupid.  But it works.
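To make the quoted change concrete, here is a minimal sketch of the
pattern in question.  The names (add_digits, dig, var_digits) are
illustrative stand-ins, not the actual numeric.c code; the point is
only the bound rewrite from "i2 <= limit" to "i2 < limit" that let
the compilers' auto-vectorizers handle the loop:

```c
#include <stdint.h>

/*
 * Hypothetical digit-accumulation loop modeled on the pattern discussed
 * above.  With the old bound (i2 <= limit) gcc and clang declined to
 * vectorize; with a strict less-than bound they do.
 */
static void
add_digits(int32_t *dig, const int16_t *var_digits,
		   int multiplier, int i2_start, int limit)
{
	/* old form: for (int i2 = i2_start; i2 <= limit; i2++) -- not vectorized */
	for (int i2 = i2_start; i2 < limit; i2++)
		dig[i2] += multiplier * var_digits[i2];
}
```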
I've done more testing and confirmed that both gcc and clang can
vectorize the improved loop on aarch64 as well as x86_64.  (clang's
results can be confusing because -ftree-vectorize doesn't seem to have
any effect: its vectorizer is on by default.  But if you use
-fno-vectorize it'll go back to the old, slower code.)

The only buildfarm effect I've noticed is that locust and prairiedog,
which are using nearly the same ancient gcc version, complain

c1: warning: -ftree-vectorize enables strict aliasing. -fno-strict-aliasing is ignored when Auto Vectorization is used.

which is expected (they say the same for checksum.c), but then there
are a bunch of

warning: dereferencing type-punned pointer will break strict-aliasing rules

which seems worrisome.  (This sort of thing is the reason I'm hesitant
to apply higher optimization levels across the board.)  Both animals
pass the regression tests anyway, but if any other compilers treat
-ftree-vectorize as an excuse to apply stricter optimization
assumptions, we could be in for trouble.

I looked closer and saw that all of those warnings are about
init_var(), and this change makes them go away:

-#define init_var(v)	MemSetAligned(v, 0, sizeof(NumericVar))
+#define init_var(v)	memset(v, 0, sizeof(NumericVar))

I'm a little inclined to commit that as future-proofing.  It's
essentially reversing out a micro-optimization I made in d72f6c750.
I doubt I had hard evidence that it made any noticeable difference;
and even if it did back then, modern compilers probably prefer the
memset approach.

			regards, tom lane
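For anyone following along, a simplified sketch of why the warning
fires (this is not the actual MemSetAligned macro from c.h, and
NumericVarSketch only approximates NumericVar's layout): the
word-at-a-time zeroing stores through a long * into an object of a
different declared type, which a strict-aliasing compiler is entitled
to flag, whereas plain memset is aliasing-safe by definition:

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for NumericVar; fields chosen so sizeof is a multiple of
 * sizeof(long), as the word-wise loop below requires. */
typedef struct
{
	int			ndigits;
	int			weight;
	int			sign;
	int			dscale;
	int16_t	   *buf;
	int16_t	   *digits;
} NumericVarSketch;

/* MemSetAligned-style zeroing: stores through a type-punned long *,
 * which is what draws "dereferencing type-punned pointer" warnings. */
static void
zero_var_punned(NumericVarSketch *v)
{
	long	   *p = (long *) v;
	long	   *end = (long *) ((char *) v + sizeof(*v));

	while (p < end)
		*p++ = 0;
}

/* The replacement: memset may access any object representation, so
 * strict aliasing is not in play, and modern compilers inline it. */
static void
zero_var_safe(NumericVarSketch *v)
{
	memset(v, 0, sizeof(*v));
}
```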