I'd love a more general assume mechanism that other optimizations could use; alignment, for example, would simply be an available (x & mask) expression for suitable passes to take advantage of.
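Roughly what I'm imagining, sketched as hypothetical IR (no such generic intrinsic exists in the tree today, and the name and form here are made up): a single boolean assume that passes can pattern-match, with alignment expressed as an ordinary (x & mask) == 0 condition.

  ; hypothetical generic assume: assert an arbitrary i1 condition
  declare void @llvm.assume(i1)

  define void @foo(double* %a) {
  entry:
    ; "a is 16-byte aligned" written as a plain (x & mask) == 0 test
    %a.int = ptrtoint double* %a to i64
    %a.low = and i64 %a.int, 15
    %a.ok  = icmp eq i64 %a.low, 0
    call void @llvm.assume(i1 %a.ok)
    ; alignment-aware passes could pattern-match the assume above and
    ; treat loads/stores through %a as align 16
    ret void
  }

Nothing in this is alignment-specific, which is the appeal.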
Sent from my iPad

On Nov 30, 2012, at 4:14 PM, Hal Finkel <[email protected]> wrote:

> Hi everyone,
>
> Many compilers provide a way, through either pragmas or intrinsics, for the
> user to assert stronger alignment requirements on a pointee than is otherwise
> implied by the pointer's type. gcc now provides an intrinsic for this
> purpose, __builtin_assume_aligned, and the attached patches (one for Clang
> and one for LLVM) implement that intrinsic using a corresponding LLVM
> intrinsic, and provide an infrastructure to take advantage of this new
> information.
>
> ** BEGIN justification -- skip this if you don't care ;) **
> First, let me provide some justification. It is currently possible in Clang,
> using gcc-style (or C++11-style) attributes, to create typedefs with stronger
> alignment requirements than the original type. This is a useful feature, but
> it has shortcomings. First, for the purpose of allowing the compiler to
> create vectorized code with aligned loads and stores, they are awkward to
> use, and even more awkward to use correctly. For example, if I have as a base
> case:
>
>   void foo (double *a, double *b) {
>     for (int i = 0; i < N; ++i)
>       a[i] = b[i] + 1;
>   }
>
> and I want to say that a and b are both 16-byte aligned, I can write instead:
>
>   typedef double __attribute__((aligned(16))) double16;
>   void foo (double16 *a, double16 *b) {
>     for (int i = 0; i < N; ++i)
>       a[i] = b[i] + 1;
>   }
>
> and this might work; the loads and stores will be tagged as 16-byte aligned,
> and we can vectorize the loop into, for example, a loop over <2 x double>.
> The problem is that the code is now incorrect: it implies that *all* of the
> loads and stores are 16-byte aligned, and this is not true. Only every other
> one is 16-byte aligned. It is possible to correct this problem by manually
> unrolling the loop by a factor of 2:
>
>   void foo (double16 *a, double16 *b) {
>     for (int i = 0; i < N; i += 2) {
>       a[i] = b[i] + 1;
>       ((double *) a)[i+1] = ((double *) b)[i+1] + 1;
>     }
>   }
>
> but this is awkward and error-prone.
>
> With the intrinsic, this is easier:
>
>   void foo (double *a, double *b) {
>     a = __builtin_assume_aligned(a, 16);
>     b = __builtin_assume_aligned(b, 16);
>     for (int i = 0; i < N; ++i)
>       a[i] = b[i] + 1;
>   }
>
> This code can be vectorized with aligned loads and stores, and even if it is
> not vectorized, will remain correct.
>
> The second problem with the purely type-based approach is that it requires
> manual loop unrolling and inlining. Because the intrinsics are evaluated
> after inlining (and after loop unrolling), the optimizer can use the
> alignment assumptions specified in the caller when generating code for an
> inlined callee. This is a very important capability.
>
> The need to apply the alignment assumptions after inlining and loop unrolling
> necessitates placing most of the infrastructure for this into LLVM, with
> Clang only generating LLVM intrinsics. In addition, to take full advantage of
> the information provided, it is necessary to look at loop-dependent pointer
> offsets and strides; ScalarEvolution provides the appropriate framework for
> doing this.
> ** END justification **
>
> Mirroring the gcc (and now Clang) intrinsic, the corresponding LLVM intrinsic
> is:
>
>   <t1>* @llvm.assume.aligned.p<s><t1>.<t2>(<t1>* addr, i32 alignment, <int t2> offset)
>
> which asserts that the address returned is offset bytes above an address with
> the specified alignment.
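If I'm reading that declaration right, the __builtin_assume_aligned example above would presumably come into LLVM as something like the following (just my sketch, not taken from the patches; the exact mangled name and the i64 offset type are guesses):

  define void @foo(double* %a, double* %b) {
  entry:
    ; each __builtin_assume_aligned(p, 16) becomes a call to the proposed
    ; intrinsic; the returned pointer is the one carrying the assumption
    %a.aligned = call double* @llvm.assume.aligned.p0f64.i64(double* %a, i32 16, i64 0)
    %b.aligned = call double* @llvm.assume.aligned.p0f64.i64(double* %b, i32 16, i64 0)
    ; ... the loop's loads and stores use %a.aligned and %b.aligned, so the
    ; transformation pass can raise their alignment to 16 ...
    ret void
  }

  declare double* @llvm.assume.aligned.p0f64.i64(double*, i32, i64)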
> The attached patch makes some simple changes to several analysis passes
> (like BasicAA and SE) to allow them to 'look through' the intrinsic. It also
> adds a transformation pass that propagates the alignment assumptions to
> loads and stores directly dependent on the intrinsic's return value. Once
> this is done, the intrinsics are removed so that they don't interfere with
> the remaining optimizations.
>
> The patches are attached. I've also uploaded these to llvm-reviews (this is
> my first time trying this, so please let me know if I should do something
> differently):
>
> Clang - http://llvm-reviews.chandlerc.com/D149
> LLVM - http://llvm-reviews.chandlerc.com/D150
>
> Please review.
>
> Nadav,
>
> One shortcoming of the current patch is that, while it will work to
> vectorize loops using unroll+bb-vectorize, it will not automatically work
> with the loop vectorizer. To really be effective, the transformation pass
> needs to run after loop unrolling, and loop unrolling is (and should be) run
> after loop vectorization. Even if run prior to loop vectorization, it would
> not directly help the loop vectorizer because the necessary strided loads
> and stores don't yet exist. As a second step, I think we should split the
> current transformation pass into a transformation pass and an analysis pass.
> This analysis pass can then be used by the loop vectorizer (and any other
> early passes that want the information) before the final rewriting and
> intrinsic deletion is done.
>
> Thanks again,
> Hal
>
> --
> Hal Finkel
> Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory
>
> <asal-clang-20121130.patch>
> <asal-llvm-20121130.patch>
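And to check that I follow the intended effect of the transformation pass: roughly, IR like the first snippet should become the second, with the alignment raised on the dependent accesses and the intrinsic then dropped. This is only my sketch of the description above, not code or output from the patches, and the mangled intrinsic name is a guess.

  ; before the pass: accesses only carry the type-based align 8
  %a1 = call double* @llvm.assume.aligned.p0f64.i64(double* %a, i32 16, i64 0)
  %v  = load double* %a1, align 8
  store double %v, double* %a1, align 8

  ; after the pass: the intrinsic is gone and the assumption is applied directly
  %v  = load double* %a, align 16
  store double %v, double* %a, align 16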
