On Fri, Nov 30, 2012 at 6:44 PM, Alex Rosenberg <[email protected]> wrote:
> I'd love a more general assume mechanism that other optimizations could use.
> e.g. alignment would simply be an available (x & mask) expression for the
> suitable passes to take advantage of.
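For concreteness, here is a sketch of the kind of thing Alex describes (not
from his message or the attached patches; the function name is invented): the
alignment fact can be written as an ordinary (x & mask) test whose failing
branch is marked unreachable, so any pass that understands the control flow
could pick it up. Whether LLVM's passes actually exploit this today is exactly
the gap a first-class assume mechanism would close.

    #include <stdint.h>

    /* Expose the alignment of 'a' as an explicit (x & mask) predicate. */
    void use_aligned(double *a, int n) {
        if (((uintptr_t)a & 15) != 0)
            __builtin_unreachable();   /* gcc/clang builtin: this branch never runs */
        /* From here on, the optimizer may treat 'a' as 16-byte aligned. */
        for (int i = 0; i < n; ++i)
            a[i] += 1.0;
    }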
http://llvm.org/PR810 for some history/context/support for this kind of thing

> Sent from my iPad
>
> On Nov 30, 2012, at 4:14 PM, Hal Finkel <[email protected]> wrote:
>
>> Hi everyone,
>>
>> Many compilers provide a way, through either pragmas or intrinsics, for
>> the user to assert stronger alignment requirements on a pointee than is
>> otherwise implied by the pointer's type. gcc now provides an intrinsic
>> for this purpose, __builtin_assume_aligned, and the attached patches (one
>> for Clang and one for LLVM) implement that intrinsic using a
>> corresponding LLVM intrinsic, and provide an infrastructure to take
>> advantage of this new information.
>>
>> ** BEGIN justification -- skip this if you don't care ;) **
>> First, let me provide some justification. It is currently possible in
>> Clang, using gcc-style (or C++11-style) attributes, to create typedefs
>> with stronger alignment requirements than the original type. This is a
>> useful feature, but it has shortcomings. First, for the purpose of
>> allowing the compiler to create vectorized code with aligned loads and
>> stores, these typedefs are awkward to use, and even more awkward to use
>> correctly. For example, if I have as a base case:
>>
>>   void foo(double *a, double *b) {
>>     for (int i = 0; i < N; ++i)
>>       a[i] = b[i] + 1;
>>   }
>>
>> and I want to say that a and b are both 16-byte aligned, I can write
>> instead:
>>
>>   typedef double __attribute__((aligned(16))) double16;
>>   void foo(double16 *a, double16 *b) {
>>     for (int i = 0; i < N; ++i)
>>       a[i] = b[i] + 1;
>>   }
>>
>> and this might work; the loads and stores will be tagged as 16-byte
>> aligned, and we can vectorize the loop into, for example, a loop over
>> <2 x double>. The problem is that the code is now incorrect: it implies
>> that *all* of the loads and stores are 16-byte aligned, and this is not
>> true. Only every other one is 16-byte aligned. It is possible to correct
>> this problem by manually unrolling the loop by a factor of 2:
>>
>>   void foo(double16 *a, double16 *b) {
>>     for (int i = 0; i < N; i += 2) {
>>       a[i] = b[i] + 1;
>>       ((double *) a)[i+1] = ((double *) b)[i+1] + 1;
>>     }
>>   }
>>
>> but this is awkward and error-prone.
>>
>> With the intrinsic, this is easier:
>>
>>   void foo(double *a, double *b) {
>>     a = __builtin_assume_aligned(a, 16);
>>     b = __builtin_assume_aligned(b, 16);
>>     for (int i = 0; i < N; ++i)
>>       a[i] = b[i] + 1;
>>   }
>>
>> This code can be vectorized with aligned loads and stores, and even if
>> it is not vectorized, it will remain correct.
>>
>> The second problem with the purely type-based approach is that it
>> requires manual loop unrolling and inlining. Because the intrinsics are
>> evaluated after inlining (and after loop unrolling), the optimizer can
>> use the alignment assumptions specified in the caller when generating
>> code for an inlined callee. This is a very important capability.
>>
>> The need to apply the alignment assumptions after inlining and loop
>> unrolling necessitates placing most of the infrastructure for this into
>> LLVM, with Clang only generating LLVM intrinsics. In addition, to take
>> full advantage of the information provided, it is necessary to look at
>> loop-dependent pointer offsets and strides; ScalarEvolution provides the
>> appropriate framework for doing this.
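To make the inlining point concrete, here is a small sketch (not from the
patch; the function names are invented): the assumption is stated once in the
caller, and after inlining it applies to the callee's loads and stores.

    static inline void scale(double *p, int n) {
        for (int i = 0; i < n; ++i)
            p[i] *= 2.0;
    }

    void caller(double *a, int n) {
        a = __builtin_assume_aligned(a, 16);  /* assumption stated in the caller */
        scale(a, n);                          /* once scale() is inlined here, its
                                                 accesses can use the 16-byte
                                                 assumption, e.g. for vectorization */
    }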
>> ** END justification **
>>
>> Mirroring the gcc (and now Clang) intrinsic, the corresponding LLVM
>> intrinsic is:
>>
>>   <t1>* @llvm.assume.aligned.p<s><t1>.<t2>(<t1>* addr, i32 alignment, <int t2> offset)
>>
>> which asserts that the address returned is offset bytes above an address
>> with the specified alignment. The attached patch makes some simple
>> changes to several analysis passes (like BasicAA and SE) to allow them
>> to 'look through' the intrinsic. It also adds a transformation pass that
>> propagates the alignment assumptions to loads and stores directly
>> dependent on the intrinsic's return value. Once this is done, the
>> intrinsics are removed so that they don't interfere with the remaining
>> optimizations.
>>
>> The patches are attached. I've also uploaded these to llvm-reviews (this
>> is my first time trying this, so please let me know if I should do
>> something differently):
>>
>>   Clang - http://llvm-reviews.chandlerc.com/D149
>>   LLVM - http://llvm-reviews.chandlerc.com/D150
>>
>> Please review.
>>
>> Nadav, one shortcoming of the current patch is that, while it will work
>> to vectorize loops using unroll+bb-vectorize, it will not automatically
>> work with the loop vectorizer. To really be effective, the
>> transformation pass needs to run after loop unrolling, and loop
>> unrolling is (and should be) run after loop vectorization. Even if run
>> prior to loop vectorization, it would not directly help the loop
>> vectorizer because the necessary strided loads and stores don't yet
>> exist. As a second step, I think we should split the current
>> transformation pass into a transformation pass and an analysis pass.
>> This analysis pass can then be used by the loop vectorizer (and any
>> other early passes that want the information) before the final
>> rewriting and intrinsic deletion is done.
>>
>> Thanks again,
>> Hal
>>
>> --
>> Hal Finkel
>> Postdoctoral Appointee
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>> <asal-clang-20121130.patch>
>> <asal-llvm-20121130.patch>
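Given the intrinsic signature Hal describes above, the generated IR might look
roughly like the following. This is only a sketch: the mangled intrinsic name
and the i64 offset type are guesses based on that signature, not taken from
the patch itself.

    ; Assert that %a is 16-byte aligned (offset 0), then load through the result.
    define double @first(double* %a) {
    entry:
      %aa = call double* @llvm.assume.aligned.p0f64.i64(double* %a, i32 16, i64 0)
      %v = load double* %aa, align 8
      ; the propagation pass described above could raise this to 'align 16'
      ; before the intrinsic calls are deleted
      ret double %v
    }

    declare double* @llvm.assume.aligned.p0f64.i64(double*, i32, i64)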
