I'd love a more general assume mechanism that other optimizations could use.
For example, alignment would simply be an available (x & mask) expression for
suitable passes to take advantage of.
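
Roughly what I have in mind (the assume() macro below is purely illustrative,
not an existing builtin; it is stubbed out so the sketch compiles):

  #include <stdint.h>

  /* Placeholder only: a real mechanism would feed the condition to the
     optimizer rather than discarding it. */
  #define assume(cond) ((void)0)

  void foo(double *a, int n) {
    /* General-purpose assume: the low 4 bits of the pointer are zero,
       i.e. a is 16-byte aligned. */
    assume(((uintptr_t)a & 15) == 0);
    for (int i = 0; i < n; ++i)
      a[i] += 1.0;
  }

Alignment-aware passes could then recognize the (x & mask) == 0 form, and
other optimizations could reuse the same mechanism for their own predicates.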

Sent from my iPad

On Nov 30, 2012, at 4:14 PM, Hal Finkel <[email protected]> wrote:

> Hi everyone,
> 
> Many compilers provide a way, through either pragmas or intrinsics, for the 
> user to assert stronger alignment requirements on a pointee than is otherwise 
> implied by the pointer's type. gcc now provides an intrinsic for this 
> purpose, __builtin_assume_aligned, and the attached patches (one for Clang 
> and one for LLVM) implement that intrinsic using a corresponding LLVM 
> intrinsic, and provide an infrastructure to take advantage of this new 
> information.
> 
> ** BEGIN justification -- skip this if you don't care ;) **
> First, let me provide some justification. It is currently possible in Clang, 
> using gcc-style (or C++11-style) attributes, to create typedefs with stronger 
> alignment requirements than the original type. This is a useful feature, but 
> it has shortcomings. First, for the purpose of allowing the compiler to 
> create vectorized code with aligned loads and stores, they are awkward to 
> use, and even more awkward to use correctly. For example, if I have as a base 
> case:
> void foo(double *a, double *b) {
>  for (int i = 0; i < N; ++i)
>    a[i] = b[i] + 1;
> }
> and I want to say that a and b are both 16-byte aligned, I can write instead:
> typedef double __attribute__((aligned(16))) double16;
> void foo(double16 *a, double16 *b) {
>  for (int i = 0; i < N; ++i)
>    a[i] = b[i] + 1;
> }
> and this might work; the loads and stores will be tagged as 16-byte aligned, 
> and we can vectorize the loop into, for example, a loop over <2 x double>. 
> The problem is that the code is now incorrect: it implies that *all* of the 
> loads and stores are 16-byte aligned, and this is not true. Only every other 
> one is 16-byte aligned. It is possible to correct this problem by manually 
> unrolling the loop by a factor of 2:
> void foo(double16 *a, double16 *b) {
>  for (int i = 0; i < N; i += 2) {
>    a[i] = b[i] + 1;
>    ((double *) a)[i+1] = ((double *) b)[i+1] + 1;
>  }
> }
> but this is awkward and error-prone.
> 
> With the intrinsic, this is easier:
> void foo(double *a, double *b) {
>  a = __builtin_assume_aligned(a, 16);
>  b = __builtin_assume_aligned(b, 16);
>  for (int i = 0; i < N; ++i)
>    a[i] = b[i] + 1;
> }
> This code can be vectorized with aligned loads and stores, and even if it is 
> not vectorized, it will remain correct.
> 
> The second problem with the purely type-based approach is that it requires 
> manual loop unrolling and inlining. Because the intrinsics are evaluated after 
> inlining (and after loop unrolling), the optimizer can use the alignment 
> assumptions specified in the caller when generating code for an inlined 
> callee. This is a very important capability.
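> 
> To make this concrete, here is a small sketch (the helper add1 is just for
> illustration):
> 
> static void add1(double *a, double *b, int n) {
>  for (int i = 0; i < n; ++i)
>    a[i] = b[i] + 1;
> }
> 
> void caller(double *a, double *b, int n) {
>  a = __builtin_assume_aligned(a, 16);
>  b = __builtin_assume_aligned(b, 16);
>  /* After add1 is inlined here, its loads and stores can be tagged with
>     the caller's alignment assumptions and vectorized accordingly. */
>  add1(a, b, n);
> }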
> 
> The need to apply the alignment assumptions after inlining and loop unrolling 
> necessitates placing most of the infrastructure for this into LLVM, with Clang 
> only generating the LLVM intrinsics. In addition, to take full advantage of 
> the information provided, it is necessary to look at loop-dependent pointer 
> offsets and strides; ScalarEvolution provides the appropriate framework for 
> doing this.
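> 
> For example (a sketch; the function is just illustrative):
> 
> void scale(double *a, int n) {
>  a = __builtin_assume_aligned(a, 16);
>  /* The address of a[2*i] is a plus 16*i bytes, so every access in this
>     loop is 16-byte aligned; proving that requires reasoning about the
>     loop-dependent stride, which is what ScalarEvolution gives us. */
>  for (int i = 0; i < n; ++i)
>    a[2*i] += 1;
> }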
> ** END justification **
> 
> Mirroring the gcc (and now Clang) intrinsic, the corresponding LLVM intrinsic 
> is:
> <t1>* @llvm.assume.aligned.p<s><t1>.<t2>(<t1>* addr, i32 alignment, <int t2> 
> offset)
> which asserts that the address returned is offset bytes above an address with 
> the specified alignment. The attached patch makes some simple changes to 
> several analysis passes (like BasicAA and SE) to allow them to 'look through' 
> the intrinsic. It also adds a transformation pass that propagates the 
> alignment assumptions to loads and stores directly dependent on the 
> intrinsic's return value. Once this is done, the intrinsics are removed so 
> that they don't interfere with the remaining optimizations.
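> 
> To illustrate the offset parameter on the source side (a sketch mirroring the
> three-argument form of gcc's __builtin_assume_aligned):
> 
> void bump(double *b, int n) {
>  /* Asserts that b + 1 is 8 bytes past a 16-byte boundary, e.g. because b
>     itself is 16-byte aligned. */
>  double *p = __builtin_assume_aligned(b + 1, 16, 8);
>  for (int i = 0; i < n; ++i)
>    p[i] += 1;
> }
> 
> In IR this would be lowered to a call to the intrinsic above with alignment 16
> and offset 8.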
> 
> The patches are attached. I've also uploaded these to llvm-reviews (this is 
> my first time trying this, so please let me know if I should do something 
> differently):
> Clang - http://llvm-reviews.chandlerc.com/D149
> LLVM - http://llvm-reviews.chandlerc.com/D150
> 
> Please review.
> 
> Nadav, one shortcoming of the current patch is that, while it will work to 
> vectorize loops using unroll+bb-vectorize, it will not automatically work 
> with the loop vectorizer. To really be effective, the transformation pass 
> needs to run after loop unrolling, and loop unrolling is (and should be) run 
> after loop vectorization. Even if run prior to loop vectorization, it would 
> not directly help the loop vectorizer because the necessary strided loads and 
> stores don't yet exist. As a second step, I think we should split the current 
> transformation pass into a transformation pass and an analysis pass. This 
> analysis pass can then be used by the loop vectorizer (and any other early 
> passes that want the information) before the final rewriting and intrinsic 
> deletion is done. 
> 
> Thanks again,
> Hal
> 
> -- 
> Hal Finkel
> Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory
> <asal-clang-20121130.patch>
> <asal-llvm-20121130.patch>
> _______________________________________________
> llvm-commits mailing list
> [email protected]
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits

_______________________________________________
cfe-commits mailing list
[email protected]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
