There is a 3rd option :-)

typedef double __attribute__((aligned(16))) double16;
void foo (double16 *a, double16 *b, int N) {
  /* The double16* parameter types assert that the pointers themselves are
     16-byte aligned; the accesses go through plain double*, so the
     individual loads and stores keep their natural 8-byte alignment. */
  double *p = (double *)a;
  double *q = (double *)b;
  for (int i = 0; i < N; ++i)
    p[i] = q[i] + 1;
}
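
For completeness, a minimal sketch of how this variant might be driven from a
caller; the main driver, the aligned_alloc calls, and the array size are my
additions, not something from the patches under review:

#include <stdlib.h>

typedef double __attribute__((aligned(16))) double16;

/* Prototype for the foo defined above. */
void foo (double16 *a, double16 *b, int N);

int main (void) {
  enum { N = 1024 };
  /* aligned_alloc (C11) provides the 16-byte alignment that the
     double16 parameter types promise to the compiler. */
  double16 *a = aligned_alloc(16, N * sizeof(double));
  double16 *b = aligned_alloc(16, N * sizeof(double));
  if (!a || !b)
    return 1;
  for (int i = 0; i < N; ++i)
    ((double *)b)[i] = i;  /* plain double accesses, no 16-byte claim */
  foo(a, b, N);
  free(a);
  free(b);
  return 0;
}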

On 11/30/2012 04:14 PM, Hal Finkel wrote:
Hi everyone,

Many compilers provide a way, through either pragmas or intrinsics, for the 
user to assert stronger alignment requirements on a pointee than is otherwise 
implied by the pointer's type. gcc now provides an intrinsic for this purpose, 
__builtin_assume_aligned, and the attached patches (one for Clang and one for 
LLVM) implement that intrinsic via a corresponding LLVM intrinsic and provide 
the infrastructure to take advantage of this new information.

** BEGIN justification -- skip this if you don't care ;) **
Let me first provide some justification. It is currently possible in Clang, 
using gcc-style (or C++11-style) attributes, to create typedefs with stronger 
alignment requirements than the original type. This is a useful feature, but it 
has shortcomings. First, for the purpose of letting the compiler create 
vectorized code with aligned loads and stores, such typedefs are awkward to 
use, and even more awkward to use correctly. For example, if I have as a base case:
void foo (double *a, double *b, int N) {
   for (int i = 0; i < N; ++i)
     a[i] = b[i] + 1;
}
and I want to say that a and b are both 16-byte aligned, I can write instead:
typedef double __attribute__((aligned(16))) double16;
void foo (double16 *a, double16 *b, int N) {
   for (int i = 0; i < N; ++i)
     a[i] = b[i] + 1;
}
and this might work; the loads and stores will be tagged as 16-byte aligned, and we 
can vectorize the loop into, for example, a loop over <2 x double>. The problem 
is that the code is now incorrect: it implies that *all* of the loads and stores are 
16-byte aligned, and this is not true. Only every other one is 16-byte aligned 
(since sizeof(double) is 8, the odd-indexed elements are only 8-byte aligned). It is 
possible to correct this problem by manually unrolling the loop by a factor of 2:
void foo (double16 *a, double16 *b, int N) {
   for (int i = 0; i < N; i += 2) {
     a[i] = b[i] + 1;
     ((double *) a)[i+1] = ((double *) b)[i+1] + 1;
   }
}
but this is awkward and error-prone.

With the intrinsic, this is easier:
void foo (double *a, double *b, int N) {
   a = __builtin_assume_aligned(a, 16);
   b = __builtin_assume_aligned(b, 16);
   for (int i = 0; i < N; ++i)
     a[i] = b[i] + 1;
}
This code can be vectorized with aligned loads and stores, and even if it is 
not vectorized, it will remain correct.
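
Incidentally, the gcc builtin also has a three-argument form that carries a 
misalignment offset, which appears to correspond to the offset operand of the 
LLVM intrinsic described below; a minimal sketch of that form (foo_odd is just 
an illustrative name of mine):

void foo_odd (double *a, int N) {
   /* The three-argument form asserts that (char *)a - 8 is 16-byte aligned,
      i.e. a points 8 bytes past a 16-byte boundary (for example &base[1]
      for a 16-byte-aligned base).  The builtin returns its first argument. */
   a = __builtin_assume_aligned(a, 16, 8);
   for (int i = 0; i < N; ++i)
     a[i] = a[i] + 1;
}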

The second problem with the purely type-based approach is that it requires 
manual loop unrolling and inlining. Because the intrinsics are evaluated after 
inlining (and after loop unrolling), the optimizer can use the alignment 
assumptions specified in the caller when generating code for an inlined callee. 
This is a very important capability.
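
To make that concrete, here is a small sketch of my own (add1 and caller are 
illustrative names, not something from the patches): after add1 is inlined into 
caller, its loads and stores are derived from the pointers on which the 
assumption was made, so the optimizer can treat them as 16-byte aligned.

static inline void add1 (double *p, double *q, int n) {
   /* A generic helper with no alignment information of its own. */
   for (int i = 0; i < n; ++i)
     p[i] = q[i] + 1;
}

void caller (double *a, double *b, int n) {
   a = __builtin_assume_aligned(a, 16);
   b = __builtin_assume_aligned(b, 16);
   add1(a, b, n);  /* alignment assumptions apply here once add1 is inlined */
}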

The need to apply the alignment assumptions after inlining and loop unrolling 
necessitates placing most of the infrastructure for this into LLVM, with Clang 
only generating the LLVM intrinsics. In addition, to take full advantage of the 
information provided, it is necessary to look at loop-dependent pointer offsets 
and strides; ScalarEvolution provides the appropriate framework for doing this.
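
As an illustration of the loop-dependent reasoning involved (my own example, 
not taken from the patches): with a 16-byte-aligned base pointer and 8-byte 
elements, a stride-2 access pattern keeps every even-indexed access on a 
16-byte boundary, which is exactly the kind of fact a ScalarEvolution-based 
analysis can establish.

void scale_even (double *a, int N) {
   a = __builtin_assume_aligned(a, 16);
   /* The address of a[2*i] is a plus 16*i bytes, so every access in this
      loop lands on a 16-byte boundary; a[2*i + 1], by contrast, is always
      exactly 8 bytes past one. */
   for (int i = 0; i < N/2; ++i)
     a[2*i] = a[2*i] * 2.0;
}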
** END justification **

Mirroring the gcc (and now Clang) intrinsic, the corresponding LLVM intrinsic 
is:
<t1>* @llvm.assume.aligned.p<s><t1>.<t2>(<t1>* addr, i32 alignment, <int t2> offset)
which asserts that the address returned is offset bytes above an address with 
the specified alignment. The attached patch makes some simple changes to 
several analysis passes (like BasicAA and SE) to allow them to 'look through' 
the intrinsic. It also adds a transformation pass that propagates the alignment 
assumptions to loads and stores directly dependent on the intrinsic's return 
value. Once this is done, the intrinsics are removed so that they don't 
interfere with the remaining optimizations.

The patches are attached. I've also uploaded these to llvm-reviews (this is my 
first time trying this, so please let me know if I should do something 
differently):
Clang - http://llvm-reviews.chandlerc.com/D149
LLVM - http://llvm-reviews.chandlerc.com/D150

Please review.

Nadav, one shortcoming of the current patch is that, while it will work to 
vectorize loops using unroll+bb-vectorize, it will not automatically work with 
the loop vectorizer. To really be effective, the transformation pass needs to 
run after loop unrolling, and loop unrolling is (and should be) run after loop 
vectorization. Even if run prior to loop vectorization, it would not directly 
help the loop vectorizer because the necessary strided loads and stores don't 
yet exist. As a second step, I think we should split the current transformation 
pass into a transformation pass and an analysis pass. This analysis pass can 
then be used by the loop vectorizer (and any other early passes that want the 
information) before the final rewriting and intrinsic deletion is done.

Thanks again,
Hal


