There is a 3rd option :-)
typedef double __attribute__((aligned(16))) double16;
void foo (double16 *a, double16 *b, int N) {
double *p = (double *)a;
double *q = (double *)b;
int i;
for (i = 0; i < N; ++i)
p[i] = q[i] + 1;
}
On 11/30/2012 04:14 PM, Hal Finkel wrote:
Hi everyone,
Many compilers provide a way, through either pragmas or intrinsics, for the
user to assert stronger alignment requirements on a pointee than is otherwise
implied by the pointer's type. gcc now provides an intrinsic for this purpose,
__builtin_assume_aligned, and the attached patches (one for Clang and one for
LLVM) implement that intrinsic using a corresponding LLVM intrinsic, and
provide an infrastructure to take advantage of this new information.
** BEGIN justification -- skip this if you don't care ;) **
First, let me provide some justification. It is currently possible in Clang,
using gcc-style (or C++11-style) attributes, to create typedefs with stronger
alignment requirements than the original type. This is a useful feature, but it
has shortcomings. First, for the purpose of allowing the compiler to create
vectorized code with aligned loads and stores, they are awkward to use, and
even more awkward to use correctly. For example, if I have as a base case:
void foo (double *a, double *b, int N) {
for (int i = 0; i < N; ++i)
a[i] = b[i] + 1;
}
and I want to say that a and b are both 16-byte aligned, I can write instead:
typedef double __attribute__((aligned(16))) double16;
void foo (double16 *a, double16 *b, int N) {
for (int i = 0; i < N; ++i)
a[i] = b[i] + 1;
}
and this might work; the loads and stores will be tagged as 16-byte aligned, and we
can vectorize the loop into, for example, a loop over <2 x double>. The problem
is that the code is now incorrect: it implies that *all* of the loads and stores are
16-byte aligned, and this is not true. Only every other one is 16-byte aligned. It is
possible to correct this problem by manually unrolling the loop by a factor of 2:
void foo (double16 *a, double16 *b, int N) {
for (int i = 0; i < N; i += 2) {
a[i] = b[i] + 1;
((double *) a)[i+1] = ((double *) b)[i+1] + 1;
}
}
but this is awkward and error-prone.
With the intrinsic, this is easier:
void foo (double *a, double *b, int N) {
a = __builtin_assume_aligned(a, 16);
b = __builtin_assume_aligned(b, 16);
for (int i = 0; i < N; ++i)
a[i] = b[i] + 1;
}
this code can be vectorized with aligned loads and stores, and even if it is
not vectorized, will remain correct.
The second problem with the purely type-based approach is that it requires
manual loop unrolling and inlining. Because the intrinsics are evaluated after
inlining (and after loop unrolling), the optimizer can use the alignment
assumptions specified in the caller when generating code for an inlined callee.
This is a very important capability.
The need to apply the alignment assumptions after inlining and loop unrolling
necessitates placing most of the infrastructure for this into LLVM, with Clang
only generating LLVM intrinsics. In addition, to take full advantage of the
information provided, it is necessary to look at loop-dependent pointer offsets
and strides; ScalarEvolution provides the appropriate framework for doing this.
** END justification **
Mirroring the gcc (and now Clang) intrinsic, the corresponding LLVM intrinsic
is:
<t1>* @llvm.assume.aligned.p<s><t1>.<t2>(<t1>* addr, i32 alignment, <int t2>
offset)
which asserts that the address returned is offset bytes above an address with
the specified alignment. The attached patch makes some simple changes to
several analysis passes (like BasicAA and SE) to allow them to 'look through'
the intrinsic. It also adds a transformation pass that propagates the alignment
assumptions to loads and stores directly dependent on the intrinsic's return
value. Once this is done, the intrinsics are removed so that they don't
interfere with the remaining optimizations.
The patches are attached. I've also uploaded these to llvm-reviews (this is my
first time trying this, so please let me know if I should do something
differently):
Clang - http://llvm-reviews.chandlerc.com/D149
LLVM - http://llvm-reviews.chandlerc.com/D150
Please review.
Nadav: one shortcoming of the current patch is that, while it will work to
vectorize loops using unroll+bb-vectorize, it will not automatically work with
the loop vectorizer. To really be effective, the transformation pass needs to
run after loop unrolling; and loop unrolling is (and should be) run after loop
vectorization. Even if run prior to loop vectorization, it would not directly
help the loop vectorizer because the necessary strided loads and stores don't
yet exist. As a second step, I think we should split the current transformation
pass into a transformation pass and an analysis pass. This analysis pass can
then be used by the loop vectorizer (and any other early passes that want the
information) before the final rewriting and intrinsic deletion is done.
Thanks again,
Hal
_______________________________________________
llvm-commits mailing list
[email protected]
http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
_______________________________________________
cfe-commits mailing list
[email protected]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits