----- Original Message -----
> From: "Erik Schwiebert" <[email protected]>
> To: "Hal Finkel" <[email protected]>
> Cc: "llvm cfe" <[email protected]>
> Sent: Saturday, December 1, 2012 6:40:38 PM
> Subject: Re: [cfe-commits] [llvm-commits] [PATCH] __builtin_assume_aligned for Clang and LLVM
>
> I'm not very familiar with gcc's practices; I know that clang
> currently allows using the 'aligned' attribute to weaken the
> alignment of the base type of a pointer:
>
> typedef __attribute__((aligned(1))) double weakly_aligned_double;
> weakly_aligned_double *wd;
> *wd = 1.0; // will not use asm instructions that require 8-byte alignment
>
> Microsoft is (sadly, in my personal opinion) using this to support
> legacy file formats where data is packed very tightly and the code
> written for that era assumes it can weaken pointers safely. It's
> probably a good thing not to add new ways to violate the language
> specs. :) I would definitely prefer a warning rather than silently
> doing nothing if the code requests an alignment weaker than the
> native type requires.
>
> It would be good to check against the alignment of structures being
> passed in as well:
>
> typedef struct
> {
>     double d;
>     int i;
> } MyStruct;
>
> MyStruct *m;
> m = __builtin_assume_aligned(m, 4); // reject because the struct needs
>                                     // to be aligned 8 for the internal double
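(As an aside, the natural alignment such a check would compare against
is easy to query; a minimal sketch, using gcc/clang's __alignof__,
which C11 spells _Alignof, where the exact value is target-dependent:

#include <stdio.h>

typedef struct
{
    double d;
    int i;
} MyStruct;

int main(void)
{
    /* The struct inherits its strictest member's alignment; on a
       typical 64-bit target this prints 8, from the double member. */
    printf("%lu\n", (unsigned long) __alignof__(MyStruct));
    return 0;
}

A warning or error could presumably be keyed off the same value.)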
I'm not sure whether or not this should be an error. One could "get
around" the error by casting the returned pointer to a different type
with a weaker alignment (going through a char *, for instance). I agree
that it should at least be a warning.

Thanks again,
Hal

>
> Thanks,
> Schwieb
>
> On Nov 30, 2012, at 9:41 PM, Hal Finkel <[email protected]> wrote:
>
> > ----- Original Message -----
> >> From: "Erik Schwiebert" <[email protected]>
> >> To: "Alex Rosenberg" <[email protected]>, "Hal Finkel" <[email protected]>
> >> Cc: "llvm-commits" <[email protected]>, "llvm cfe" <[email protected]>
> >> Sent: Friday, November 30, 2012 11:16:18 PM
> >> Subject: Re: [cfe-commits] [llvm-commits] [PATCH] __builtin_assume_aligned for Clang and LLVM
> >>
> >> What happens if someone uses the intrinsic to *weaken* the
> >> alignment of the given type? i.e.,
> >>
> >> double *d;
> >> d = __builtin_assume_aligned(d, 1);
> >>
> >> C99 6.3.2.3p7 and C++11 [expr.reinterpret.cast]p17 say it is
> >> undefined (an issue with which I am all too familiar these days).
> >> Should the compiler reject the code or allow it? I note that the
> >> compiler allows __attribute__((aligned(1))), so this should
> >> probably also be allowed, but anyone using it must proceed with
> >> caution.
> >
> > The code added to LLVM will ignore this assumption. I did this in
> > order to follow (what I believe to be) gcc's convention: alignment
> > attributes and intrinsics can only strengthen alignment
> > requirements, while 'packing' attributes can weaken them. Do you
> > think that this is the right approach for us to follow? In your
> > example, Clang could certainly issue a warning. That is probably a
> > good thing for it to do.
> >
> > I forgot to mention this in the documentation; thanks for pointing
> > this out.
> >
> > -Hal
> >
> >> Schwieb
> >>
> >> On Nov 30, 2012, at 6:44 PM, Alex Rosenberg <[email protected]> wrote:
> >>
> >>> I'd love a more general assume mechanism that other optimizations
> >>> could use; e.g., alignment would simply be an available (x & mask)
> >>> expression for the suitable passes to take advantage of.
> >>>
> >>> Sent from my iPad
> >>>
> >>> On Nov 30, 2012, at 4:14 PM, Hal Finkel <[email protected]> wrote:
> >>>
> >>>> Hi everyone,
> >>>>
> >>>> Many compilers provide a way, through either pragmas or
> >>>> intrinsics, for the user to assert stronger alignment
> >>>> requirements on a pointee than is otherwise implied by the
> >>>> pointer's type. gcc now provides an intrinsic for this purpose,
> >>>> __builtin_assume_aligned, and the attached patches (one for
> >>>> Clang and one for LLVM) implement that intrinsic using a
> >>>> corresponding LLVM intrinsic, and provide an infrastructure to
> >>>> take advantage of this new information.
> >>>>
> >>>> ** BEGIN justification -- skip this if you don't care ;) **
> >>>> First, let me provide some justification. It is currently
> >>>> possible in Clang, using gcc-style (or C++11-style) attributes,
> >>>> to create typedefs with stronger alignment requirements than the
> >>>> original type. This is a useful feature, but it has
> >>>> shortcomings. First, for the purpose of allowing the compiler to
> >>>> create vectorized code with aligned loads and stores, these
> >>>> typedefs are awkward to use, and even more awkward to use
> >>>> correctly.
> >>>> For example, if I have as a base case:
> >>>>
> >>>> foo (double *a, double *b) {
> >>>>   for (int i = 0; i < N; ++i)
> >>>>     a[i] = b[i] + 1;
> >>>> }
> >>>>
> >>>> and I want to say that a and b are both 16-byte aligned, I can
> >>>> write instead:
> >>>>
> >>>> typedef double __attribute__((aligned(16))) double16;
> >>>>
> >>>> foo (double16 *a, double16 *b) {
> >>>>   for (int i = 0; i < N; ++i)
> >>>>     a[i] = b[i] + 1;
> >>>> }
> >>>>
> >>>> and this might work; the loads and stores will be tagged as
> >>>> 16-byte aligned, and we can vectorize the loop into, for
> >>>> example, a loop over <2 x double>. The problem is that the code
> >>>> is now incorrect: it implies that *all* of the loads and stores
> >>>> are 16-byte aligned, and this is not true; only every other one
> >>>> is 16-byte aligned. It is possible to correct this problem by
> >>>> manually unrolling the loop by a factor of 2:
> >>>>
> >>>> foo (double16 *a, double16 *b) {
> >>>>   for (int i = 0; i < N; i += 2) {
> >>>>     a[i] = b[i] + 1;
> >>>>     ((double *) a)[i+1] = ((double *) b)[i+1] + 1;
> >>>>   }
> >>>> }
> >>>>
> >>>> but this is awkward and error-prone.
> >>>>
> >>>> With the intrinsic, this is easier:
> >>>>
> >>>> foo (double *a, double *b) {
> >>>>   a = __builtin_assume_aligned(a, 16);
> >>>>   b = __builtin_assume_aligned(b, 16);
> >>>>   for (int i = 0; i < N; ++i)
> >>>>     a[i] = b[i] + 1;
> >>>> }
> >>>>
> >>>> This code can be vectorized with aligned loads and stores, and
> >>>> even if it is not vectorized, it will remain correct.
> >>>>
> >>>> The second problem with the purely type-based approach is that
> >>>> it requires manual loop unrolling and inlining. Because the
> >>>> intrinsics are evaluated after inlining (and after loop
> >>>> unrolling), the optimizer can use the alignment assumptions
> >>>> specified in the caller when generating code for an inlined
> >>>> callee. This is a very important capability.
> >>>>
> >>>> The need to apply the alignment assumptions after inlining and
> >>>> loop unrolling necessitates placing most of the infrastructure
> >>>> for this into LLVM, with Clang only generating LLVM intrinsics.
> >>>> In addition, to take full advantage of the information provided,
> >>>> it is necessary to look at loop-dependent pointer offsets and
> >>>> strides; ScalarEvolution provides the appropriate framework for
> >>>> doing this.
> >>>> ** END justification **
> >>>>
> >>>> Mirroring the gcc (and now Clang) intrinsic, the corresponding
> >>>> LLVM intrinsic is:
> >>>>
> >>>> <t1>* @llvm.assume.aligned.p<s><t1>.<t2>(<t1>* addr, i32 alignment, <int t2> offset)
> >>>>
> >>>> which asserts that the address returned is offset bytes above an
> >>>> address with the specified alignment. The attached patch makes
> >>>> some simple changes to several analysis passes (like BasicAA and
> >>>> SE) to allow them to 'look through' the intrinsic. It also adds
> >>>> a transformation pass that propagates the alignment assumptions
> >>>> to loads and stores directly dependent on the intrinsic's return
> >>>> value. Once this is done, the intrinsics are removed so that
> >>>> they don't interfere with the remaining optimizations.
> >>>>
> >>>> The patches are attached. I've also uploaded these to
> >>>> llvm-reviews (this is my first time trying this, so please let
> >>>> me know if I should do something differently):
> >>>>
> >>>> Clang - http://llvm-reviews.chandlerc.com/D149
> >>>> LLVM - http://llvm-reviews.chandlerc.com/D150
> >>>>
> >>>> Please review.
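(For completeness, since it isn't shown above: the offset parameter of
the LLVM intrinsic mirrors gcc's optional third argument to
__builtin_assume_aligned, which asserts a known misalignment. A minimal
sketch of that form, with hypothetical names:

/* p points 8 bytes past a 16-byte boundary, e.g. &a[1] for a
   16-byte-aligned double array a. The third argument records that
   p - 8 is 16-byte aligned. */
double sum_from_second(double *p, int n)
{
    p = __builtin_assume_aligned(p, 16, 8);
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += p[i];
    return s;
}

This is exactly the every-other-element situation from the
justification above, expressed without manual unrolling.)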
> >>>> Nadav,
> >>>>
> >>>> One shortcoming of the current patch is that, while it will work
> >>>> to vectorize loops using unroll+bb-vectorize, it will not
> >>>> automatically work with the loop vectorizer. To really be
> >>>> effective, the transformation pass needs to run after loop
> >>>> unrolling, and loop unrolling is (and should be) run after loop
> >>>> vectorization. Even if run prior to loop vectorization, it would
> >>>> not directly help the loop vectorizer because the necessary
> >>>> strided loads and stores don't yet exist. As a second step, I
> >>>> think we should split the current transformation pass into a
> >>>> transformation pass and an analysis pass. This analysis pass can
> >>>> then be used by the loop vectorizer (and any other early passes
> >>>> that want the information) before the final rewriting and
> >>>> intrinsic deletion is done.

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
