... though I suspect that really profiting from masked vectorization like this requires tackling it at a much lower level in the compiler, likely even as an LLVM optimization pass, guided only by some hints from Julia itself.

*Sebastian Good*

On Wed, Sep 24, 2014 at 10:16 AM, Sebastian Good wrote:

> I've been thinking about this a bit, and as usual, Julia's multiple
> dispatch might make such a thing possible in a novel way. The heart of ISPC
> is allowing a function that looks like
>
>     int addScalar(int a, int b) { return a + b; }
>
> to effectively be
>
>     vector<int> addVector(vector<int> a, vector<int> b) { return /* AVX version of */ a + b; }
>
> This is what vectorizing compilers do, but they don't handle control flow
> like ISPC does. Also, ISPC's "foreach" and "foreach_tiled" allow these
> vectorized functions to be consumed more efficiently, for instance by
> handling the ragged/unaligned front and back of arrays with scalar
> versions, and the middle bits with vectorized functions.
>
> With support for hardware vectors in Julia, you can start to imagine
> writing macros that automatically generate the relevant functions, e.g.
> generating addVector from addScalar. However, to do anything cleverer than
> the (already extremely clever) LLVM vectorizer, you have to expose masking
> operations. To handle incoherent/divergent control flow, you issue vector
> operations that are masked, allowing some lanes of the vector to stop
> participating in the program for a period. In a contrived example
>
>     int addScalar(int a, int b) { return a % 2 ? a + b : a - b; }
>
> would be turned into something like the below
>
>     vector<int> addVector(vector<int> a, vector<int> b) {
>         mask = all;                  // a register with all 1s, indicating all lanes participate
>         vector<int> mod = a % 2;     // vectorized, using mask
>         mask = maskwhere(mod != 0);
>         vector<int> result = a + b;  // vectorized, using mask
>         mask = invert(mask);
>         result = a - b;              // vectorized, using mask
>         return result;
>     }
>
> If you look at it closely, you've got versions generated for each function
> that are
> - scalar
> - vector-enabled, but for arbitrary length vectors
> - specialized for (one or more hardware) vector sizes
> - specialized by alignment (as vector sizes get bigger, e.g. the 32- and
>   64-byte AVX versions coming out, you can't just rely on the runtime to
>   align everything properly, it will be too wasteful)
>
> So, I think it's a big ask, but I think it could be produced
> incrementally. We'd need help from the Julia language/standard library
> itself to expose masked vector operations.
>
> *Sebastian Good*
>
> On Tue, Sep 23, 2014 at 2:52 PM, Jeff Waller wrote:
>
>> Could this theoretical thing be approached incrementally? Meaning here's
>> a project and here's some intermediate results and now it's 1.5x faster, and
>> now here's something better and it's 2.7x, all the while the goal is apparent
>> but difficult.
>>
>> Or would it kind of be all works or doesn't?
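
To make the masking idea in the quoted addVector example a bit more concrete, here is a minimal sketch in Python that emulates vector lanes with plain lists. It is only an illustration of the control flow, not how ISPC or LLVM actually implement it, and it says nothing about performance. The names `LANES`, `masked_assign`, `add_vector` and `add_arrays` are hypothetical, and the width of 4 is an arbitrary assumption. The driver at the end mimics the "foreach" idea in a simplified form: full chunks go through the masked version and only the ragged tail falls back to the scalar one (alignment of the front of the array is ignored here).

```python
LANES = 4  # hypothetical vector width, chosen only for illustration


def add_scalar(a, b):
    """Scalar reference: the contrived example from the quoted message."""
    return a + b if a % 2 != 0 else a - b


def masked_assign(dest, src, mask):
    """Write src into dest only in lanes where mask is True (a blend)."""
    return [s if m else d for d, s, m in zip(dest, src, mask)]


def add_vector(a, b):
    """Emulate the masked version: both branches run, the mask picks lanes."""
    assert len(a) == len(b)
    result = [0] * len(a)

    # Lanes where the condition holds take the 'then' branch.
    mask = [x % 2 != 0 for x in a]
    sums = [x + y for x, y in zip(a, b)]
    result = masked_assign(result, sums, mask)

    # Invert the mask so the remaining lanes take the 'else' branch.
    mask = [not m for m in mask]
    diffs = [x - y for x, y in zip(a, b)]
    result = masked_assign(result, diffs, mask)
    return result


def add_arrays(a, b):
    """foreach-style driver: vector chunks first, scalar code for the tail."""
    assert len(a) == len(b)
    n = len(a)
    full = n - n % LANES  # number of elements covered by whole chunks
    out = []
    for i in range(0, full, LANES):
        out.extend(add_vector(a[i:i + LANES], b[i:i + LANES]))
    for i in range(full, n):
        out.append(add_scalar(a[i], b[i]))
    return out
```

For any two equal-length integer lists, `add_arrays(a, b)` should agree element by element with applying `add_scalar` to each pair, which is the property a generated vector version would have to preserve.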
