... though I suspect to really profit from masked vectorization like this,
it needs to be tackled at a much lower level in the compiler, likely even
as an LLVM optimization pass, guided only by some hints from Julia itself.
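
To make the idea concrete, here is a minimal sketch in plain C++ of the masked transformation quoted below, plus a foreach-style driver that uses the vector version for full blocks and falls back to the scalar version for the ragged tail. It only emulates masking with ordinary loops and is not what ISPC or LLVM actually emit. The names kLanes, VecInt, Mask and addAll are hypothetical, the width of 8 lanes is an arbitrary assumption, and both inputs are assumed to have the same length.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical lane count; a real target would pick this from the hardware.
constexpr std::size_t kLanes = 8;
using VecInt = std::array<int, kLanes>;
using Mask = std::array<bool, kLanes>;

// Scalar reference version, as in the quoted example.
int addScalar(int a, int b) { return a % 2 ? a + b : a - b; }

// Emulated masked version: each branch runs over all lanes, but a lane
// only stores its result when the mask enables it for that branch.
VecInt addVector(const VecInt& a, const VecInt& b) {
    Mask mask{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        mask[i] = (a[i] % 2) != 0;  // lanes that take the "a + b" branch
    }
    VecInt result{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        if (mask[i]) result[i] = a[i] + b[i];
    }
    for (std::size_t i = 0; i < kLanes; ++i) {
        if (!mask[i]) result[i] = a[i] - b[i];  // inverted mask
    }
    return result;
}

// foreach-style driver: full blocks go through addVector, the leftover
// tail goes through addScalar. Assumes a.size() == b.size().
std::vector<int> addAll(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out(a.size());
    std::size_t i = 0;
    for (; i + kLanes <= a.size(); i += kLanes) {
        VecInt va{};
        VecInt vb{};
        for (std::size_t j = 0; j < kLanes; ++j) {
            va[j] = a[i + j];
            vb[j] = b[i + j];
        }
        const VecInt r = addVector(va, vb);
        for (std::size_t j = 0; j < kLanes; ++j) {
            out[i + j] = r[j];
        }
    }
    for (; i < a.size(); ++i) {
        out[i] = addScalar(a[i], b[i]);
    }
    return out;
}
```

The point of the sketch is only that both branches execute for every block and the mask decides which lanes keep which result. Whether that beats the scalar loop depends on real masked instructions, which is why I suspect it belongs in the compiler.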

*Sebastian Good*


On Wed, Sep 24, 2014 at 10:16 AM, Sebastian Good <
[email protected]> wrote:

> I've been thinking about this a bit, and as usual, Julia's multiple
> dispatch might make such a thing possible in a novel way. The heart of ISPC
> is allowing a function that looks like
>
> int addScalar (int a, int b) { return a + b; }
>
> effectively be
>
> vector<int> addVector (vector<int> a, vector<int> b) { return /*AVX
> version of */a + b; }
>
> This is what vectorizing compilers do, but they don't handle control flow
> like ISPC does. Also, ISPC's "foreach" and "foreach_tiled" allow these
> vectorized functions to be consumed more efficiently, for instance by
> handling the ragged/unaligned front and back of arrays with scalar
> versions, and the middle bits with vectorized functions.
>
> With support for hardware vectors in Julia, you can start to imagine
> writing macros that automatically generate the relevant functions, e.g.
> generating addVector from addScalar. However, to do anything cleverer than
> the (already extremely clever) LLVM vectorizer, you have to expose masking
> operations. To handle incoherent/divergent control flow, you issue vector
> operations that are masked, allowing some lanes of the vector to stop
> participating in the program for a period.  In a contrived example
>
> int addScalar(int a, int b) { return a % 2 ? a + b : a - b; }
>
> would be turned into something like the below
>
> vector<int> addVector(vector<int> a, vector<int> b) {
>   mask = all; // a register with all 1s, indicating all lanes participate
>   vector<int> mod = a % 2; // vectorized, using mask
>   mask = maskwhere(mod != 0);
>   vector<int> result = a + b; // vectorized, using mask
>   mask = invert(mask);
>   result = a - b; // vectorized, using mask
>   return result;
> }
>
> If you look at it closely, you've got versions generated for each function
> that are
> - scalar
> - vector-enabled, but for arbitrary length vectors
> - specialized for (one or more hardware) vector sizes
> - specialized by alignment (as vector sizes get bigger, e.g. the 32- and
> 64-byte AVX versions coming out, you can't just rely on the runtime to
> align everything properly, it will be too wasteful)
>
> So, I think it's a big ask, but I think it could be produced
> incrementally. We'd need help from the Julia language/standard library
> itself to expose masked vector operations.
>
>
> *Sebastian Good*
>
>
> On Tue, Sep 23, 2014 at 2:52 PM, Jeff Waller <[email protected]> wrote:
>
>> Could this theoretical thing be approached incrementally?  Meaning here's
>> a project and here's some intermediate results and now it's 1.5x faster, and
>> now here's something better and it's 2.7x, all the while the goal is apparent
>> but difficult.
>>
>> Or would it be all or nothing: either it works or it doesn't?
>>
>
>
