On Tuesday, 6 September 2016 at 03:08:43 UTC, Manu wrote:
> I still stand by this, and I listed some reasons above.
> Auto-vectorisation is a nice opportunistic optimisation, but it can't
> be relied on. The key reason is that scalar arithmetic semantics are
> different than vector semantics, and auto-vectorisation tends to
> produce a whole bunch of extra junk code to carefully (usually
> pointlessly) preserve the scalar semantics that it's trying to
> vectorise. This will never end well.
> But the vectorisation isn't the interesting problem here, I'm really
> just interested in how to work these batch-processing functions into
> our nice modern pipeline statements without placing an unreasonable
> burden on the end-user, who shouldn't be expected to go out of their
> way. If they even have to start manually chunking, I think we've
> already lost; they won't know optimal chunk-sizes, or anything about
> alignment boundaries, cache, etc.

In a previous job I successfully created a small C++ library to perform pipelined SIMD image processing. I'm not sure how relevant it is, but I thought I'd share the design here; perhaps it'll give you guys some ideas.

Basically, users of this library only need to write simple kernel classes, something like this:

// A kernel that processes 4 pixels at a time
struct MySimpleKernel : Kernel<4>
{
    // Tell the library the input and output type
    using InputVector  = Vector<__m128, 1>;
    using OutputVector = Vector<__m128, 2>;

    template<typename T>
    OutputVector apply(const T& src)
    {
        // T will be deduced to Vector<__m128, 1>
        // which is an array of one __m128 element
        // Awesome SIMD code goes here...
        // And return the output vector
        return OutputVector(...);
    }
};

Of course the element type of InputVector and OutputVector does not have to be __m128; it can be other types like int or float.
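For the curious, the Vector wrapper referred to above might look roughly like this. This is a guess at its shape, not the original code: a fixed-size array of N elements of type T.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical sketch of the Vector type: N elements of type T,
// where T could be __m128, float, int, etc.
template <typename T, std::size_t N>
struct Vector
{
    std::array<T, N> elems;

    T&       operator[](std::size_t i)       { return elems[i]; }
    const T& operator[](std::size_t i) const { return elems[i]; }

    static constexpr std::size_t size() { return N; }
};
```

So Vector<__m128, 1> would be a single SSE register, and Vector<__m128, 2> a pair of them.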

The cool thing is that kernels can be chained together with the >> operator.

So assume we have another kernel:

struct AnotherKernel : Kernel<3>
{
...
};

Then we can create a processing pipeline with these 2 kernels:

InputBuffer(...) >> MySimpleKernel() >> AnotherKernel() >> OutputBuffer(...);
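The width-fusion machinery aside, the chaining itself can be illustrated with a toy operator>> that composes two kernels into one whose apply() feeds the first kernel's output into the second. This is only a sketch of the idea, not the library's real implementation:

```cpp
#include <cassert>

// Toy composite kernel: applies kernel A, then feeds the result to B.
template <typename A, typename B>
struct Composite
{
    A a;
    B b;

    template <typename In>
    auto apply(const In& src) { return b.apply(a.apply(src)); }
};

// Chaining operator: MyKernelA() >> MyKernelB() yields a Composite.
template <typename A, typename B>
Composite<A, B> operator>>(A a, B b)
{
    return Composite<A, B>{a, b};
}

// Two trivial stand-in kernels operating on plain ints.
struct AddOne   { int apply(int x) const { return x + 1; } };
struct TimesTwo { int apply(int x) const { return x * 2; } };
```

The real library additionally matches the stages' pixel widths at compile time, but the nesting principle is the same.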

Then some template magic figures out that the LCM of the two kernels' pixel widths is lcm(4, 3) = 12, so they are fused into a composite kernel of pixel width 12. The above line compiles down to a single function invocation, with a main loop that advances through the source buffer 12 pixels per iteration, calling MySimpleKernel 3 times and then AnotherKernel 4 times.

Any number of kernels can be chained together in this way, as long as your compiler doesn't explode.

At the time, my benchmarks showed that pipelines generated this way often rivaled the speed of hand-tuned loops.
