On Tuesday, 6 September 2016 at 03:08:43 UTC, Manu wrote:
> I still stand by this, and I listed some reasons above.
> Auto-vectorisation is a nice opportunistic optimisation, but it can't
> be relied on. The key reason is that scalar arithmetic semantics are
> different than vector semantics, and auto-vectorisation tends to
> produce a whole bunch of extra junk code to carefully (usually
> pointlessly) preserve the scalar semantics that it's trying to
> vectorise. This will never end well.
> But the vectorisation isn't the interesting problem here, I'm really
> just interested in how to work these batch-processing functions into
> our nice modern pipeline statements without placing an unreasonable
> burden on the end-user, who shouldn't be expected to go out of their
> way. If they even have to start manually chunking, I think we've
> already lost; they won't know optimal chunk-sizes, or anything about
> alignment boundaries, cache, etc.

In a previous job I successfully created a small C++ library to perform pipelined SIMD image processing. I'm not sure how relevant it is, but I thought I'd share the design here; perhaps it'll give you guys some ideas.

Basically, users of this library only need to write simple kernel classes, something like this:

// A kernel that processes 4 pixels at a time
struct MySimpleKernel : Kernel<4>
{
    // Tell the library the input and output type
    using InputVector  = Vector<__m128, 1>;
    using OutputVector = Vector<__m128, 2>;

    template<typename T>
    OutputVector apply(const T& src)
    {
        // T will be deduced to Vector<__m128, 1>
        // which is an array of one __m128 element
        // Awesome SIMD code goes here...
        // And return the output vector
        return OutputVector(...);
    }
};

Of course the element type of InputVector and OutputVector does not have to be __m128; it can be other types like int or float.
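For the curious, the Vector wrapper referred to above might look roughly like this. This is a guess at its shape, not the original code: a fixed-size array of N elements of type T.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical sketch of the Vector type: N elements of type T,
// where T could be __m128, float, int, etc.
template <typename T, std::size_t N>
struct Vector
{
    std::array<T, N> elems;

    T&       operator[](std::size_t i)       { return elems[i]; }
    const T& operator[](std::size_t i) const { return elems[i]; }

    static constexpr std::size_t size() { return N; }
};
```

So Vector<__m128, 1> would be a single SSE register, and Vector<__m128, 2> a pair of them.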

The cool thing is that kernels can be chained together with the >> operator.

So assume we have another kernel:

struct AnotherKernel : Kernel<3>
{
...
};

Then we can create a processing pipeline with these 2 kernels:

InputBuffer(...) >> MySimpleKernel() >> AnotherKernel() >> OutputBuffer(...);
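The width-fusion machinery aside, the chaining itself can be illustrated with a toy operator>> that composes two kernels into one whose apply() feeds the first kernel's output into the second. This is only a sketch of the idea, not the library's real implementation:

```cpp
#include <cassert>

// Toy composite kernel: applies kernel A, then feeds the result to B.
template <typename A, typename B>
struct Composite
{
    A a;
    B b;

    template <typename In>
    auto apply(const In& src) { return b.apply(a.apply(src)); }
};

// Chaining operator: MyKernelA() >> MyKernelB() yields a Composite.
template <typename A, typename B>
Composite<A, B> operator>>(A a, B b)
{
    return Composite<A, B>{a, b};
}

// Two trivial stand-in kernels operating on plain ints.
struct AddOne   { int apply(int x) const { return x + 1; } };
struct TimesTwo { int apply(int x) const { return x * 2; } };
```

The real library additionally matches the stages' pixel widths at compile time, but the nesting principle is the same.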

Then some template magic figures out that the LCM of the two kernels' pixel widths is lcm(4, 3) = 12, so they are fused into a composite kernel of pixel width 12. The above line compiles down to a single function invocation, with a main loop that advances through the source buffer 12 pixels per iteration, calling MySimpleKernel 3 times and then AnotherKernel 4 times.

Any number of kernels can be chained together in this way, as long as your compiler doesn't explode.

At the time, my benchmarks showed that pipelines generated this way often rivaled the speed of hand-tuned loops.
