On Wednesday, 7 September 2016 at 00:21:23 UTC, Manu wrote:
The end of a scan line is special cased . If I need 12 pixels for the last iteration but there are only 8 left, an instance of Kernel::InputVector is allocated on stack, 8 remaining pixels are memcpy into it then send to the kernel. Output from kernel are also assigned to a stack variable first, then memcpy 8 pixels to the output buffer.

Right, and this is a classic problem with this sort of function; it is
only more efficient if numElements is suitable long.
See, I often wonder if it would be worth being able to provide both functions, a scalar and array version, and have the algorithms select
between them intelligently.

We normally process full HD or higher resolution images so the overhead of having to copy the last iteration was negligible.

It was fairly easy to put together a scalar version as they are much easier to write than the SIMD ones. In fact I had scalar version for every SIMD kernel, and use them for unit testing.

It shouldn't be hard to have the framework look at the buffer size and choose the scalar version when number of elements are small, it wasn't done that way simply because we didn't need it.

Reply via email to