You are probably right. This is my first time doing this sort of thing. But my feeling is that even small modifications to the algorithm (avoiding conversions as much as possible, for example) have a significant impact. I think the reason is that I am iterating over 100_000 frames of 640x480 pixels.
For instance, I have just reached 1200fps simply by replacing how the row numbers are calculated. Changing this:

```nim
template clamp(val:cint, max_val:cint):untyped =
  min( max(val, 0), max_val).uint
.....
for row1 in 0..<height:
  let row0 = clamp(row1-1, h)
  let row2 = clamp(row1+1, h)
  ......
```

into:

```nim
iterator `...`[T](ini:T,`end`:T):tuple[a:uint,b:uint,c:uint] =
  let ini = ini.uint
  let `end` = `end`.uint
  yield (a:ini,b:ini,c:ini+1)
  for i in ini+1..`end`-1:
    yield (a:i-1,b:i,c:i+1)
  yield (a:`end`-1,b:`end`,c:`end`)
...
for (row0, row1, row2) in 0...h:
  ....
```

The calculation now looks like this:

```nim
#import math
import ../vapoursynth
import math

iterator `...`[T](ini:T,`end`:T):tuple[a:uint,b:uint,c:uint] =
  let ini = ini.uint
  let `end` = `end`.uint
  yield (a:ini,b:ini,c:ini+1)
  for i in ini+1..`end`-1:
    yield (a:i-1,b:i,c:i+1)
  yield (a:`end`-1,b:`end`,c:`end`)

proc apply_kernel*(src:ptr VSFrameRef, dst:ptr VSFrameRef, kernel:array[9, int32], mul:int, den:int) =
  #let n = (( math.sqrt(kernel.len.float).int - 1 ) / 2).int
  for pn in 0..<src.numPlanes:  # pn= Plane Number
    # These cost 60fps (if I take them outside of this loop)
    let height = src.height( pn )
    let width = src.width( pn )
    let h = (height-1)
    let w = (width-1)
    for (row0, row1, row2) in 0...h:
      let r0 = src[pn, row0]
      let r1 = src[pn, row1]
      let r2 = src[pn, row2]
      let w1 = dst[pn, row1]
      for (col0, col1, col2) in 0...w:
        let value:int32 = r0[col0].int32     + r0[col1].int32 * 2 + r0[col2].int32     +
                          r1[col0].int32 * 2 + r1[col1].int32 * 4 + r1[col2].int32 * 2 +
                          r2[col0].int32     + r2[col1].int32 * 2 + r2[col2].int32
        w1[col1] = (value / den).uint8
```

With this I have managed to go from the original 80fps to 1200fps, but I am still very far from what I should be able to reach.
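To double-check that the new iterator visits the same neighbours as the clamp version, here is a small sketch written in C++ (the language of the reference code further down). The names `Triple`, `border_split_triples` and `clamped_triples` are made up for this illustration and are not part of VapourSynth or of my filter. It also assumes the plane is at least 2 pixels in each dimension, which is the same assumption the iterator makes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: (previous, current, next) index for one row or column.
struct Triple {
    std::size_t prev, cur, next;
    bool operator==(const Triple& o) const {
        return prev == o.prev && cur == o.cur && next == o.next;
    }
};

// Same scheme as the Nim `...` iterator: the two borders are emitted
// explicitly, so the interior needs no min/max at all.
// `last` is the highest valid index (height - 1 or width - 1), assumed >= 1.
std::vector<Triple> border_split_triples(std::size_t last) {
    assert(last >= 1);
    std::vector<Triple> out;
    out.reserve(last + 1);
    out.push_back({0, 0, 1});                // first index: "previous" repeats the border
    for (std::size_t i = 1; i < last; ++i)
        out.push_back({i - 1, i, i + 1});    // interior: plain neighbours
    out.push_back({last - 1, last, last});   // last index: "next" repeats the border
    return out;
}

// Reference: the clamp-on-every-access scheme from the first version.
std::vector<Triple> clamped_triples(std::size_t last) {
    std::vector<Triple> out;
    out.reserve(last + 1);
    const long hi = static_cast<long>(last);
    auto clamp = [hi](long v) {
        return static_cast<std::size_t>(std::min(std::max(v, 0L), hi));
    };
    for (long i = 0; i <= hi; ++i)
        out.push_back({clamp(i - 1), static_cast<std::size_t>(i), clamp(i + 1)});
    return out;
}
```

Comparing `border_split_triples(479)` with `clamped_triples(479)` (the row range of a 640x480 plane) should give identical sequences, so if the sketch is right, the speedup comes only from dropping the per-pixel `min`/`max` and conversions, not from a change in the result.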
The analogous C++ code (using float instead of int32), for reference, is way faster:

```cpp
auto srcp = reinterpret_cast<const float*>(vsapi->getReadPtr(frame, plane));
auto dstp = reinterpret_cast<float*>(vsapi->getWritePtr(dst, plane));
auto h = vsapi->getFrameHeight(frame, plane);
auto w = vsapi->getFrameWidth(frame, plane);

auto gauss = [&](auto y, auto x) {
    auto clamp = [&](auto val, auto bound) {
        return std::min(std::max(val, 0), bound - 1);
    };

    auto above = srcp + clamp(y - 1, h) * w;
    auto current = srcp + y * w;
    auto below = srcp + clamp(y + 1, h) * w;

    auto conv = above[clamp(x - 1, w)] + above[x] * 2 + above[clamp(x + 1, w)] +
                current[clamp(x - 1, w)] * 2 + current[x] * 4 + current[clamp(x + 1, w)] * 2 +
                below[clamp(x - 1, w)] + below[x] * 2 + below[clamp(x + 1, w)];
    return conv / 16;
};

for (auto y = 0; y < h; y++)
    for (auto x = 0; x < w; x++)
        (dstp + y * w)[x] = gauss(y, x);
```

When you mention a memory bottleneck, what do you mean? Where should I look?

Thanks a lot.