You are probably right. This is my first time doing this sort of thing. But
my impression is that even small modifications to the algorithm
(avoiding conversions as much as possible, for example) have a significant
impact. I think the reason is that I am iterating over 100_000 frames of
640x480 pixels.
For instance, I have just reached 1200fps simply by replacing how the row
indices are calculated. Changing this:
template clamp(val:cint, max_val:cint):untyped =
  min( max(val, 0), max_val).uint
.....
for row1 in 0..<height:
  let row0 = clamp(row1-1, h)
  let row2 = clamp(row1+1, h)
  ......
into:
iterator `...`[T](ini:T,`end`:T):tuple[a:uint,b:uint,c:uint] =
  let ini = ini.uint
  let `end` = `end`.uint
  yield (a:ini,b:ini,c:ini+1)
  for i in ini+1..`end`-1:
    yield (a:i-1,b:i,c:i+1)
  yield (a:`end`-1,b:`end`,c:`end`)
...
for (row0, row1, row2) in 0...h:
  ....
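As a quick sanity check of the iterator on its own, here is a minimal standalone sketch. The iterator is copied from above so the snippet compiles without VapourSynth; the `triples` variable and the range `0 ... 3` are only for illustration. For four rows it should give (0,0,1), (0,1,2), (1,2,3), (2,3,3), that is, the first and last rows reuse themselves as their missing neighbour. Note that it assumes `end` is greater than `ini`: with a single row the `uint` arithmetic in `end`-1 would wrap around.

```nim
# Same iterator as in the post, reproduced so this snippet is self-contained.
iterator `...`[T](ini: T, `end`: T): tuple[a: uint, b: uint, c: uint] =
  let ini = ini.uint
  let `end` = `end`.uint
  # First row: there is no row above, so the row itself is reused.
  yield (a: ini, b: ini, c: ini + 1)
  # Interior rows: plain previous/current/next indices, no clamping needed.
  for i in ini + 1 .. `end` - 1:
    yield (a: i - 1, b: i, c: i + 1)
  # Last row: there is no row below, so the row itself is reused.
  yield (a: `end` - 1, b: `end`, c: `end`)

# Collect the (previous, current, next) triples for rows 0 to 3.
var triples: seq[(uint, uint, uint)]
for (r0, r1, r2) in 0 ... 3:
  triples.add((r0, r1, r2))

echo triples
```

The point of this shape is that the border handling happens once per row (or column) sequence instead of two `min`/`max` calls on every iteration.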
The calculation now looks like this:
#import math
import ../vapoursynth
import math

iterator `...`[T](ini:T,`end`:T):tuple[a:uint,b:uint,c:uint] =
  let ini = ini.uint
  let `end` = `end`.uint
  yield (a:ini,b:ini,c:ini+1)
  for i in ini+1..`end`-1:
    yield (a:i-1,b:i,c:i+1)
  yield (a:`end`-1,b:`end`,c:`end`)

proc apply_kernel*(src:ptr VSFrameRef, dst:ptr VSFrameRef, kernel:array[9, int32], mul:int, den:int) =
  #let n = (( math.sqrt(kernel.len.float).int - 1 ) / 2).int
  for pn in 0..<src.numPlanes:  # pn= Plane Number
    # These cost 60fps (if I take them outside of this loop)
    let height = src.height( pn )
    let width = src.width( pn )
    let h = (height-1)
    let w = (width-1)
    for (row0, row1, row2) in 0...h:
      let r0 = src[pn, row0]
      let r1 = src[pn, row1]
      let r2 = src[pn, row2]
      let w1 = dst[pn, row1]
      for (col0, col1, col2) in 0...w:
        let value:int32 = r0[col0].int32     + r0[col1].int32 * 2 + r0[col2].int32 +
                          r1[col0].int32 * 2 + r1[col1].int32 * 4 + r1[col2].int32 * 2 +
                          r2[col0].int32     + r2[col1].int32 * 2 + r2[col2].int32
        w1[col1] = (value / den).uint8
With this I have managed to go from the original 80fps to 1200fps, but I am
still very far from what I should be able to reach.
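
To put those two figures in perspective, here is a rough back-of-the-envelope sketch. It assumes a single full-resolution 640x480 plane per frame (subsampled chroma planes would change the totals), and the helper names `pixelsPerSecond` and `nanosPerPixel` are mine, purely for illustration. Under that assumption, 80fps leaves roughly 41 ns per output pixel and 1200fps roughly 2.7 ns.

```nim
const
  width = 640                      # frame width in pixels (from the post)
  height = 480                     # frame height in pixels (from the post)
  pixelsPerPlane = width * height  # assumes one full-resolution plane

proc pixelsPerSecond(fps: int): int =
  ## Output pixels produced per second for one plane at the given frame rate.
  fps * pixelsPerPlane

proc nanosPerPixel(fps: int): float =
  ## Average time available per output pixel, in nanoseconds.
  1e9 / pixelsPerSecond(fps).float

# Compare the original and the current frame rates mentioned above.
for fps in [80, 1200]:
  echo fps, " fps: ", pixelsPerSecond(fps), " pixels/s, ",
       nanosPerPixel(fps), " ns per pixel"
```

With only a few nanoseconds per pixel, anything extra in the inner loop (a conversion, a clamp, a division) is a noticeable fraction of the budget, which matches what I am seeing.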
For reference, the analogous C++ code (using float instead of int32) is much
faster:
auto srcp = reinterpret_cast<const float*>(vsapi->getReadPtr(frame, plane));
auto dstp = reinterpret_cast<float*>(vsapi->getWritePtr(dst, plane));
auto h = vsapi->getFrameHeight(frame, plane);
auto w = vsapi->getFrameWidth(frame, plane);

auto gauss = [&](auto y, auto x) {
    auto clamp = [&](auto val, auto bound) {
        return std::min(std::max(val, 0), bound - 1);
    };
    auto above = srcp + clamp(y - 1, h) * w;
    auto current = srcp + y * w;
    auto below = srcp + clamp(y + 1, h) * w;
    auto conv = above[clamp(x - 1, w)] + above[x] * 2 + above[clamp(x + 1, w)] +
        current[clamp(x - 1, w)] * 2 + current[x] * 4 + current[clamp(x + 1, w)] * 2 +
        below[clamp(x - 1, w)] + below[x] * 2 + below[clamp(x + 1, w)];
    return conv / 16;
};

for (auto y = 0; y < h; y++)
    for (auto x = 0; x < w; x++)
        (dstp + y * w)[x] = gauss(y, x);
When you mention a memory bottleneck, what do you mean? Where should I look?
Thanks a lot.