> I am aware of this. I am just scaffolding the library, just to avoid doing something fundamentally wrong (like reaching 80fps instead of >1000fps ;oP). In the future, the library should be more general. In this case, I take advantage of the fact that it is always 1 byte per sample. To be honest I don't know why it is using float in the C++ version. The API states that both the read and write pointers are uint8. Doing floating point vectors is something that I should try in the future.
I'm pretty sure it's supposed to be `void *`, and the `sampleType` and `bitsPerSample` are telling you what to cast it to. Anyway, very nice to hear about your improvements.

Arraymancer indexing via `t[i, j]` would be slow because:

1. You are copying the data to the tensor. I have a PR to allow zero-copy tensors over pointers, but it is still pending.
2. Arraymancer has slightly more overhead on iteration than you might expect, because it can handle N-dimensional tensors, not just (Plane, Height, Width) images. It can handle (Frame, Plane, Depth, Height, Width) for a 3D video, for example.

There are iterators for the simple cases, but convolution is something I'm leaving for a proper compiler. I.e. hopefully, in the future, the way to specify a convolution would be similar to Halide ([https://halide-lang.org/assets/lectures/Halide_CVPR_intro.pdf](https://halide-lang.org/assets/lectures/Halide_CVPR_intro.pdf)) and that would lead to optimal code:

```nim
proc convolve(A, Filter: Function): Function =
  # Generator for the convolution function
  var i, j: Domain
  var convH, convW: Function # Definition of the result function

  # Filter F = [1, 2, 1]
  # So 2D application is:
  # [[1, 2, 1],
  #  [2, 4, 2],
  #  [1, 2, 1]]

  convH[i, j] = A[i-1, j] * F[0] + A[i, j] * F[1] + A[i+1, j] * F[2]
  convW[i, j] = convH[i, j-1] * F[0] + convH[i, j] * F[1] + convH[i, j+1] * F[2]

  # Optimization of the definition - completely optional
  convH.compute_at(convW, j) # Compute the required region of convH before convW starts (i.e. merged)
  # convH.compute_root()     # Alternatively do the height convolve before the width convolve (i.e. 2 steps)
  convW.tile(i, ii, 64)      # Split i in tiles of size 64, with ii as the tile iterator
  convW.parallel(i)          # Iterating on tiles is done in parallel
  convW.tile(j, jj, 64)      # Split j in tiles of size 64 as well, with jj as the tile iterator
  convW.vectorize(jj, 8)     # Iterating on jj is done with size 8 vectors if possible (i.e. AVX on x86)

  return convW

generate convolve:
  proc foobar(a: Tensor[float32], filter: array[3, float32]): Tensor[float32]
```

(From Lux: [https://github.com/numforge/laser/blob/d1e6ae61/laser/lux_compiler/lux_dsl.nim#L43-L58](https://github.com/numforge/laser/blob/d1e6ae61/laser/lux_compiler/lux_dsl.nim#L43-L58))

The importance of loop tiling (also called loop blocking) is highlighted in this simple transposition benchmark: [https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/transpose/transpose_bench.nim](https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/transpose/transpose_bench.nim)

There is a 3x factor between the simple parallel transposition (`||` is the OpenMP parallel for loop):

```nim
proc benchNaive(a: seq[float32], nb_samples: int) =
  var output = newSeq[float32](out_size)
  withCompilerOptimHints()
  let pa{.restrict.} = cast[ptr UncheckedArray[float32]](a[0].unsafeAddr)
  let po{.restrict.} = cast[ptr UncheckedArray[float32]](output[0].addr)

  bench("Naive transpose"):
    discard
  do:
    for i in `||`(0, M-1):
      for j in `||`(0, N-1, "simd"): # This only adds "#pragma omp simd"
        po[i+j*M] = pa[j+i*N]
```

1D tiling:

```nim
proc benchCacheBlocking(a: seq[float32], nb_samples: int) =
  var output = newSeq[float32](out_size)
  withCompilerOptimHints()
  let pa{.restrict.} = a[0].unsafeAddr
  let po{.restrict.} = output[0].addr
  const blck = 64

  bench("Cache blocking"):
    discard
  do:
    {.emit: """
    // No min function in C ...
    #define min(a,b) (((a)<(b))?(a):(b))

    #pragma omp parallel for
    for (int i = 0; i < `M`; i+=`blck`)
      for (int j = 0; j < `N`; ++j)
        #pragma omp simd
        for (int ii = i; ii < min(i+`blck`,`M`); ++ii)
          `po`[ii+j*`M`] = `pa`[j+ii*`N`];
    """.}
```

And 2D tiling:

```nim
proc bench2Dtiling(a: seq[float32], nb_samples: int) =
  var output = newSeq[float32](out_size)
  withCompilerOptimHints()
  let pa{.restrict.} = a[0].unsafeAddr
  let po{.restrict.} = output[0].addr
  const blck = 128

  bench("2D Tiling"):
    discard
  do:
    {.emit: """
    // No min function in C ...
    #define min(a,b) (((a)<(b))?(a):(b))

    #pragma omp parallel for collapse(2)
    for (int i = 0; i < `M`; i+=`blck`)
      for (int j = 0; j < `N`; j+=`blck`)
        for (int ii = i; ii<i+`blck` && ii<`M`; ii++)
          #pragma omp simd
          for (int jj = j; jj<min(j+`blck`,`N`); jj++)
            `po`[ii+jj*`M`] = `pa`[jj+ii*`N`];
    """.}
```

And you can gain about 20% performance with prefetching:

```nim
proc bench2DtilingExchangedPrefetch(a: seq[float32], nb_samples: int) =
  var output = newSeq[float32](out_size)
  withCompilerOptimHints()
  let pa{.restrict.} = a[0].unsafeAddr
  let po{.restrict.} = output[0].addr
  const blck = 32

  bench("2D Tiling + Prefetch - input row iteration"):
    discard
  do:
    {.emit: """
    #pragma omp parallel for collapse(2)
    for (int j = 0; j < `N`; j+=`blck`)
      for (int i = 0; i < `M`; i+=`blck`) {
        for (int jj = j; jj<j+`blck` && jj<`N`; jj++)
          #pragma omp simd
          for (int ii = i; ii<min(i+`blck`,`M`); ii++)
            `po`[ii+jj*`M`] = `pa`[jj+ii*`N`];
        __builtin_prefetch(&`pa`[(i+1)*`N`], 0, 1); // Prefetch read with low temporal locality
      }
    """.}
```

Most of the low hanging fruits to accelerate convolution are similar to those for accelerating transposition, so I suggest you try to understand those benchmarks and techniques, i.e.:

* 2D tiling
* vectorization
* parallelization of the outermost loop

Beyond the convolution research I already linked in the previous post, if you really want to go down the rabbit hole, additional reads are:

* automatic loop nest scheduling: [https://github.com/numforge/laser/blob/master/research/automatic_loop_nest_scheduling.md](https://github.com/numforge/laser/blob/master/research/automatic_loop_nest_scheduling.md)
* matrix multiplication: [https://github.com/numforge/laser/blob/master/research/matrix_multiplication_optimisation_resources.md](https://github.com/numforge/laser/blob/master/research/matrix_multiplication_optimisation_resources.md)
Though for matrix multiplication, basically the tricks are:

> * 6 layers of tiling, 2 for each of the M, N, K dimensions (since we do `M-by-K * K-by-N -> M-by-N` matrices).
> * Repack the data to a temporary storage and change the data layout so that all tile accesses are contiguous on the M and N dimensions (i.e. transpose so that we access K-by-M and K-by-N).
> * Extra tiling for vectorization, and careful tracking of low-level register usage.
> * All the tiling is carefully tuned to properly use the L1 and L2 caches and also limit TLB faults.
> * Prefetching.
> * In actual code: [https://github.com/numforge/laser/blob/d1e6ae61/laser/primitives/matrix_multiplication/gemm_tiling.nim#L284-L305](https://github.com/numforge/laser/blob/d1e6ae61/laser/primitives/matrix_multiplication/gemm_tiling.nim#L284-L305)
