> I am aware of this. I am just scaffolding the library, just to avoid doing 
> something fundamentally wrong (like reaching 80fps instead of >1000fps ;oP). 
> In the future, the library should be more general. In this case, I take 
> advantage of the fact that it is always 1 byte per sample. To be honest I 
> don't know why it is using float in the C++ version. The API states that 
> both the read and write pointers are uint8. Doing floating point vectors is 
> something that I should try in the future.

I'm pretty sure it's supposed to be `void *`, and `sampleType` and 
`bitsPerSample` tell you what to cast it to.
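
For illustration, a minimal Nim sketch of that dispatch (the proc and 
parameter names are hypothetical, not the actual API):
    
    proc firstSample(p: pointer; isInteger: bool; bitsPerSample: int): float =
      # Cast the void* plane pointer according to the frame's sample format
      if isInteger and bitsPerSample == 8:
        result = float(cast[ptr UncheckedArray[uint8]](p)[0])
      elif isInteger and bitsPerSample == 16:
        result = float(cast[ptr UncheckedArray[uint16]](p)[0])
      elif not isInteger and bitsPerSample == 32:
        result = float(cast[ptr UncheckedArray[float32]](p)[0])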

Anyway, very nice to hear about your improvements.

Arraymancer indexing via `t[i, j]` would be slow because:

  1. You are copying the data to the tensor. I have a PR to allow zero-copy 
tensors over raw pointers, but it is still pending (a sketch of such a view 
is shown after this list).
  2. Arraymancer has slightly more overhead on iteration than you might 
expect, because it handles N-dimensional tensors, not just (Plane, Height, 
Width) images. It can handle (Frame, Plane, Depth, Height, Width) for a 3D 
video, for example. There are iterators for the simple cases, but 
convolution is something I'm leaving for a proper compiler.
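
For reference, a hedged sketch of what such a zero-copy view over a raw 
pointer can look like in Nim today (the type and proc names are 
hypothetical, this is not the pending PR's API):
    
    type PlaneView = object
      data: ptr UncheckedArray[uint8]  # borrowed pointer, not owned
      height, width, stride: int
    
    proc `[]`(v: PlaneView; i, j: int): uint8 {.inline.} =
      # Direct strided indexing, no copy of the underlying frame
      v.data[i * v.stride + j]
    
    proc toView(p: pointer; h, w, stride: int): PlaneView =
      PlaneView(data: cast[ptr UncheckedArray[uint8]](p),
                height: h, width: w, stride: stride)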



I.e., hopefully in the future the way to specify a convolution will be similar 
to Halide 
([https://halide-lang.org/assets/lectures/Halide_CVPR_intro.pdf](https://halide-lang.org/assets/lectures/Halide_CVPR_intro.pdf)), 
and that would lead to optimal code:
    
    
    proc convolve(A, F: Function): Function =
      # Generator for the convolution function
      var i, j, ii, jj: Domain
      var convH, convW: Function
      
      # Definition of the result function
      # Filter F = [1, 2, 1]
      # So the 2D application is:
      # [[1, 2, 1],
      #  [2, 4, 2],
      #  [1, 2, 1]]
      #
      convH[i, j] = A[i-1, j] * F[0] + A[i, j] * F[1] + A[i+1, j] * F[2]
      convW[i, j] = convH[i, j-1] * F[0] + convH[i, j] * F[1] + convH[i, j+1] * F[2]
      
      # Optimization of the definition - completely optional
      convH.compute_at(convW, j) # Compute the required region of convH just
                                 # before convW starts (i.e. merged)
      # convH.compute_root()     # Alternatively do the height convolve before
                                 # the width convolve (i.e. 2 steps)
      
      convW.tile(i, ii, 64)      # Split i in tiles of size 64, with ii as the
                                 # intra-tile iterator
      convW.parallel(i)          # Iterating on tiles is done in parallel
      convW.tile(j, jj, 64)      # Split j in tiles of size 64 as well, with jj
                                 # as the intra-tile iterator
      convW.vectorize(jj, 8)     # Iterating on jj is done with size-8 vectors
                                 # if possible (i.e. AVX on x86)
      
      return convW
    
    generate convolve:
      proc foobar(a: Tensor[float32], filter: array[3, float32]): Tensor[float32]
    
    

(From Lux: 
[https://github.com/numforge/laser/blob/d1e6ae61/laser/lux_compiler/lux_dsl.nim#L43-L58](https://github.com/numforge/laser/blob/d1e6ae61/laser/lux_compiler/lux_dsl.nim#L43-L58))
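
Note the design choice this mirrors from Halide: the algorithm (what each 
pixel is) is written once, and the schedule (`compute_at`, `tile`, 
`parallel`, `vectorize`) is a separate, optional layer, so the same 
definition can be retuned for a different CPU without touching the math.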

The importance of loop tiling (also called loop blocking) is highlighted in 
this simple transposition benchmark: 
[https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/transpose/transpose_bench.nim](https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/transpose/transpose_bench.nim)

There is a 3x factor between the simple parallel transposition below (`||` is 
the OpenMP parallel for loop) and the tiled versions that follow:
    
    
    proc benchNaive(a: seq[float32], nb_samples: int) =
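      # Note: M, N and out_size come from the surrounding benchmark file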
      var output = newSeq[float32](out_size)
      withCompilerOptimHints()
      
      let pa{.restrict.} = cast[ptr UncheckedArray[float32]](a[0].unsafeAddr)
      let po{.restrict.} = cast[ptr UncheckedArray[float32]](output[0].addr)
      
      bench("Naive transpose"):
        discard
      do:
        for i in `||`(0, M-1):
          for j in `||`(0, N-1, "simd"): # This only adds "#pragma omp simd"
            po[i+j*M] = pa[j+i*N]
    
    
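The naive version reads `pa[j+i*N]` contiguously, but each write to 
`po[i+j*M]` jumps by `M` floats, i.e. it touches a different cache line on 
every iteration; tiling bounds the working set so reads and writes both stay 
in cache.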

1D tiling:
    
    
    proc benchCacheBlocking(a: seq[float32], nb_samples: int) =
      var output = newSeq[float32](out_size)
      withCompilerOptimHints()
      
      let pa{.restrict.} = a[0].unsafeAddr
      let po{.restrict.} = output[0].addr
      const blck = 64
      
      bench("Cache blocking"):
        discard
      do:
        {.emit: """
        // No min function in C ...
        #define min(a,b) (((a)<(b))?(a):(b))
        
        #pragma omp parallel for
        for (int i = 0; i < `M`; i+=`blck`)
          for (int j = 0; j < `N`; ++j)
            #pragma omp simd
            for (int ii = i; ii < min(i+`blck`,`M`); ++ii)
              `po`[ii+j*`M`] = `pa`[j+ii*`N`];
        """.}
    
    

And 2D tiling:
    
    
    proc bench2Dtiling(a: seq[float32], nb_samples: int) =
      var output = newSeq[float32](out_size)
      withCompilerOptimHints()
      
      let pa{.restrict.} = a[0].unsafeAddr
      let po{.restrict.} = output[0].addr
      const blck = 128
      
      bench("2D Tiling"):
        discard
      do:
        {.emit: """
        // No min function in C ...
        #define min(a,b) (((a)<(b))?(a):(b))
        
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < `M`; i+=`blck`)
          for (int j = 0; j < `N`; j+=`blck`)
            for (int ii = i; ii<i+`blck` && ii<`M`; ii++)
              #pragma omp simd
              for (int jj = j; jj<min(j+`blck`,`N`); jj++)
                `po`[ii+jj*`M`] = `pa`[jj+ii*`N`];
        """.}
    
    

And you can gain about 20% performance with prefetching:
    
    
    proc bench2DtilingExchangedPrefetch(a: seq[float32], nb_samples: int) =
      var output = newSeq[float32](out_size)
      withCompilerOptimHints()
      
      let pa{.restrict.} = a[0].unsafeAddr
      let po{.restrict.} = output[0].addr
      const blck = 32
      
      bench("2D Tiling + Prefetch - input row iteration"):
        discard
      do:
        {.emit: """
        #pragma omp parallel for collapse(2)
        for (int j = 0; j < `N`; j+=`blck`)
          for (int i = 0; i < `M`; i+=`blck`){
            for (int jj = j; jj<j+`blck` && jj<`N`; jj++)
              #pragma omp simd
              for (int ii = i; ii<min(i+`blck`,`M`); ii++)
                `po`[ii+jj*`M`] = `pa`[jj+ii*`N`];
            // Prefetch the next input row for reading, low temporal locality
            __builtin_prefetch(&`pa`[(i+1)*`N`], 0, 1);
          }
        """.}
    
    
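For reference, GCC's `__builtin_prefetch(addr, rw, locality)` takes the 
address to prefetch, 0 for a read (1 for a write), and a temporal-locality 
hint from 0 to 3; locality 1 here says the data will be used briefly, so it 
should not displace hotter cache lines.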

Most of the low-hanging fruit for accelerating convolution is the same as for 
accelerating transposition, so I suggest you try to understand those 
benchmarks and techniques, i.e. (a sketch follows the list):

  * 2D Tiling
  * vectorization
  * parallelization of outermost loop
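
To make that concrete, here is a hedged sketch, in plain Nim rather than the 
benchmark's `{.emit.}` blocks, of the three techniques applied to a 
horizontal 3-tap convolution pass (the proc name and tile size are 
illustrative; compile with `--passC:-fopenmp --passL:-fopenmp`):
    
    proc convolveH(src: seq[float32]; dst: var seq[float32];
                   H, W: int; k: array[3, float32]) =
      # Horizontal 3-tap convolution, interior columns only
      const blck = 64
      let nTiles = (H + blck - 1) div blck
      # parallelization of the outermost loop: one strip of rows per thread
      for t in `||`(0, nTiles - 1):
        # 2D tiling: walk the strip tile by tile so src and dst stay in cache
        for j0 in countup(0, W - 1, blck):
          for i in t*blck ..< min((t+1)*blck, H):
            # vectorization: contiguous inner loop, emits "#pragma omp simd"
            for j in `||`(max(j0, 1), min(j0 + blck - 1, W - 2), "simd"):
              dst[i*W + j] = src[i*W + j-1]*k[0] +
                             src[i*W + j]*k[1] +
                             src[i*W + j+1]*k[2]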



Beyond the convolution research I already linked in the previous post, if you 
really want to go down the rabbit hole, additional reads are:

  * automatic loop nest scheduling: 
[https://github.com/numforge/laser/blob/master/research/automatic_loop_nest_scheduling.md](https://github.com/numforge/laser/blob/master/research/automatic_loop_nest_scheduling.md)
  * matrix multiplication: 
[https://github.com/numforge/laser/blob/master/research/matrix_multiplication_optimisation_resources.md](https://github.com/numforge/laser/blob/master/research/matrix_multiplication_optimisation_resources.md)

Though for matrix multiplication, basically the tricks are (a reduced 
loop-nest sketch follows the list):

>     * 6 layers of tiling, 2 for each of the M, N, K dimensions (since we 
>       multiply an M-by-K matrix by a K-by-N matrix into an M-by-N result).
>     * repack the data to temporary storage and change the data layout so 
>       that all tile accesses are contiguous on the M and N dimensions (i.e. 
>       transpose so that we access K-by-M and K-by-N).
>     * extra tiling for vectorization, and careful tracking of low-level 
>       register usage.
>     * all the tiling is carefully tuned to properly use the L1 and L2 
>       caches and also limit TLB misses.
>     * prefetching.
>     * in actual code: 
> [https://github.com/numforge/laser/blob/d1e6ae61/laser/primitives/matrix_multiplication/gemm_tiling.nim#L284-L305](https://github.com/numforge/laser/blob/d1e6ae61/laser/primitives/matrix_multiplication/gemm_tiling.nim#L284-L305)

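As a rough illustration of those points (this is not the actual laser code 
linked above; packing and the SIMD micro-kernel are reduced to naive loops, 
and the tile sizes are placeholders that real code tunes per CPU), the loop 
nest looks like:
    
    const
      NC = 128   # tile N: sizes a panel of B
      KC = 256   # tile K: keeps the packed panels cache-resident
      MC = 64    # tile M: sizes a panel of A kept in L2
      NR = 4     # register tile on N (micro-kernel width)
      MR = 4     # register tile on M (micro-kernel height)
    
    proc gemmSketch(A, B: seq[float32]; C: var seq[float32]; M, N, K: int) =
      # C[M,N] += A[M,K] * B[K,N], row-major
      for jc in countup(0, N - 1, NC):       # tile N for a panel of B
        for pc in countup(0, K - 1, KC):     # tile K; real code repacks A/B here
          for ic in countup(0, M - 1, MC):   # tile M for a panel of A
            for jr in countup(jc, min(jc + NC, N) - 1, NR):    # register tile on N
              for ir in countup(ic, min(ic + MC, M) - 1, MR):  # register tile on M
                # micro-kernel: an MRxNR block of C, held in registers and
                # vectorized over NR in a real implementation
                for i in ir ..< min(ir + MR, M):
                  for j in jr ..< min(jr + NR, N):
                    var acc = C[i*N + j]
                    for p in pc ..< min(pc + KC, K):
                      acc += A[i*K + p] * B[p*N + j]
                    C[i*N + j] = acc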

