Nice username, seems like you love tea ;).
Since you're interested in scientific computing, let's pick some examples from
there; I'll use matrix transposition.
You can use OpenMP in Nim; the OpenMP operator is `||`. It's a bit rough
because in the OpenMP section you cannot use seq/strings without precautions,
and you need to disable stacktraces so that the compiler doesn't use them
there either:
```nim
proc ompNaiveTranspose(M, N: int, bufIn, bufOut: ptr UncheckedArray[float32]) =
  ## Transpose a MxN matrix into a NxM matrix
  # Writes are more expensive than reads, so we keep the i accesses linear for writes
  {.push stacktrace:off.}
  for j in 0||(N-1):
    for i in 0 ..< M:
      bufOut[j*M+i] = bufIn[i*N+j]
  {.pop.}
```
You can also emit raw C code as a fallback:
```nim
proc ompCollapsedTranspose(M, N: int, bufIn, bufOut: ptr UncheckedArray[float32]) =
  ## Transpose a MxN matrix into a NxM matrix
  # We need to go down to C level for the collapse clause.
  # This relies on the M, N, bufIn, bufOut symbols being the same in C and Nim;
  # the proper interpolation syntax is a bit busy otherwise.
  {.emit: """
  #pragma omp parallel for collapse(2)
  for (int i = 0; i < `M`; ++i)
    for (int j = 0; j < `N`; ++j)
      `bufOut`[j*M+i] = `bufIn`[i*N+j];
  """.}
```
Otherwise, for a pure Nim solution, I recommend
[Weave](https://github.com/mratsim/weave). Disclaimer: I wrote it, so I'm
obviously biased.
Matrix transpose:
```nim
proc weaveNestedTranspose(M, N: int, bufIn, bufOut: ptr UncheckedArray[float32]) =
  ## Transpose a MxN matrix into a NxM matrix with nested for loops
  parallelFor j in 0 ..< N:
    captures: {M, N, bufIn, bufOut}
    parallelFor i in 0 ..< M:
      captures: {j, M, N, bufIn, bufOut}
      bufOut[j*M+i] = bufIn[i*N+j]
```
Note that, contrary to OpenMP, Weave deals perfectly with nested loops, which
is quite useful for, say, batching already-parallel matrix multiplications.
The load balancer is also very different from every implementation of OpenMP,
TBB, or other runtime I'm aware of, and it does not suffer from OpenMP's and
TBB's grain-size woes:
Assume you need to work on a matrix of 2000 elements. Some operations like
additions or copies should not be parallelized on most CPUs, because the
bottleneck is memory, not compute, and throwing more threads at a memory
bottleneck just gives you all the disadvantages of multithreading without
the advantages.
Now assume instead you need to compute cos/sin/exponentiation; there the work
is significant, so multithreading is very valuable.
The runtime woes appear if you write a generic parallel "apply kernel"
function that could apply any of copy/addition/cos/exp:
```nim
proc apply[T](dst: var Matrix[T], src: Matrix[T], kernel: proc(x: T): T) =
  # pseudo-code
  myParallelFor i in 0 ..< src.rows:
    myParallelFor j in 0 ..< src.cols:
      dst[i, j] = kernel(src[i, j])
```
Most runtimes split the work before entering the loops, but they have no way
to know whether the kernel is compute-heavy or trivial, resulting in many
worst-case behaviors. Their workaround, if available, is to let you supply a
"grain size", but a library author would have trouble explaining and exposing
all that to users (i.e. use 1000 for addition on a Xeon Platinum, use 200 on a
laptop, but also make sure that no other application like a web browser is
running at the same time ...).
This was investigated in-depth for PyTorch here:
[https://github.com/zy97140/omp-benchmark-for-pytorch](https://github.com/zy97140/omp-benchmark-for-pytorch)
and Weave is the only runtime that addresses this in a clean and always
efficient way for library authors and end-users.
You have plenty of examples on how to use Weave (and Nim OpenMP) in the
benchmarks folder:
[https://github.com/mratsim/weave/tree/9f0c384f/benchmarks](https://github.com/mratsim/weave/tree/9f0c384f/benchmarks)
And the raytracing demo:
[https://github.com/mratsim/weave/blob/9f0c384f/demos/raytracing/smallpt.nim#L228](https://github.com/mratsim/weave/blob/9f0c384f/demos/raytracing/smallpt.nim#L228)
The benchmarks folder includes a matrix multiplication algorithm written from
scratch in Nim. Its multithreaded performance is higher than OpenBLAS on my
machine (about 2.8 TFlops vs 2.7 TFlops) even though the single-threaded
kernel is lower performance (160 GFlops vs 190 GFlops), meaning I achieve a
17.5x speedup on my 18-core machine while OpenBLAS/OpenMP is in the 14.2x
range. And this is using large matrices (1920x1920) that are easy to
distribute; matrices like 200x1920 would be hard to split with OpenMP as it
doesn't support nested parallelism.