Nice username, seems like you love tea ;).
Since you're interested in scientific computing, let's pick some examples from
there; I'll use matrix transposition.
You can use OpenMP in Nim; the OpenMP operator is `||`. It's a bit rough
because in the OpenMP section you cannot use seq/strings without precautions,
and you need to disable stacktraces so that the compiler doesn't use them
there either:
```nim
proc ompNaiveTranspose(M, N: int, bufIn, bufOut: ptr UncheckedArray[float32]) =
  ## Transpose a MxN matrix into a NxM matrix
  # Writes are more expensive than reads, so we keep the i accesses linear for writes
  {.push stacktrace:off.}
  for j in 0||(N-1):
    for i in 0 ..< M:
      bufOut[j*M+i] = bufIn[i*N+j]
  {.pop.}
```
You can also emit raw C code as a fallback:
```nim
proc ompCollapsedTranspose(M, N: int, bufIn, bufOut: ptr UncheckedArray[float32]) =
  ## Transpose a MxN matrix into a NxM matrix
  # We need to go down to C level for the collapse clause.
  # This relies on the M, N, bufIn, bufOut symbols being the same in C and Nim;
  # the proper interpolation syntax is a bit busy otherwise.
  {.emit: """
  #pragma omp parallel for collapse(2)
  for (int i = 0; i < `M`; ++i)
    for (int j = 0; j < `N`; ++j)
      `bufOut`[j*M+i] = `bufIn`[i*N+j];
  """.}
```
Otherwise, for a pure Nim solution, I recommend
[Weave](https://github.com/mratsim/weave). Disclaimer: I wrote it, so I'm
obviously biased.
Matrix transpose:
```nim
proc weaveNestedTranspose(M, N: int, bufIn, bufOut: ptr UncheckedArray[float32]) =
  ## Transpose a MxN matrix into a NxM matrix with nested for loops
  parallelFor j in 0 ..< N:
    captures: {M, N, bufIn, bufOut}
    parallelFor i in 0 ..< M:
      captures: {j, M, N, bufIn, bufOut}
      bufOut[j*M+i] = bufIn[i*N+j]
```
Note that, contrary to OpenMP, Weave deals perfectly with nested loops, which
is quite useful for, say, batching already-parallel matrix multiplications.
The load balancer is also very different from every implementation of OpenMP,
TBB, or other runtime I'm aware of, and it does not suffer from OpenMP's and
TBB's grain-size woes:
Assume you need to work on a matrix of 2000 elements. Some operations like
additions or copies should not be parallelized on most CPUs, because the
bottleneck is memory, not compute, and throwing more threads at a memory
bottleneck just gives you all the disadvantages of multithreading without
the advantages.
Now assume instead you need to compute cos/sin/exponentiation; there the work
is significant, so multithreading is very valuable.
The runtime woes appear if you write a generic parallel "apply kernel"
function that could apply any of copy/addition/cos/exp:
```nim
proc apply[T](dst: var Matrix[T], src: Matrix[T], kernel: proc(x: T): T) =
  # pseudo-code
  myParallelFor i in 0 ..< src.rows:
    myParallelFor j in 0 ..< src.cols:
      dst[i, j] = kernel(src[i, j])
```
Most runtimes split the work before entering the loops, but they have no way
to know whether the kernel is compute-heavy or trivial, resulting in many
worst-case behaviors. Their workaround, if available, is to let you supply a
"grain size", but a library author would have trouble explaining and exposing
all that to users (i.e. use 1000 for addition on a Xeon Platinum, use 200 on a
laptop, but also make sure that no other application like a web browser is
running at the same time ...).
This was investigated in-depth for PyTorch here:
[https://github.com/zy97140/omp-benchmark-for-pytorch](https://github.com/zy97140/omp-benchmark-for-pytorch)
and Weave is the only runtime that addresses this in a clean and always
efficient way for library authors and end-users.
You have plenty of examples on how to use Weave (and Nim OpenMP) in the
benchmarks folder:
[https://github.com/mratsim/weave/tree/9f0c384f/benchmarks](https://github.com/mratsim/weave/tree/9f0c384f/benchmarks)
And the raytracing demo:
[https://github.com/mratsim/weave/blob/9f0c384f/demos/raytracing/smallpt.nim#L228](https://github.com/mratsim/weave/blob/9f0c384f/demos/raytracing/smallpt.nim#L228)
The benchmarks folder includes a matrix multiplication algorithm written from
scratch in Nim. Its multithreaded performance is higher than OpenBLAS on my
machine (about 2.8 TFlops vs 2.7 TFlops) even though the single-threaded
kernel is lower performance (160 GFlops vs 190 GFlops), meaning I achieve a
17.5x speedup on my 18-core machine while OpenBLAS/OpenMP is in the 14.2x
range. And this is using large matrices (1920x1920) that are easy to
distribute; matrices like 200x1920 would be hard to split with OpenMP as it
doesn't support nested parallelism.