Nim provides an easy way to optimize expressions at compile time via 
[term-rewriting templates 
(macros)](https://nim-lang.org/docs/manual_experimental.html#term-rewriting-macros), 
which can rewrite a*x + b into a fused multiply-add, exp(x) - 1 into 
expm1, or ln(x + 1) into ln1p (log1p in <math.h>). They only fire when the 
whole expression appears in a single statement though, so the techniques you 
used in C++ are probably more general.
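A minimal sketch of such a rewrite rule, assuming a hypothetical ln1p binding to C's log1p (which is not in Nim's standard math module):

```nim
import math

# Assumed binding: log1p from <math.h>; not part of std math.
proc ln1p(x: float): float {.importc: "log1p", header: "<math.h>".}

# Term-rewriting template: whenever the compiler sees the pattern
# `ln(x + 1.0)` in a single expression, it substitutes ln1p(x),
# which avoids the catastrophic cancellation in `x + 1.0` for tiny x.
template optLn1p{ln(x + 1.0)}(x: float): float = ln1p(x)

echo ln(1e-10 + 1.0)  # rewritten at compile time into ln1p(1e-10)
```

The pattern only matches when the whole `ln(x + 1.0)` shape appears in one expression; computing `x + 1.0` on a separate line defeats the rewrite.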

In terms of needs, I only use reverse-mode autodiff, so there is no overlap 
here. Reverse-mode automatic differentiation was actually how I started with 
Nim 
([https://github.com/mratsim/nim-rmad](https://github.com/mratsim/nim-rmad)), 
after doing some Project Euler with the language.

For collaboration you might want to talk to Hugo Granstrom, who wrote ODE 
solvers in Nim 
([https://github.com/HugoGranstrom/numericalnim](https://github.com/HugoGranstrom/numericalnim)); 
I'm mostly working on machine learning, so I only need reverse-mode 
autodiff. Given the growing number of scientists using Nim (a couple are 
in bio as well), an IRC/Gitter/Discord/Matrix (whatever is popular) 
channel for Nim/science might be interesting.

Regarding what I'm planning:

Laser started as research into improving the 
[Arraymancer](https://github.com/mratsim/Arraymancer) backend. I have a couple 
of issues with the Arraymancer backend:

  * Implementing fast algorithms is tedious, for example, in increasing order of complexity:
    * 2D Softmax
    * Convolution
    * Recurrent neural networks

The reasons are multiple:

    * min/max reductions are slow, both because of 
[min/max itself](https://github.com/nim-lang/Nim/issues/9514) and because 
reductions are [slow when using only a single 
accumulator](https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/benchmarks/fp_reduction_latency/reduction_bench.nim#L316-L331).
    * sometimes it's because I needed to [loop over more than 3 tensors at the 
same time, sometimes strided as a result of a 
slice](https://github.com/mratsim/Arraymancer/blob/bde79d2f73b71ece719526a7b39f03bb100784b0/src/tensor/private/p_accessors.nim#L202-L208),
    * sometimes it's because [OpenMP doesn't support nested 
loops](https://github.com/nim-lang/RFCs/issues/160),
    * sometimes it's because the [exp and log in <math.h> can be sped up by 
10x](https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/benchmarks/vector_math/bench_exp_avx512.nim#L298-L340), 
and those are a huge bottleneck in natural language processing.
    * And every time, it's because SIMD intrinsics and CPU feature 
autodetection are needed for maximum performance, plus prefetching, [tiling 
the loops to fit in L1 and L2 cache, and all that 
jazz](https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/laser/primitives/matrix_multiplication/gemm_tiling.nim#L284-L305).
  * Once it's done on CPU, I need to do the same on CUDA, AMD ROCm, OpenCL, 
Metal (maybe), and probably Vulkan Compute.
  * Then I need to implement their gradients, which are hard to derive in the 
first place and then need the same optimizations.
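To illustrate the single-accumulator point above, here is a sketch (not Laser's actual code): a naive sum chains every addition on the previous one, so the loop runs at floating-point add latency, while independent accumulators let the CPU pipeline the additions.

```nim
# Sketch: single- vs multi-accumulator reduction (not Laser's code).
# sumSingle has a loop-carried dependency: each add waits on the
# previous result. sumMulti keeps 4 independent chains, which the
# out-of-order core can execute in parallel.

proc sumSingle(x: openArray[float]): float =
  for v in x:
    result += v

proc sumMulti(x: openArray[float]): float =
  var acc: array[4, float]
  var i = 0
  while i + 4 <= x.len:        # main loop: 4 independent chains
    for k in 0 ..< 4:
      acc[k] += x[i + k]
    i += 4
  while i < x.len:             # tail elements
    acc[0] += x[i]
    inc i
  result = acc[0] + acc[1] + acc[2] + acc[3]

echo sumMulti([1.0, 2.0, 3.0, 4.0, 5.0])  # 15.0
```

Note that the two versions can differ in the last bits since floating-point addition is not associative; that reordering is exactly what buys the throughput.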



So I started Laser with small goals:

  * fast reductions
  * fast parallel iteration/reduction over a variadic number of tensors, 
potentially strided.
  * fast transcendental functions



But then I decided to write a DSL + compiler, inspired by the [Halide 
DSL](https://halide-lang.org/) for image processing and its [gradient 
Halide](https://people.csail.mit.edu/tzumao/gradient_halide/) extension for 
auto-differentiation:

  * Having a simple DSL to express operations on multi-dimensional arrays
  * with multiple backends
  * composable at a high level, where composition doesn't prevent loop fusion 
or create temporaries.
  * the DSL has 2 parts: the algorithm, say C[i, j] = A[i, k] * B[k, j] 
(matrix multiplication), and the schedule: what is vectorized, what is cached, 
what is parallelized, what is tiled. The schedule can be adapted depending on 
CPU or GPU. This is important because optimizing compilers don't know how to 
optimize numerical workloads.
  * the algorithm can be differentiated.
  * it works at compile time (Nim macros, or maybe a Nim compiler plugin) and 
at runtime (via the LLVM JIT API).
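A plain-Nim analogy for the algorithm/schedule split (this is not Lux syntax, just an illustration): the same algorithm, C[i, j] += A[i, k] * B[k, j], under two different schedules.

```nim
# Illustration only, not Lux syntax. Matrices are row-major flat
# seqs of size n x n. Both procs compute the same algorithm; only
# the "schedule" (loop order / tiling) differs.

proc gemmNaive(C: var seq[float], A, B: seq[float], n: int) =
  # Algorithm as written: C[i,j] += A[i,k] * B[k,j]
  for i in 0 ..< n:
    for j in 0 ..< n:
      for k in 0 ..< n:
        C[i*n + j] += A[i*n + k] * B[k*n + j]

proc gemmTiled(C: var seq[float], A, B: seq[float], n: int) =
  # One possible schedule: tile the k loop and make j innermost,
  # so B is streamed along rows (cache-friendly). The arithmetic
  # is unchanged; only traversal order differs.
  const tile = 64
  for kk in countup(0, n - 1, tile):
    let kEnd = min(kk + tile, n)
    for i in 0 ..< n:
      for k in kk ..< kEnd:
        let a = A[i*n + k]
        for j in 0 ..< n:
          C[i*n + j] += a * B[k*n + j]
```

In a Halide-style DSL the algorithm line is written once and the tiling/parallelization decisions are expressed separately, so the same algorithm can get a different schedule per target.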



In the Lux repo there are several documents that go into the details of:

  * [Overview of 
Lux](https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/laser/lux_compiler/core/README.md)
  * [Challenges I want to 
tackle](https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/laser/lux_compiler/core/challenges.md)

