Nim provides an easy way to optimize compile-time expressions via [term-rewriting templates (macros)](https://nim-lang.org/docs/manual_experimental.html#term-rewriting-macros), which can be used to rewrite `a*x + b` into a fused multiply-add, `exp(x) - 1` into `expm1`, or `ln(x + 1)` into `ln1p` (`log1p` in `<math.h>`). They only fire when the whole pattern appears in a single expression though, so the techniques you used in C++ are probably more general.
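To make this concrete, here is a minimal sketch of such a rewrite rule (the names `fmadd` and `fuseMulAdd` are my own for illustration; a real version would `importc` `fma` from `<math.h>` or use an intrinsic):

```nim
# Plain Nim stand-in for a real FMA intrinsic (assumption, for illustration).
# Defined *before* the rewrite rule below, so its body is not itself
# rewritten: term-rewriting templates only apply to code that follows them.
func fmadd(a, x, b: float64): float64 {.inline.} =
  a * x + b

# Term-rewriting template: any float64 expression matching `a * x + b`
# is replaced by a call to fmadd at compile time.
template fuseMulAdd{a * x + b}(a, x, b: float64): float64 =
  fmadd(a, x, b)

let (a, x, b) = (2.0, 3.0, 1.0)
echo a * x + b    # single expression: rewritten to fmadd(a, x, b)

# The single-expression limitation: split over two statements, no rewrite.
let partial = a * x
echo partial + b  # compiled as a separate multiply and add
```

Both `echo`s print `7.0`; only the first goes through `fmadd`, which is exactly why a rewrite spread over several statements can't be fused this way.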
In terms of need, I only use reverse-mode autodiff, so there is no overlap here. Reverse-mode autodifferentiation was actually how I started with Nim ([https://github.com/mratsim/nim-rmad](https://github.com/mratsim/nim-rmad)), after doing some Project Euler with it. For collaboration you might want to talk to Hugo Granstrom, who wrote some ODE solvers in Nim ([https://github.com/HugoGranstrom/numericalnim](https://github.com/HugoGranstrom/numericalnim)); I'm mostly working on machine learning, so I only need reverse-mode autodiff. Given the growing number of scientists in Nim (a couple are in bio as well), maybe an IRC/Gitter/Discord/Matrix channel (whatever is popular) for Nim/science might be interesting.

Regarding what I'm planning: Laser started as research into improving the [Arraymancer](https://github.com/mratsim/Arraymancer) backend. I have a couple of issues with the Arraymancer backend.

Implementing fast algorithms is tedious, for example, in order of complexity:

* 2D softmax
* Convolution
* Recurrent neural networks

The reasons are multiple:

* min/max reductions are slow, because of [min/max](https://github.com/nim-lang/Nim/issues/9514) but also because reductions are [slow if only using a single accumulator](https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/benchmarks/fp_reduction_latency/reduction_bench.nim#L316-L331).
* Sometimes it's because I needed to [loop over more than 3 tensors at the same time, sometimes strided as a result of a slice](https://github.com/mratsim/Arraymancer/blob/bde79d2f73b71ece719526a7b39f03bb100784b0/src/tensor/private/p_accessors.nim#L202-L208).
* Sometimes it's because [OpenMP doesn't support nested loops](https://github.com/nim-lang/RFCs/issues/160).
* Sometimes it's because the [exp and log in <math.h> can be sped up by 10x](https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/benchmarks/vector_math/bench_exp_avx512.nim#L298-L340), and those are a huge bottleneck in natural language processing.
* And all the time, it's because SIMD intrinsics and CPU autodetection are needed for maximum performance, plus prefetching, [tiling the loops to fit in the L1 and L2 caches, and all that jazz](https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/laser/primitives/matrix_multiplication/gemm_tiling.nim#L284-L305).
* Once it's done on CPU, I need to do the same on CUDA, AMD ROCm, OpenCL, Metal (maybe) and probably Vulkan Compute.
* Then I need to implement their gradients, which are first hard to derive and then need the same optimizations.

So I started Laser with small goals:

* fast reductions
* fast parallel iterations/reductions over a variadic number of tensors, potentially strided
* fast transcendental functions

But then I decided to write a DSL + compiler, inspired by the [Halide DSL](https://halide-lang.org/) for image processing and its [gradient Halide](https://people.csail.mit.edu/tzumao/gradient_halide/) extension for auto-differentiation:

* a simple DSL to express operations on multi-dimensional arrays
* with multiple backends
* composable at a high level, where composition doesn't hurt loop fusion or create temporaries
* the DSL has 2 parts: the algorithm, say `C[i, j] = A[i, k] * B[k, j]` (matrix multiplication), and the schedule: what is vectorized, what is cached, what is parallel, what is tiled. The schedule can be adapted depending on CPU or GPU. This is important because optimizing compilers don't know how to optimize numerical workloads.
* the algorithm can be differentiated
* it works at compile time (Nim macros, or maybe a Nim compiler plugin) and at runtime (via the LLVM JIT API)

In the Lux repo there are several documents that go into the details:

* [Overview of Lux](https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/laser/lux_compiler/core/README.md)
* [Challenges I want to tackle](https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/laser/lux_compiler/core/challenges.md)
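On the single-accumulator point: a minimal sketch (my own names, not Laser's actual code) of why splitting a sum over independent accumulators helps. One accumulator serializes every add behind the previous one's floating-point latency; several independent accumulators let those adds overlap in the pipeline:

```nim
# Naive reduction: one accumulator, every += waits for the previous one.
func sumNaive(x: openArray[float32]): float32 =
  for v in x:
    result += v

# Same sum with 4 independent accumulators so the CPU can keep
# several floating-point adds in flight at once.
func sumMulti(x: openArray[float32]): float32 =
  var acc: array[4, float32]   # zero-initialized by default
  var i = 0
  while i + 4 <= x.len:        # main loop: 4 independent dependency chains
    for k in 0 ..< 4:
      acc[k] += x[i + k]
    i += 4
  while i < x.len:             # leftover tail elements
    acc[0] += x[i]
    inc i
  result = (acc[0] + acc[1]) + (acc[2] + acc[3])

let data = [1'f32, 2, 3, 4, 5, 6, 7, 8, 9]
echo sumNaive(data)   # 45.0
echo sumMulti(data)   # 45.0, same value, shorter critical path
```

Note the two versions may round differently on general inputs, since they associate the additions differently; that reassociation is precisely what a C compiler refuses to do for you without `-ffast-math`.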
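To make the algorithm/schedule split concrete, here is a hypothetical Halide-style pseudocode sketch (this is not actual Lux syntax, just an illustration of the idea):

```
# Algorithm: *what* is computed — a pure, index-based definition
C[i, j] = sum over k of A[i, k] * B[k, j]

# Schedule: *how* it is computed — swappable per target
on CPU:
  tile (i, j) into blocks        # sized to fit the L1/L2 caches
  parallelize the outer loop     # one block of tiles per core
  vectorize the inner j loop     # e.g. 8-wide SIMD on float32
on GPU:
  map (i, j) onto the thread grid
  stage tiles of A and B in shared memory
```

The point is that the first part is written once and can be differentiated, while the second part encodes all the machine-specific tuning and can be swapped without touching the math.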
