> (* = For example, these benchmarks:
>
> [https://chapel-lang.org/perf/chapcs/?startdate=2019/12/14&enddate=2020/06/15&graphs=emptytaskspawntimings500000xmaxtaskpar](https://chapel-lang.org/perf/chapcs/?startdate=2019/12/14&enddate=2020/06/15&graphs=emptytaskspawntimings500000xmaxtaskpar)
>
> demonstrate Chapel creating 12,000,000 tasks (500,000 x 24 cores) in 4.4
> seconds, compared to 4.1 seconds for GNU OpenMP).
GNU OpenMP shouldn't be used as a comparison for runtime overhead. For example, on fibonacci(40), which is for all intents and purposes a recursive spawn of about 331 million empty tasks (2·fib(41)−1 recursive calls, on the order of 2^28), I have the following results on my i9-9980XE (18 cores):

* GNU OpenMP chokes for at least an hour
* Clang/Intel OpenMP finishes in 2 seconds (they share the same LLVM/Intel OpenMP runtime, libomp)
* Intel TBB finishes in 1 second or 0.6 seconds depending on whether you use the class or closure API
* LLVM-Cilk takes 0.4~0.5 seconds
* Julia 1.3 takes 4 seconds on fib(31) and far too long on fib(40) ([https://github.com/JuliaLang/julia/issues/33762](https://github.com/JuliaLang/julia/issues/33762))
* HPX takes 8 seconds on fib(22) and segfaults beyond that ([https://github.com/STEllAR-GROUP/hpx/issues/4227](https://github.com/STEllAR-GROUP/hpx/issues/4227))

For less established runtimes:

* Weave, in pure Nim, takes 0.35 seconds with heap futures or 0.2 seconds with alloca futures
* Fibril is at 0.12 seconds (from Yang and Mellor-Crummey, "A Practical Solution to the Cactus Stack Problem", [https://dl.acm.org/doi/pdf/10.1145/2935764.2935787](https://dl.acm.org/doi/pdf/10.1145/2935764.2935787), [https://github.com/chaoran/fibril](https://github.com/chaoran/fibril))
* Staccato is at 0.18 seconds (from [https://github.com/rkuchumov/staccato](https://github.com/rkuchumov/staccato))

12,000,000 tasks is on the order of 2^23. Taking TBB as a baseline (roughly 2^28 tasks in about 1 second, i.e. a few nanoseconds per task), Chapel and GNU OpenMP need 4.4/4.1 seconds while creating 2^5x (32x) fewer tasks, which puts their per-task overhead roughly two orders of magnitude higher.
This overhead was already a criticism of both HPX and GCC OpenMP in a 2012 report from Sandia National Laboratories (who develop the Kokkos runtime): [https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2012/1210594.pdf](https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2012/1210594.pdf)

My own overhead benchmarks across various runtimes are here: [https://github.com/mratsim/weave/tree/9f0c384f/benchmarks/fibonacci](https://github.com/mratsim/weave/tree/9f0c384f/benchmarks/fibonacci)

The depth-first search benchmark is a similarly overhead-bound benchmark: [https://github.com/mratsim/weave/tree/9f0c384f/benchmarks/dfs](https://github.com/mratsim/weave/tree/9f0c384f/benchmarks/dfs)

I've also categorized benchmarks by the kind of difficulties they impose on a multithreading runtime: [https://github.com/mratsim/weave/tree/master/benchmarks#weave-parallel-benchmark-suite](https://github.com/mratsim/weave/tree/master/benchmarks#weave-parallel-benchmark-suite)

* Overhead & extreme overhead (fibonacci, i.e. creating hundreds of millions of near-empty tasks as fast as possible)
* Load balancing & extreme load balancing (a bouncing producer-consumer where cores alternate between producing and consuming tasks; this stresses the task scheduler and also the memory allocator, since freeing memory allocated by another thread is costly: each thread usually has a thread-local memory arena, producers end up with depleted arenas, and the allocator degenerates into mmap plus overhead)
* Nested parallelism
* Conditional parallelism
