> (* = For example, these benchmarks:
> 
> [https://chapel-lang.org/perf/chapcs/?startdate=2019/12/14&enddate=2020/06/15&graphs=emptytaskspawntimings500000xmaxtaskpar](https://chapel-lang.org/perf/chapcs/?startdate=2019/12/14&enddate=2020/06/15&graphs=emptytaskspawntimings500000xmaxtaskpar)
> 
> demonstrate Chapel creating 12,000,000 tasks (500,000 x 24 cores) in 4.4 
> seconds, compared to 4.1 seconds for GNU OpenMP).

GNU OpenMP shouldn't be used as a comparison for runtime overhead.

For example, on fibonacci(40), which is for all intents and purposes a recursive 
spawn of 2^40 empty tasks, I have the following results on my 18-core 
i9-9980XE:

  * GNU OpenMP chokes for at least an hour
  * Clang/Intel OpenMP finishes in 2 seconds (they share the same Intel-derived libomp runtime)
  * Intel TBB finishes in 1 second or 0.6 seconds, depending on whether you use 
the class-based or the lambda (closure) API
  * LLVM-Cilk takes 0.4-0.5 seconds
  * Julia 1.3 takes 4 seconds on fib(31) and far too long on fib(40) 
([https://github.com/JuliaLang/julia/issues/33762](https://github.com/JuliaLang/julia/issues/33762))
  * HPX takes 8 seconds on fib(22) and segfaults beyond that 
([https://github.com/STEllAR-GROUP/hpx/issues/4227](https://github.com/STEllAR-GROUP/hpx/issues/4227))



For less established runtimes:

  * Weave, in pure Nim, uses 0.35 seconds with heap futures or 0.2s with alloca 
futures
  * Fibril is at 0.12s (from Yang and Mellor-Crummey, A Practical Solution to 
the Cactus Stack Problem, 
[https://dl.acm.org/doi/pdf/10.1145/2935764.2935787](https://dl.acm.org/doi/pdf/10.1145/2935764.2935787),
 [https://github.com/chaoran/fibril](https://github.com/chaoran/fibril))
  * Staccato is at 0.18s (from 
[https://github.com/rkuchumov/staccato](https://github.com/rkuchumov/staccato))



12,000,000 is on the order of 2^23. Using TBB as a baseline, that would mean 
Chapel and GNU OpenMP have about 10 times the overhead while creating 2^17 
(131,072) times fewer tasks.

This overhead was already criticized in 2012, for both HPX and GCC OpenMP, by 
Sandia National Laboratories (the authors of the Kokkos runtime): 
[https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2012/1210594.pdf](https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2012/1210594.pdf)

My own overhead benchmarks with various runtimes are here: 
[https://github.com/mratsim/weave/tree/9f0c384f/benchmarks/fibonacci](https://github.com/mratsim/weave/tree/9f0c384f/benchmarks/fibonacci)

The depth first search benchmark is a similar overhead-bound benchmark: 
[https://github.com/mratsim/weave/tree/9f0c384f/benchmarks/dfs](https://github.com/mratsim/weave/tree/9f0c384f/benchmarks/dfs)

I've also categorized benchmarks depending on what kind of difficulties they 
impose on the multithreading runtime: 
[https://github.com/mratsim/weave/tree/master/benchmarks#weave-parallel-benchmark-suite](https://github.com/mratsim/weave/tree/master/benchmarks#weave-parallel-benchmark-suite)

  * Overhead & Extreme overhead (fibonacci, i.e. creating on the order of a 
trillion tasks within seconds)
  * Load balancing & Extreme load balancing (a bouncing producer-consumer where 
cores alternate between producing and consuming tasks; this stresses the task 
scheduler and also the memory allocator, because freeing memory allocated by 
another thread is costly and each thread usually has a thread-local memory 
arena, so producers end up with depleted arenas and the allocator degrades into 
mmap calls with their overhead)
  * Nested parallelism
  * Conditional parallelism

