@Brad, Thanks for joining the forum and for your hints on Chapel threading performance:
> Are you able to share the benchmark you were using with us (as well as information like back-end compiler and version, processor, etc)? In our experiments, Chapel has generally shown to be competitive with OpenMP, so it would be interesting for us to understand better what you were doing (prior to resorting to a homegrown thread pool) in order to make sure nothing's going horribly awry. I'd also be curious whether you were using CHPL_TASKS=qthreads or CHPL_TASKS=fifo. Thanks.
>
> (For example, these benchmarks ... demonstrate Chapel creating 12,000,000 tasks (500,000 x 24 cores) in 4.4 seconds, compared to 4.1 seconds for GNU OpenMP).

(about 8.8/~0.2 microseconds per thread, machine unspecified - GBG)

Thanks for prompting me to check the CHPL_TASKS environment variable; it led me to re-read the [documentation section on Using Tasks](https://chapel-lang.org/docs/usingchapel/tasks.html) and discover that, because I am using Cygwin on Windows, CHPL_TASKS defaults to "fifo", which is pthreads. I had been conflating results from benchmarks run on "tio.run" (an online IDE that uses Linux, and therefore "qthreads") with results from my own machine under Cygwin (and therefore defaulting to "fifo"). Your results as posted above are in line with what I was seeing on "tio.run"/Linux: just a few microseconds of overhead per task switch, using the following benchmark program:

```chapel
// testing simple threading...
use Time;

config const NUMLOOPS = 1000;
assert(NUMLOOPS >= 0, "NUMLOOPS must be zero or above!");
config const NUMTHRDS = here.maxTaskPar;
assert(NUMTHRDS > 0, "NUMTHRDS must be at least one!");
writeln("Number of loops: ", NUMLOOPS, "; Number of threads: ", NUMTHRDS);

var rsltsq: [ 0 .. NUMTHRDS - 1 ] sync int;

proc dowrk(qi: int, v: int) {
  rsltsq[qi] = v; // thread work that doesn't do much of anything!
}

var timer: Timer;
timer.start();
// pre-load the thread pool work
for i in 0 .. NUMTHRDS - 1 {
  rsltsq[i].reset();
  begin with (const in i) dowrk(i, i);
}
var qi = 0;
var sum = 0;
for i in NUMTHRDS .. NUMLOOPS + NUMTHRDS {
  sum += rsltsq[qi];
  if i <= NUMLOOPS then begin with (const in i) dowrk(qi, i);
  qi = if qi >= NUMTHRDS - 1 then 0 else qi + 1;
}
timer.stop();
write("The sum is ", sum, " in ");
writeln(timer.elapsed(TimeUnits.milliseconds), " milliseconds!");
```

When run on [tio.run](https://tio.run/##hVPNctowEL77KTZcagZwTI845tRDD22Sackpw0HYa6xBtoxWQNNOXr10JcWYTjNTLtKutd/P7lLUokN1Pt/egkWyst0CyaZTCLY2KEpOJEkSRQdCWMkGsygqdFvJLfBBFu6fvn55eHj8DjnM0zTNIkGExsaX/DKHdAqjS9wcuGqD8BONBm1AbPQRb0bj7B/c1edvnxxujQaTRvxYCdo9CnNNEZ4se4YQ9gzCgkLBd90GgpORFlUbj@4PzQYN6AqU1h0tAEbTixOGymB4EdowvPEcDBYdhQFDytKevz1DCkkyqJ7BHNZAL20BsrX8ujO6gFKfzC7ey4VLTuHozzH8iuAN6Xkv1@z4mIGbh2eGkzY7vrObUiO1H9zJHovaqRPti615RjfRa1BkeUZm4UdlshAlZAW3izUzaGeQTTOsrfsRQ6e18jRRxQORLOodN1ci5ToxSOgguc9bfn6StoY4TI5DOX6zKqd8z@DVS9tLtpZm4O50aHwwMF7IAnFYlsmQdvyuapJf9SrjpKwY4C4fithZ@x9d@yCMq70ohuBzmf/t2OOkgIpXnz9PYJ5xl/uW6i7uVyoerbiZTpwktyV84yVyhNdbFwpRiY6wjN2AnlppKWmkUpKQRZY0dnXXCb@35/N5Nuvt5e5vxr@Q8mrzj7@LSoktca7ihf8D) (qthreads), there are no real surprises: it runs a million loops on two threads (the maximum available) on a Sandy Bridge class processor at 2.9 GHz in about 675 milliseconds, and with only one thread in a little over 1000 milliseconds. This isn't surprising, since with such trivial work (almost none), most of the time expended will be in just running the sync'ing.

However, when run on my Cygwin system (Windows, Intel Sandy Bridge class i3-2100 CPU with 2 cores/4 threads at 3.1 GHz, compiled with Chapel version 1.22 and gcc version 9.3.0) using the default 4 threads, it runs much, much slower than the above: about 3000 to 5000 milliseconds for only 10000 loops with the full four threads. That is slow, but perhaps can be attributed to pthreads being slower than qthreads (although the highly variable results are worrisome).
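As a cross-check that isolates raw task-spawn cost from the sync-array handling in the benchmark above, a probe along these lines can be used (a sketch of my own, not part of the benchmark; the loop count and the atomic counter are my choices for illustration):

```chapel
// Rough per-task spawn-overhead probe: spawn n near-empty tasks and
// divide the elapsed time by n.  The sync statement waits for all
// tasks begun within it to complete.
use Time;

config const n = 100000;
var count: atomic int;
var t: Timer;
t.start();
sync {
  for 1..n do
    begin count.add(1);  // near-empty task: cost is dominated by spawn/scheduling
}
t.stop();
writeln(count.read(), " tasks, ~",
        t.elapsed(TimeUnits.microseconds) / n, " us per task");
```

On a "qthreads" build this should report on the order of microseconds per task; on a "fifo"/pthreads build on Cygwin it should make the much larger per-task cost directly visible.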
Furthermore, the shocker is that when the number of threads is dropped to, say, only two, it can only run 1000 loops in about 2500 milliseconds (highly variable upwards from this), and with 1000 loops run on only one thread (an argument of `--NUMTHRDS=1`), it takes 16,000 milliseconds (and sometimes more). Obviously, over ten milliseconds of overhead per task start prevented me from effectively running tasks that take about one millisecond each, as the overhead per task is over ten times the actual work; that explains why I had to implement the thread pool.

This seems to indicate that you do indeed have a problem with your pthreads implementation, as I suspected before (even before I understood the difference between "qthreads" and "fifo" and which is the default where). Why is "fifo" the default with Cygwin? Are there problems compiling with "qthreads" on Cygwin?

While we are on the subject of Chapel problems I encountered (and acknowledging your quick fix of my reported issue related to initializing a shared nilable sync field for a generic record/class, thank you very much), the other distressing problem I encountered was the slowness of hash tables/associative arrays, as seen in [this submission to the Sieve of Eratosthenes on RosettaCode](https://rosettacode.org/wiki/Sieve_of_Eratosthenes#Hash_Table_Based_Odds-Only_Version) (the first version; it is fixed with a roll-your-own hash table implementation in the second version). Perhaps you would like to investigate why that was so slow and not the O(1) amortized access expected (and which my roll-your-own hash table version achieves).
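Coming back to the task-start overhead: the homegrown pool I resorted to was essentially of the following shape. This is a simplified sketch for illustration only (the names and the sentinel protocol are mine), not my actual implementation:

```chapel
// Fixed-size task pool sketch: numWorkers long-lived tasks pull work
// items from a single sync cell acting as a one-slot queue, so only
// numWorkers tasks are ever spawned regardless of the number of items.
config const numItems = 1000;
const numWorkers = here.maxTaskPar;
var queue: sync int;   // one-slot work queue: write fills it, read empties it
var total: atomic int;

// producer: feed the work items, then one stop sentinel (-1) per worker
begin {
  for i in 1..numItems do queue.writeEF(i);
  for 1..numWorkers do queue.writeEF(-1);
}

// workers: each pulls items until its sentinel arrives
coforall 1..numWorkers {
  var item = queue.readFE();
  while item != -1 {
    total.add(item);   // the "work" -- trivial here
    item = queue.readFE();
  }
}
writeln("total = ", total.read());
```

With this shape, the per-task-start cost is paid only numWorkers times up front, which is what made roughly one-millisecond work items viable for me despite the "fifo" overhead.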
The same test machine and environment were used, but the same slowness and non-O(1) performance also shows up in the ["tio.run"/Linux](https://tio.run/##hVPNctowEL77KTZcagZwTI845tRDD22Sackpw0HYa6xBtoxWQNNOXr10JcWYTjNTLtKutd/P7lLUokN1Pt/egkWyst0CyaZTCLY2KEpOJEkSRQdCWMkGsygqdFvJLfBBFu6fvn55eHj8DjnM0zTNIkGExsaX/DKHdAqjS9wcuGqD8BONBm1AbPQRb0bj7B/c1edvnxxujQaTRvxYCdo9CnNNEZ4se4YQ9gzCgkLBd90GgpORFlUbj@4PzQYN6AqU1h0tAEbTixOGymB4EdowvPEcDBYdhQFDytKevz1DCkkyqJ7BHNZAL20BsrX8ujO6gFKfzC7ey4VLTuHozzH8iuAN6Xkv1@z4mIGbh2eGkzY7vrObUiO1H9zJHovaqRPti615RjfRa1BkeUZm4UdlshAlZAW3izUzaGeQTTOsrfsRQ6e18jRRxQORLOodN1ci5ToxSOgguc9bfn6StoY4TI5DOX6zKqd8z@DVS9tLtpZm4O50aHwwMF7IAnFYlsmQdvyuapJf9SrjpKwY4C4fithZ@x9d@yCMq70ohuBzmf/t2OOkgIpXnz9PYJ5xl/uW6i7uVyoerbiZTpwktyV84yVyhNdbFwpRiY6wjN2AnlppKWmkUpKQRZY0dnXXCb@35/N5Nuvt5e5vxr@Q8mrzj7@LSoktca7ihf8D) environment. Perhaps it is an error on my part?
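For reference, the core of the pattern that was slow for me is just associative-domain inserts and removes, roughly as follows (a reduced sketch of the shape of the code, not the actual RosettaCode submission):

```chapel
// An associative domain of int used as a hash-table-backed set, with
// inserts and removals in a loop -- each operation is expected to be
// amortized O(1).
config const n = 100000;
var D: domain(int);             // associative domain, i.e. a hash table of keys
for i in 1..n do D += i;        // n inserts
for i in 1..n by -1 do D -= i;  // n removals
writeln(D.size);                // 0 once everything has been removed
```

Timing loops like these for increasing `n` is how the non-O(1) behavior showed up for me: the total time grew faster than linearly in `n`.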
