@Brad, thanks for joining the forum and for your hints on Chapel threading 
performance:

> Are you able to share the benchmark you were using with us (as well as 
> information like back-end compiler and version, processor, etc)? In our 
> experiments, Chapel has generally shown to be competitive with OpenMP, so it 
> would be interesting for us to understand better what you were doing (prior 
> to resorting to a homegrown thread pool) in order to make sure nothing's 
> going horribly awry. I'd also be curious whether you were using 
> CHPL_TASKS=qthreads or CHPL_TASKS=fifo. Thanks.
> 
> (For example, these benchmarks ... demonstrate Chapel creating 12,000,000 
> tasks (500,000 x 24 cores) in 4.4 seconds, compared to 4.1 seconds for GNU 
> OpenMP.) (About 8.8/~0.2 microseconds per thread, machine unspecified - GBG)

Thanks for prompting me to check the CHPL_TASKS environment variable; it led 
me to re-read the [documentation section on Using 
Tasks](https://chapel-lang.org/docs/usingchapel/tasks.html), where I found that 
because I use Cygwin on Windows, this defaults to "fifo", which is pthreads. 
That explains why I was somewhat confused: some of my benchmarks were run on 
"tio.run" (an online IDE using Linux, and therefore "qthreads") and some on my 
own machine under Cygwin, therefore defaulting to "fifo".
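For anyone else hitting the same confusion: as a sketch (assuming a standard install where the `printchplenv` script lives under `$CHPL_HOME/util`), the tasking layer can be checked and overridden via the CHPL_TASKS environment variable:

```shell
# Select the qthreads tasking layer for subsequent chpl compiles
# (the Chapel runtime must also have been built with qthreads support):
export CHPL_TASKS=qthreads
echo "CHPL_TASKS=$CHPL_TASKS"
```

`$CHPL_HOME/util/printchplenv` should report which tasking layer is actually in effect; "fifo" means the pthreads-based layer.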

Your results as posted above are in line with what I was seeing on 
"tio.run"/Linux: just a few microseconds of overhead per task switch, using 
the following benchmark program: 
    
    
    // testing simple threading...
    
    use Time;
    
    config const NUMLOOPS = 1000;
    assert(NUMLOOPS >= 0, "NUMLOOPS must be zero or above!");
    config const NUMTHRDS = here.maxTaskPar;
    assert(NUMTHRDS > 0, "NUMTHRDS must be at least one!");
    writeln("Number of loops:  ", NUMLOOPS, "; Number of threads:  ", NUMTHRDS);
    
    var rsltsq:  [ 0 .. NUMTHRDS - 1 ] sync int;
    
    proc dowrk(qi: int, v: int) {
      rsltsq[qi] = v; // thread work that doesn't do much of anything!
    }
    
    var timer: Timer; timer.start();
    // pre load the thread pool work
    for i in 0 .. NUMTHRDS - 1 {
      rsltsq[i].reset(); begin with (const in i) dowrk(i, i); }
    var qi = 0; var sum = 0;
    for i in NUMTHRDS .. NUMLOOPS + NUMTHRDS {
      sum += rsltsq[qi];
      if i <= NUMLOOPS then begin with (const in i) dowrk(qi, i);
      qi = if qi >= NUMTHRDS - 1 then 0 else qi + 1;
    }
    timer.stop();
    write("The sum is ", sum, " in ");
    writeln(timer.elapsed(TimeUnits.milliseconds), " milliseconds!");
    
    

When run on 
[tio.run](https://tio.run/##hVPNctowEL77KTZcagZwTI845tRDD22Sackpw0HYa6xBtoxWQNNOXr10JcWYTjNTLtKutd/P7lLUokN1Pt/egkWyst0CyaZTCLY2KEpOJEkSRQdCWMkGsygqdFvJLfBBFu6fvn55eHj8DjnM0zTNIkGExsaX/DKHdAqjS9wcuGqD8BONBm1AbPQRb0bj7B/c1edvnxxujQaTRvxYCdo9CnNNEZ4se4YQ9gzCgkLBd90GgpORFlUbj@4PzQYN6AqU1h0tAEbTixOGymB4EdowvPEcDBYdhQFDytKevz1DCkkyqJ7BHNZAL20BsrX8ujO6gFKfzC7ey4VLTuHozzH8iuAN6Xkv1@z4mIGbh2eGkzY7vrObUiO1H9zJHovaqRPti615RjfRa1BkeUZm4UdlshAlZAW3izUzaGeQTTOsrfsRQ6e18jRRxQORLOodN1ci5ToxSOgguc9bfn6StoY4TI5DOX6zKqd8z@DVS9tLtpZm4O50aHwwMF7IAnFYlsmQdvyuapJf9SrjpKwY4C4fithZ@x9d@yCMq70ohuBzmf/t2OOkgIpXnz9PYJ5xl/uW6i7uVyoerbiZTpwktyV84yVyhNdbFwpRiY6wjN2AnlppKWmkUpKQRZY0dnXXCb@35/N5Nuvt5e5vxr@Q8mrzj7@LSoktca7ihf8D)
 (qthreads), there are no real surprises: it runs a million loops on two 
threads (the maximum available) on a Sandy Bridge class processor at 2.9 GHz in 
about 675 milliseconds, and with only one thread in a little over 1000 
milliseconds. This isn't surprising, as with such trivial work (almost none), 
most of the time expended will be in just running the sync'ing.

However, when run on my Cygwin system (a Windows Intel Sandy Bridge class 
i3-2100 CPU with 2 cores/4 threads at 3.1 GHz, compiled with Chapel version 
1.22 and gcc version 9.3.0) using the default 4 threads, it runs much, much 
slower than the above: about 3000 to 5000 milliseconds for only 10000 loops 
with all four threads. That is slow, but perhaps can be attributed to pthreads 
being slower than qthreads (although the variable results are worrisome).

Furthermore, the shocker is that when the number of threads is dropped to, say, 
only two, it can run only 1000 loops in about 2500 milliseconds (highly 
variable upwards from this), and with 1000 loops run on only one thread 
(argument of \--NUMTHRDS=1), it takes 16,000 milliseconds (and sometimes more). 
Obviously, taking over ten milliseconds per task start prevented me from 
effectively running tasks which take about one millisecond each, as the 
overhead per task is over ten times the actual work; this explains why I had to 
implement the thread pool.
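For reference, the thread-pool idea amounts to something like the following untested sketch (in the style of the benchmark above, with illustrative names): a fixed set of long-lived tasks started once with `begin`, fed work items through sync variables, so each unit of work costs only a sync hand-off rather than a fresh task spawn:

```chapel
config const NUMLOOPS = 1000;
config const POOLSIZE = here.maxTaskPar;

var workq: [ 0 .. POOLSIZE - 1 ] sync int;  // work items in, one slot per worker
var rsltq: [ 0 .. POOLSIZE - 1 ] sync int;  // results handed back out

// start the long-lived workers once, up front
for wi in 0 .. POOLSIZE - 1 {
  begin with (const in wi) {
    while true {
      const v = workq[wi];   // blocks until a work item arrives
      if v < 0 then break;   // negative sentinel = shut this worker down
      rsltq[wi] = v;         // the (trivial) work, as in the benchmark
    }
  }
}

// same drive loop shape as the benchmark, but sync hand-offs, no `begin`s
for i in 0 .. POOLSIZE - 1 do workq[i] = i;  // preload one item per worker
var qi = 0, sum = 0;
for i in POOLSIZE .. NUMLOOPS + POOLSIZE {
  sum += rsltq[qi];
  if i <= NUMLOOPS then workq[qi] = i;
  qi = if qi >= POOLSIZE - 1 then 0 else qi + 1;
}
for wi in 0 .. POOLSIZE - 1 do workq[wi] = -1;  // retire the workers
writeln("The sum is ", sum);
```

The sentinel assumes work items are non-negative; a real pool would use a proper termination protocol.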

This seems to indicate that you do indeed have a problem with your pthreads 
implementation, as I suspected before (even before I understood the difference 
between "qthreads" and "fifo" and which is the default where).

Why is "fifo" the default with Cygwin? Are there problems compiling with 
"qthreads" on Cygwin?

While we are on the subject of Chapel problems I encountered (and acknowledging 
your quick fix of my reported issue related to initializing a shared nilable 
sync field for a generic record/class, thank you very much), the other 
distressing problem I encountered was the slowness of Hash 
Tables/Associative Arrays, as seen in [this submission to the Sieve of 
Eratosthenes on 
RosettaCode](https://rosettacode.org/wiki/Sieve_of_Eratosthenes#Hash_Table_Based_Odds-Only_Version)
 (the first version; it was fixed with a roll-your-own Hash Table 
implementation in the second version). Perhaps you would like to investigate 
why that was so slow and not O(1) amortized access as expected (and as my 
roll-your-own Hash Table version has). The same test machine and environment 
was used, but the same slowness and non-O(1) performance seems to be common to 
the 
["tio.run"/Linux](https://tio.run/##hVPNctowEL77KTZcagZwTI845tRDD22Sackpw0HYa6xBtoxWQNNOXr10JcWYTjNTLtKutd/P7lLUokN1Pt/egkWyst0CyaZTCLY2KEpOJEkSRQdCWMkGsygqdFvJLfBBFu6fvn55eHj8DjnM0zTNIkGExsaX/DKHdAqjS9wcuGqD8BONBm1AbPQRb0bj7B/c1edvnxxujQaTRvxYCdo9CnNNEZ4se4YQ9gzCgkLBd90GgpORFlUbj@4PzQYN6AqU1h0tAEbTixOGymB4EdowvPEcDBYdhQFDytKevz1DCkkyqJ7BHNZAL20BsrX8ujO6gFKfzC7ey4VLTuHozzH8iuAN6Xkv1@z4mIGbh2eGkzY7vrObUiO1H9zJHovaqRPti615RjfRa1BkeUZm4UdlshAlZAW3izUzaGeQTTOsrfsRQ6e18jRRxQORLOodN1ci5ToxSOgguc9bfn6StoY4TI5DOX6zKqd8z@DVS9tLtpZm4O50aHwwMF7IAnFYlsmQdvyuapJf9SrjpKwY4C4fithZ@x9d@yCMq70ohuBzmf/t2OOkgIpXnz9PYJ5xl/uW6i7uVyoerbiZTpwktyV84yVyhNdbFwpRiY6wjN2AnlppKWmkUpKQRZY0dnXXCb@35/N5Nuvt5e5vxr@Q8mrzj7@LSoktca7ihf8D)
 environment. Perhaps it is an error on my part? 
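For context, the pattern in question is (roughly, with illustrative names, not the exact RosettaCode code) just an associative domain used as a hash table, which ought to give O(1) amortized insert and lookup:

```chapel
// Untested sketch of the odds-only incremental-sieve hash-table pattern:
// an associative domain keyed by composite numbers, with a parallel array
// mapping each composite to the base prime that strikes it.
var composites: domain(int);       // associative domain acting as hash keys
var baseprimes: [composites] int;  // value = base prime for that composite

proc cullAdd(c: int, p: int) {
  var nc = c;
  while composites.contains(nc) do nc += 2 * p;  // probe to a free slot
  composites += nc;                // insert key: O(1) amortized, expected
  baseprimes[nc] = p;              // array resizes with its domain
}
```

If inserts/lookups on such a domain are not amortized O(1) in practice, that would explain the observed slowdown.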
