@mratsim
> Not sure if it's the only bug but for now this is the one to fix for
> multithreading on --newruntime #11844.
It seems that @Araq has pushed a commit to fix that issue; all that remains is
to compile the DEVEL branch and test or wait for a new release...
As to exploring the different multi-threading alternatives:
> approaching the limits of OpenMP (nested parallelism especially)
I rejected OpenMP early as it is too C'ish and therefore somewhat kludgy,
although I have used it before with C and can see how its use can be expanded
beyond the basics available "out-of-the-box".
> raw Threads is a too low-level abstraction
At first I thought using the low-level Threads would be the answer, as in
easily emulating GoLang's green threads, with channels working like GoLang
channels; however, it kept getting more complex due to the limitations of not
having GoLang's multi-threaded GC and therefore all of the Nim limitations on
the use of channels, meaning that one has to stoop to low-level features such
as Lock and Cond, essentially building a Nim version of Haskell's "MVar"
(sketched below) to pass through the channels. This has caused me to explore
@Araq's recommended threadpool.
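Just to illustrate what I mean, the following is a minimal sketch of the kind
of MVar-like cell one ends up hand-rolling with Lock and Cond (simplified; the
MVar name and procs are just for illustration, and it only really side-steps
the GC restrictions for values that don't contain ref's):

import locks

type
  MVar[T] = object
    lock: Lock
    cond: Cond
    full: bool
    value: T

proc init[T](mv: var MVar[T]) =
  initLock(mv.lock)
  initCond(mv.cond)

proc put[T](mv: var MVar[T]; x: T) =
  acquire(mv.lock)
  while mv.full:                # wait until the cell is empty
    wait(mv.cond, mv.lock)
  mv.value = x
  mv.full = true
  signal(mv.cond)               # wake a waiting taker
  release(mv.lock)

proc take[T](mv: var MVar[T]): T =
  acquire(mv.lock)
  while not mv.full:            # wait until the cell is full
    wait(mv.cond, mv.lock)
  result = mv.value
  mv.full = false
  signal(mv.cond)               # wake a waiting putter
  release(mv.lock)

A producer thread then calls put and a consumer thread calls take, which is a
lot of ceremony just to hand one value across.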
> Nim's FlowVars have contention issues and no load balancing
Funny you should say that about FlowVar's, as your link to the "pibench2"
project doesn't use threadpool and therefore doesn't use FlowVar's. I agree
that the linked project using raw threads isn't very efficient and should be
re-written to use threadpool and spawn, as per the following:
# Compute PI in an inefficient way but multi-threaded...
from cpuinfo import countProcessors
from threadpool import FlowVar, spawn, `^`
from times import epochTime

const DELTA = 1_000_000 # number of terms calculated per thread call!
let numprocs = countProcessors()

proc pi(numterms: int64): float =
  result = 0
  proc terms(k, klmt: int64): float {.inline.} =
    result = 0
    for i in countup(k, klmt, 2):
      let frst = i + i + 1; result += 1 / frst.float - 1 / (frst + 2).float
  # [ # take out the space between the `#` and the `[` to comment this out!
  var rslt = newSeq[FlowVar[float]](numprocs) # acts as a rotary queue!
  for i in 0 ..< numprocs: # preload waiting queue!
    let k = i * DELTA
    let klmt = k + (if k + DELTA > numterms: numterms - k else: DELTA - 1)
    rslt[i] = terms(k, klmt).spawn
  for i in 0 .. numterms div DELTA:
    let id = i mod rslt.len; result += ^rslt[id] # get values from queue!
    let k = (i + numprocs) * DELTA
    let klmt = k + (if k + DELTA > numterms: numterms - k else: DELTA - 1)
    rslt[id] = terms(k, klmt).spawn # add new tasks to queue! # ]#
  #[ # put a space between the `#` and the `[` to uncomment this!
  for k in countup(0, numterms, DELTA): # do it without threading for comparison
    let xtra = (if k + DELTA > numterms: numterms - k else: DELTA - 1)
    result += terms(k, k + xtra) # ]#
  result *= 4

let start = epochTime()
let answr = pi(10_000_000_000)
let elpsd = (epochTime() - start) * 1000.0
echo answr, " in ", elpsd, " milliseconds."
The above code (compiled with --threads:on) does the same calculations as
"pibench2" with its default settings, which is 10^10 = 10 billion terms for 10
"digits". One can vary DELTA to see that scaling across threads starts to fall
off somewhere below about a hundred thousand terms per task, which is when
each thread task takes something on the order of a millisecond.
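To see where that limit comes from, one can time a single task in isolation
with something like the following rough standalone sketch (mine, not part of
the benchmark; the numbers will of course vary by machine):

from times import epochTime

proc terms(k, klmt: int64): float = # same inner loop as in the example above
  result = 0
  for i in countup(k, klmt, 2):
    let frst = i + i + 1
    result += 1 / frst.float - 1 / (frst + 2).float

const CHUNK = 100_000 # number of terms in one task; try other sizes here
let strt = epochTime()
let part = terms(0, CHUNK - 1) # one task's worth of work, single threaded
echo "one task of ", CHUNK, " terms: ", (epochTime() - strt) * 1000.0, " ms; partial sum ", part

At roughly 30 clock cycles per term (see below), a hundred-thousand-term task
comes out at around a millisecond on these older CPU's.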
The above takes about 3.X seconds to do a tenth of the job on a Sandy Bridge
CPU with three cores at 2.5 Gigahertz, as per the following WandBox link:
[https://wandbox.org/permlink/9gOlb1WsjjgWFGIm](https://wandbox.org/permlink/9gOlb1WsjjgWFGIm)
although that isn't too certain, as two of the three threads will likely be
sharing cores; it takes about 54.2 seconds on my Intel Atom x5-Z8350
(Silvermont) Windows tablet CPU at 1.44 GHz with four cores, and about 36.25
seconds on my Intel i3-2100 at 3.1 GHz with two cores and four threads. From
all of that, your 18 core/36 thread machine, with a clock speed half again
faster than this last, nine times the number of cores, and seven generations
newer, might be able to run this in half a second, but I would more expect it
to take about two to three seconds. One can comment out the multi-threading
code and uncomment the single-threaded code to see that the code scales with
the number of cores, apart from the uncertainty introduced by
Hyperthreading/Simultaneous Multi-Threading and by how many floating point
execution units there are per core and how the CPU uses them. It is easy to
calculate from these times (especially the single-threaded ones) that it takes
about 30 CPU clock cycles per term calculation on these older generation
CPU's, which seems about right and explains why we need at least about a
hundred thousand terms calculated per thread task to make task switching
reasonably efficient. However, the number of clock cycles for the floating
point calculations could be greatly reduced on more modern CPU's.
spawn doesn't load balance any more than green threads in GoLang or Haskell
do, and it is up to the programmer to implement an algorithm to do this
themselves; spawn tasks do have the advantage that they avoid a lot of the
context switching "thread spin-up" time, since the threads are allocated from
the pool, and thus the multi-threading can be "finer" - still efficient for
thread tasks down to about a millisecond rather than about ten milliseconds.
It would be nice if channels worked with them, but it isn't that hard to
implement a pool of FlowVar's as done in the above example code, so that isn't
absolutely necessary.
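For anyone not familiar with the pattern, the bare spawn/FlowVar usage that
the above rotary queue is built on boils down to something like this trivial
standalone sketch (square is just a dummy stand-in task):

import threadpool

proc square(x: int): int = x * x # dummy task standing in for real work

var futures: seq[FlowVar[int]]
for i in 1 .. 8:
  futures.add(spawn square(i)) # handed to an already-running pool thread
var total = 0
for fv in futures:
  total += ^fv # blocks only until that particular task has finished
echo total # 1 + 4 + 9 + ... + 64 = 204

The rotary queue in the PI example just keeps the number of outstanding
FlowVar's fixed at numprocs instead of letting them pile up.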
It looks to me like the use of threadpool/spawn/FlowVar may be the best
option, and if some automatic load balancing is required, one can build a
module on top that does this.
However, we are still left with deepCopy for ref's and nested ref's, such as
in closures, not being compatible, which is why I am hoping that the new
runtime may help. But then again, it may not, as the concept of ownership
doesn't really fly well across threads unless ownership can be transferred to
each thread, and such ownership transfer pretty much always means a deepCopy,
reference counted memory management, or a multi-threaded Garbage Collection
(which we are trying to avoid).
I'll do a little more work and report back as to its use for this primes
sieve...