@mratsim

> Not sure if it's the only bug but for now this is the one to fix for 
> multithreading on --newruntime #11844.

It seems that @Araq has pushed a commit to fix that issue; all that remains is 
to compile the devel branch and test, or to wait for a new release...

As to exploring alternative multi-threading approaches:

> approaching the limits of OpenMP (nested parallelism especially)

I rejected OpenMP early as it is too C'ish and therefore somewhat kludgy, 
although I have used it before with C and can see how its use can be expanded 
beyond the basics available "out-of-the-box".

> raw Threads is a too low-level abstraction

At first I thought using the low-level API would be the answer, as in easily 
emulating GoLang's green threads, with channels working like GoLang channels. 
However, it kept getting more complex due to not having GoLang's multi-threaded 
GC and therefore all of Nim's limitations on the use of channels, meaning that 
one has to stoop to low-level features such as Lock and Cond and essentially 
build a Nim version of Haskell's "MVar" to pass through the channels. This has 
caused me to explore @Araq's recommended threadpool.
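For illustration, such a Lock/Cond-based "MVar" might be sketched as follows (a minimal single-slot version with hypothetical names, not the full Haskell semantics; the object should not be copied after initialization):

```nim
import locks

type
  MVar[T] = object
    ## A single-slot synchronized box: `put` blocks while full,
    ## `take` blocks while empty. Do not copy after initMVar.
    lock: Lock
    cond: Cond
    full: bool
    value: T

proc initMVar[T](): MVar[T] =
  initLock(result.lock)
  initCond(result.cond)

proc put[T](mv: var MVar[T]; v: T) =
  acquire(mv.lock)
  while mv.full:                 # wait until the slot is empty
    wait(mv.cond, mv.lock)
  mv.value = v
  mv.full = true
  signal(mv.cond)                # wake a waiting taker
  release(mv.lock)

proc take[T](mv: var MVar[T]): T =
  acquire(mv.lock)
  while not mv.full:             # wait until the slot is filled
    wait(mv.cond, mv.lock)
  result = mv.value
  mv.full = false
  signal(mv.cond)                # wake a waiting putter
  release(mv.lock)

var mv = initMVar[int]()
mv.put(42)
echo mv.take()
```

This is exactly the kind of boilerplate one ends up hand-rolling on top of raw Threads, which is what makes that route feel too low-level.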

> Nim's FlowVars have contention issues and no load balancing

Funny you should say that about FlowVars, as your link to the "pibench2" 
project doesn't use threadpool and therefore doesn't use FlowVars. I agree that 
the linked project using raw threads isn't very efficient and should be 
re-written to use threadpool and spawn, as per the following:
    
    
    # Compute PI in an inefficient way but multi-threaded...
    from cpuinfo import countProcessors
    from threadpool import FlowVar, spawn, `^`
    from times import epochTime
    
    const DELTA = 1_000_000 # number of terms calculated per thread call!
    
    let numprocs = countProcessors()
    
    proc pi(numterms: int64): float =
      result = 0
      proc terms(k, klmt: int64): float {.inline.} =
        result = 0
        for i in countup(k, klmt, 2):
          let frst = i + i + 1; result += 1 / frst.float - 1 / (frst + 2).float
    # [ # take out the space between the `#` and the `]` to comment this out!
      var rslt = newSeq[FlowVar[float]](numprocs) # acts as a rotary queue!
      for i in 0 ..< numprocs: # preload waiting queue!
        let k = i * DELTA; let klmt = k + (if k + DELTA > numterms: numterms - k else: DELTA - 1)
        rslt[i] = terms(k, klmt).spawn
      for i in 0 .. numterms div DELTA:
        let id = i mod rslt.len; result += ^rslt[id] # get values from queue!
        let k = (i + numprocs) * DELTA
        let klmt = k + (if k + DELTA > numterms: numterms - k else: DELTA - 1)
        rslt[id] = terms(k, klmt).spawn # add new tasks to queue! # ]#
    #[ # put a space between the `#` and the `]` to uncomment this!
      for k in countup(0, numterms, DELTA): # do it without threading for comparison
        let xtra = (if k + DELTA > numterms: numterms - k else: DELTA - 1)
        result += terms(k, k + xtra) # ]#
      result *= 4
    
    let start = epochTime()
    let answr = pi(10_000_000_000)
    let elpsd = (epochTime() - start) * 1000.0
    echo answr, " in ", elpsd, " milliseconds."
    
    

The above code does the same calculation as "pibench2" with default settings, 
which computes 10^10 = 10 billion terms for 10 "digits". One can vary DELTA to 
see that scaling across threads starts to fall off somewhere below about a 
hundred thousand terms per task, when each thread task takes something under 
about a millisecond.

The above takes about 3.X seconds on a Sandy Bridge CPU with three cores at 2.5 
GHz to do a tenth of the job, as per the following WandBox link: 
[https://wandbox.org/permlink/9gOlb1WsjjgWFGIm](https://wandbox.org/permlink/9gOlb1WsjjgWFGIm), 
but that isn't too certain, as two of the three threads will likely be sharing 
cores. It takes about 54.2 seconds on my Intel Atom x5-Z8350 Windows tablet CPU 
(Silvermont) at 1.44 GHz with four cores, and about 36.25 seconds on my Intel 
i3-2100 at 3.1 GHz with two cores and four threads.

From all of that, your processor, with half again the clock speed of this last 
i3, nine times the number of cores, and being seven generations newer, might 
make your 18-core/36-thread machine able to run this in half a second, though I 
would more expect it to take about two to three seconds. One can comment out 
the multi-threading code and uncomment the single-threaded code to see that the 
code scales with the number of cores, apart from the uncertainty introduced by 
Hyperthreading/Simultaneous Multi-Threading and by how many floating-point 
execution units there are per core and how the CPU uses them.

It is easy to calculate from these times (especially the single-threaded ones) 
that each term calculation takes about 30 CPU clock cycles on these 
older-generation CPUs, which seems about right and explains why we need at 
least about a hundred thousand terms calculated per thread task to make task 
switching reasonably efficient. The number of clock cycles per floating-point 
calculation should, however, be considerably lower on more modern CPUs.

spawn doesn't load balance any more than green threads in GoLang or Haskell do; 
it is up to the programmer to implement an algorithm that does this themselves. 
spawn tasks do have the advantage that they avoid a lot of context-switching 
"thread spin-up" time, since the threads are allocated from the pool, and thus 
multi-threading can be "finer" - still efficient for thread tasks taking 
shorter amounts of time, such as less than about a millisecond rather than 
about ten milliseconds. It would be nice if channels worked with them, but it 
isn't that hard to implement a pool of FlowVars as done in the example code 
above, so channels aren't absolutely necessary.

It looks to me as though the use of threadpool/spawn/FlowVar may be the best 
option and, if some automatic load balancing is required, one could build a 
module on top that does this.
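As a sketch of what the core of such a module might look like (hypothetical `parallelSum` name; a fixed-chunk splitter over threadpool, compiled with `--threads:on` - the pool handing tasks to idle threads is what gives the simple balancing):

```nim
import threadpool

proc partial(lo, hi: int): float =
  ## Example worker: sums 1.0 for each index in [lo, hi).
  for i in lo ..< hi:
    result += 1.0

proc parallelSum(n, chunk: int): float =
  ## Splits [0, n) into fixed-size chunks and spawns one task per
  ## chunk; idle pool threads pick up pending tasks, so uneven chunk
  ## costs get spread across the pool.
  var pending: seq[FlowVar[float]] = @[]
  var lo = 0
  while lo < n:
    let hi = min(lo + chunk, n)
    pending.add(spawn partial(lo, hi))
    lo = hi
  for fv in pending:
    result += ^fv              # blocking read of each FlowVar

echo parallelSum(1000, 64)     # prints 1000.0
```

A real module would take the worker proc as a parameter and perhaps size chunks adaptively, but the shape is the same as the rotary queue in the pi example above.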

However, we are still left with deepCopy not being compatible with refs and 
nested refs such as in closures, which is why I am hoping that the new runtime 
may help. Then again, it may not, as the concept of ownership doesn't really 
fly well across threads unless ownership can be transferred to each thread, and 
such ownership transfer pretty much always means a deepCopy, reference-counted 
memory management, or a multi-threaded Garbage Collection (which we are trying 
to avoid).

I'll do a little more work and report back as to its use for this primes 
sieve...
