> 1) Based on the above info, my understanding is that there are (roughly speaking) three different parallel approaches in Nim; is this correct?
>
> - OpenMP (via `||`)
> - Weave
> - Threads, channels, etc., as described in the Nim manual and "Nim in Action", Chap. 6 (and also the experimental parallel statement: [https://nim-lang.org/docs/manual_experimental.html#parallel-amp-spawn-parallel-statement](https://nim-lang.org/docs/manual_experimental.html#parallel-amp-spawn-parallel-statement))
Yes. You have OpenMP and raw threads (via `createThread`/`joinThread`); Nim's [threadpool](https://nim-lang.org/docs/threadpool.html), @yglukhov's [threadpool](https://github.com/yglukhov/threadpools) and Weave are built on top of the raw threads. The difference between a threadpool and Weave (a multithreading runtime) is that a runtime is a threadpool + a scheduler to deal with load imbalance.

> 2) In both OpenMP and Weave, is it essential to first cast seq to UncheckedArray and use the latter in a parallel region (e.g., inside a for-loop with `||`, or a block within `init(Weave) ... exit(Weave)`)? In that case, if we want to use Tensor (in Arraymancer) in parallel regions, is it necessary to first cast it to UncheckedArray somehow by getting a raw (unsafe) data pointer?

No. Internally, tensors already use OpenMP (at the moment, if compiled with `-d:openmp`) and Weave (in the future). At the moment you can't parallelize on top of tensors, as OpenMP doesn't really support nesting.

> 3) In the example code of Weave for matrix transpose, captures are used for some variables:
>
> ```nim
> parallelFor j in 0 ..< N:
>   captures: {M, N, bufIn, bufOut}
>   parallelFor i in 0 ..< M:
>     captures: {j, M, N, bufIn, bufOut}
> ```
>
> Here, is the meaning of captures similar to shared in OpenMP (roughly speaking)...?

No, the captures are copied (but `bufIn`/`bufOut` are pointers to a memory location, so multiple threads access the same memory, each through their own copy of the pointer). This requires the algorithm to not have data races, or to use synchronization.

> 4) I have installed Weave-0.4.0 (+ Nim-1.2.2) and tried the matrix transpose code shown in the Github page. Here, I also added the below code at the end of main() to print some array elements:
>
> ```nim
> # In proc main():
> ...
> init(Weave)
> transpose(M, N, bufIn, bufOut)
> exit(Weave)
>
> # Show some elements.
> echo("input [2 * N + 5] = ", input[2 * N + 5], " (= ", bufIn[2 * N + 5], ")")
> echo("input [5 * N + 2] = ", input[5 * N + 2], " (= ", bufIn[5 * N + 2], ")")
> echo()
> echo("output [2 * M + 5] = ", output[2 * M + 5], " (= ", bufOut[2 * M + 5], ")")
> echo("output [5 * M + 2] = ", output[5 * M + 2], " (= ", bufOut[5 * M + 2], ")")
> ```
>
> Compiling as `nim c --threads:on test.nim` gives the expected result:
>
> ```
> input [2 * N + 5] = 4005.0 (= 4005.0)
> input [5 * N + 2] = 10002.0 (= 10002.0)
>
> output [2 * M + 5] = 10002.0 (= 10002.0)
> output [5 * M + 2] = 4005.0 (= 4005.0)
> ```
>
> On the other hand, if I moved exit(Weave) after all the above echo statements, the result changes to
>
> ```
> input [2 * N + 5] = 4005.0 (= 4005.0)
> input [5 * N + 2] = 10002.0 (= 10002.0)
>
> output [2 * M + 5] = 0.0 (= 0.0)
> output [5 * M + 2] = 0.0 (= 0.0)
> ```
>
> Does this mean that exit(Weave) has the role of some "synchronization"(?) for parallel calculations, and so is mandatory before accessing any UncheckedArray used in the parallelFor regions?

It's not intended as such. The synchronizations that Weave supports are:

* `syncRoot`, to empty the tasks in the runtime; it is only callable from the root thread (i.e. the one that called `init(Weave)`). When you call `exit(Weave)`, a `syncRoot(Weave)` is done internally to make sure that there is no task left in the runtime, and only then is the runtime shut down; hence the synchronization you noticed.
* `sync(x)`, with `x` a handle returned from:
  * `spawn`
  * awaitable for loops
  * for loops with reduction
* `spawnOnEvent(event, foo)` and `spawnOnEvents(event1, event2, event3, foo)`, with `event` a handle created by `newFlowEvent` and `foo` a function that will be scheduled only when `trigger` is called on the event.
  This allows modeling task dependencies beyond the caller/callee relationship:
  * a for loop triggered on an event
  * a for loop with individual iterations dependent on individual iterations from another for loop. This allows fine-grained modeling: a first for loop A processes data, for example an image, and when a pixel or an iteration A[i] is ready, the corresponding transformation B[i] in the next loop B is scheduled directly, even if A[i-1] or A[i+1] are not ready yet. This is particularly useful in iterative algorithms like this one, where after p[i][j] is done you can start iteration t+1 without waiting for p[i+2][j+2].

> 5) Again, in the matrix transpose code above, the input (of type seq) is cast to bufIn (of type UncheckedArray) by using the address obtained from input[0].unsafeAddr. On the other hand, .addr is used to cast output to bufOut. Is this difference important, or is it actually OK whichever of .unsafeAddr or .addr is used?
>
> ```nim
> let input = newSeq[float32](M * N)
> let bufIn = cast[ptr UncheckedArray[float32]]( input[0].unsafeAddr )
> ...
> var output = newSeq[float32](N * M)
> let bufOut = cast[ptr UncheckedArray[float32]]( output[0].addr )
> ```
>
> I am sorry again for many questions! These are not urgent at all (I'm still learning more basic stuff & syntax right now), but I would appreciate any hints and inputs again. Thanks very much :)
>
> PS. "seems like you love tea ;)" Yes, I recently like to drink Rooibos tea (particularly at night), though I like coffee in the morning :)

`unsafeAddr` is for when the parameter you take the address of is immutable. When you use `let` or an immutable parameter in a function signature, you promise that it's not mutable and the compiler enforces that; but once you have a pointer, the compiler cannot help you anymore, hence you need to "opt in" to this unsafe mode.
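To make the `.addr` vs `.unsafeAddr` distinction concrete, here is a minimal sketch (variable names are made up; assumes any recent Nim). `.addr` only compiles for mutable locations, so for a `let` binding you have to opt in explicitly with `.unsafeAddr`:

```nim
let immutableSeq = @[1.0'f32, 2.0, 3.0]   # `let`: immutable binding
var mutableSeq   = @[4.0'f32, 5.0, 6.0]   # `var`: mutable binding

# For the immutable seq, only .unsafeAddr compiles; this is the explicit
# promise that you take responsibility for not mutating through the pointer.
let p1 = cast[ptr UncheckedArray[float32]](immutableSeq[0].unsafeAddr)

# For the mutable seq, plain .addr is fine.
let p2 = cast[ptr UncheckedArray[float32]](mutableSeq[0].addr)

# echo immutableSeq[0].addr   # would not compile: `let` has no mutable address

echo p1[1]           # prints 2.0: reading through the pointer is fine
p2[0] = 42.0'f32     # mutating through p2 is fine: the underlying seq is `var`
echo mutableSeq[0]   # prints 42.0
```

So in the transpose example the difference only reflects that `input` is `let` and `output` is `var`; once you hold the raw pointer, the compiler no longer distinguishes them.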
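The implicit barrier in `exit(Weave)` described in answer 4 can be demonstrated with a minimal sketch (assumes Weave 0.4.x, compiled with `nim c --threads:on`; the buffer and sizes are illustrative). Calling `syncRoot(Weave)` explicitly is the intended way to make results visible before reading them:

```nim
import weave

proc main() =
  var buf = newSeq[float64](1000)
  # Raw pointer so the parallel loop can write without GC-safety issues.
  let p = cast[ptr UncheckedArray[float64]](buf[0].addr)

  init(Weave)
  parallelFor i in 0 ..< 1000:
    captures: {p}   # the pointer is copied per task; the memory is shared
    p[i] = 2 * float64(i)

  # Without a barrier here, tasks may still be in flight and `buf` may
  # hold stale zeroes. exit(Weave) runs syncRoot internally, which is
  # why placing the echo after exit(Weave) gave the expected values.
  syncRoot(Weave)
  echo buf[999]     # safe to read here: all tasks have completed
  exit(Weave)

main()
```

This mirrors the behaviour observed in question 4: the zeroes appeared only because the reads happened before the implicit `syncRoot` inside `exit(Weave)`.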
