Whenever one speaks about MPI+threads there is a question of how many threads 
t per MPI process one is talking about. (An equivalent way of stating this is 
how many MPI processes one has per node.)

   Is t 2? 

   Is t 4? 

   Is it the number of hardware threads on a "single memory socket"? 

   Is it the number of hardware threads on a "CPU"? 

   Is it the number of hardware threads on the entire node? 

   Depending on this one has very different challenges for getting the best 
performance.
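
   To make the equivalence between t and MPI processes per node concrete, 
here is a minimal sketch. The node layout (2 sockets, 16 cores per socket, 4 
hardware threads per core) is made up for illustration and is not a 
description of any particular machine; the function names are hypothetical 
too. It treats "socket" and "node" as the only two hardware levels and 
assumes every hardware thread is used.

```python
# Assumed, illustrative node layout (not taken from any real machine).
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 16
HW_THREADS_PER_CORE = 4


def hw_threads_per_socket(cores=CORES_PER_SOCKET, smt=HW_THREADS_PER_CORE):
    """Hardware threads on one socket."""
    return cores * smt


def hw_threads_per_node(sockets=SOCKETS_PER_NODE):
    """Hardware threads on the entire node."""
    return sockets * hw_threads_per_socket()


def ranks_per_node(t):
    """MPI processes per node when each process runs t threads.

    Assumes every hardware thread is used and t divides the total evenly.
    """
    total = hw_threads_per_node()
    if t < 1 or total % t != 0:
        raise ValueError(f"t={t} does not evenly divide {total} hardware threads")
    return total // t


def candidate_thread_counts():
    """The choices of t listed in the questions above, for the assumed layout."""
    return {
        "t = 2": 2,
        "t = 4": 4,
        "t = one socket": hw_threads_per_socket(),
        "t = whole node": hw_threads_per_node(),
    }


if __name__ == "__main__":
    for label, t in candidate_thread_counts().items():
        print(f"{label:15s} threads/process = {t:3d}  MPI processes/node = {ranks_per_node(t)}")
```

   Under these assumptions the choices span from 64 MPI processes per node 
(t = 2) down to a single process per node (t = whole node), which is the range 
the questions above are asking about.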

   Are PETSc solvers like GAMG supposed to deliver great performance across 
the whole range? 

   Jed seems to be hinting at having a relatively small t, possibly small 
compared to the total number of hardware threads on the node. Is this correct, 
Jed? Could we assume in PETSc that it is always small (and thus that some of 
the performance challenges are gone)?

   Barry




> On Jan 9, 2015, at 9:44 AM, Jed Brown <[email protected]> wrote:
> 
> Mark Adams <[email protected]> writes:
>> No this is me.  They will probably have about 30K (2D linear FE) equations
>> per 40 Tflop node.  10% (4 Tflops) is too much resources for 30K equations
>> as it is.  No need to try utilize the GPU as far as I can see.
> 
> With multiple POWER9 sockets per node, you have to deal with NUMA and
> separate caches.  The rest of the application is not going to do this
> with threads, so you'll have multiple MPI processes anyway.  The entire
> problem will fit readily in L2 cache and you have a latency problem on
> the CPU alone.  Ask them to make neighborhood collectives fast.
