Mark Adams <[email protected]> writes:
> No, this is me.  They will probably have about 30K (2D linear FE) equations
> per 40 Tflop node.  10% (4 Tflops) is too many resources for 30K equations
> as it is.  No need to try to utilize the GPU as far as I can see.

With multiple POWER9 sockets per node, you have to deal with NUMA and
separate caches.  The rest of the application is not going to handle
that with threads, so you'll have multiple MPI processes anyway.  The
entire problem will fit readily in L2 cache, so you have a latency
problem on the CPU alone.  Ask them to make neighborhood collectives
fast.
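
A rough back-of-envelope sketch of the cache claim, in Python.  Only the
30K equations per node comes from this thread.  Everything else (9
nonzeros per row, CSR storage with 8-byte values and 4-byte indices,
three work vectors, 40 MPI ranks per node, 512 KiB of L2 per rank) is an
assumption for illustration, not a measurement of the application or the
machine.

```python
def csr_bytes(n_rows, nnz_per_row, value_bytes=8, index_bytes=4):
    """Approximate CSR storage: values + column indices + row pointers."""
    nnz = n_rows * nnz_per_row
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes


def working_set_bytes(n_rows, nnz_per_row, n_vectors=3, value_bytes=8):
    """Matrix plus a few dense work vectors (solution, RHS, residual)."""
    return csr_bytes(n_rows, nnz_per_row) + n_vectors * n_rows * value_bytes


N_EQUATIONS = 30_000    # per node, from the thread
NNZ_PER_ROW = 9         # assumption: scalar problem, 9-point stencil
RANKS_PER_NODE = 40     # hypothetical: one MPI rank per core
L2_BYTES = 512 * 1024   # assumption: L2 capacity available to one rank

rows_per_rank = N_EQUATIONS // RANKS_PER_NODE
per_rank = working_set_bytes(rows_per_rank, NNZ_PER_ROW)
per_node = working_set_bytes(N_EQUATIONS, NNZ_PER_ROW)

print(f"rows per rank:        {rows_per_rank}")
print(f"working set per rank: {per_rank / 1024:.1f} KiB")
print(f"working set per node: {per_node / 2**20:.2f} MiB")
print(f"fits in assumed L2:   {per_rank < L2_BYTES}")
```

Under these assumptions each rank holds only a few hundred rows, so the
local work is tiny and the time goes to exchanging halo data with
neighbors, which is why the latency of that exchange matters.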
