Mark Adams <[email protected]> writes:

> No this is me. They will probably have about 30K (2D linear FE) equations
> per 40 Tflop node. 10% (4 Tflops) is too much resources for 30K equations
> as it is. No need to try utilize the GPU as far as I can see.

With multiple POWER9 sockets per node, you have to deal with NUMA and separate caches. The rest of the application is not going to handle that with threads, so you'll have multiple MPI processes anyway. The entire problem will fit readily in L2 cache, so you already have a latency problem on the CPU alone. Ask them to make neighborhood collectives fast.
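
As a rough back-of-the-envelope check on the cache claim, the sketch below estimates the working set for 30K equations per node. Only the 30K figure comes from the thread; the 9 nonzeros per row (a bilinear quad stencil), CSR storage with 8-byte values and 4-byte indices, 4 work vectors, 40 MPI ranks per node, and 256 KiB of L2 per rank are all assumptions chosen for illustration, so substitute the real numbers for the target machine.

```python
def csr_bytes(rows, nnz_per_row, value_bytes=8, index_bytes=4):
    """Bytes for a CSR matrix: values + column indices + row pointers."""
    nnz = rows * nnz_per_row
    return nnz * (value_bytes + index_bytes) + (rows + 1) * index_bytes


def working_set_bytes(rows, nnz_per_row, n_vectors=4, value_bytes=8):
    """CSR matrix plus a few solver work vectors of length `rows`."""
    return csr_bytes(rows, nnz_per_row) + n_vectors * rows * value_bytes


EQUATIONS_PER_NODE = 30_000      # from the thread
NNZ_PER_ROW = 9                  # assumption: 2D bilinear quad stencil
RANKS_PER_NODE = 40              # assumption: hypothetical ranks per node
L2_BYTES_PER_RANK = 256 * 1024   # assumption: adjust to the actual hardware

rows_per_rank = EQUATIONS_PER_NODE // RANKS_PER_NODE
per_rank = working_set_bytes(rows_per_rank, NNZ_PER_ROW)
per_node = working_set_bytes(EQUATIONS_PER_NODE, NNZ_PER_ROW)

print(f"rows per rank:        {rows_per_rank}")
print(f"working set per rank: {per_rank / 1024:.1f} KiB")
print(f"working set per node: {per_node / 2**20:.2f} MiB")
print(f"fits in assumed L2:   {per_rank < L2_BYTES_PER_RANK}")
```

Under these assumptions each rank holds only a few hundred rows, so the time per iteration is governed by message and synchronization latency rather than by bandwidth or flops, which is why fast neighborhood collectives matter more here than the GPU.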
