Hi,

One of our users has noticed that process binding is disabled in 2.0.0 when --oversubscribe is passed. This is hurting their performance, most likely because processes migrate between sockets. The change appears to come from commit 294793c (PR#1228).

They need --oversubscribe because the application's developers decided to run two processes for each MPI task (a compute process and an I/O worker process, I think). Since the second process in each pair is mostly idle, there's (almost) no harm in launching two processes per core, and it's better than leaving half the cores idle most of the time. In previous versions they bound each pair to a core and let the hyper-threads argue over which of the two processes to run, since this gave the best performance.

I tried creating a rankfile that binds each process to its own hardware thread, but mpirun refuses to launch more processes than there are cores (even if all of those processes are on the first socket because of the binding) unless --oversubscribe is passed, which in turn disables the binding.

Is there a way of bypassing the disable-binding-if-oversubscribing check introduced by that commit? Or can anyone think of a better way of running this program? Alternatively, they could leave it with no binding at the mpirun level and do the binding in a wrapper.

Thanks,
Ben
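
For reference, the wrapper option could look roughly like the sketch below. It is only an illustration under several assumptions: the script name bind_wrapper.py and all function names are hypothetical, the nodes run Linux (os.sched_setaffinity and the sysfs topology files are Linux-only), the two processes of each pair get consecutive node-local ranks, and ordering cores by their lowest logical CPU number gives a sensible layout on the machine in question. It reads the node-local rank from the OMPI_COMM_WORLD_LOCAL_RANK environment variable that Open MPI sets for each launched process, pins the process to both hardware threads of one core, and then execs the real program.

```python
#!/usr/bin/env python3
"""Hypothetical bind_wrapper.py: pin pairs of local ranks to one core, then exec."""
import os
import sys


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0,16' or '0-1' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def physical_cores(allowed):
    """Group the allowed logical CPUs into one sibling set per physical core."""
    cores = set()
    for cpu in allowed:
        path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
        with open(path) as f:
            cores.add(frozenset(parse_cpu_list(f.read()) & set(allowed)))
    # Assumption: ordering by lowest logical CPU id is good enough here.
    return sorted(cores, key=min)


def cpus_for_local_rank(local_rank, cores, procs_per_core=2):
    """Assumption: consecutive local ranks form a pair and share a core,
    so ranks 0 and 1 use cores[0], ranks 2 and 3 use cores[1], and so on."""
    return set(cores[(local_rank // procs_per_core) % len(cores)])


def main(argv):
    # Open MPI exports the node-local rank of each process in this variable.
    local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
    cores = physical_cores(os.sched_getaffinity(0))
    os.sched_setaffinity(0, cpus_for_local_rank(local_rank, cores))
    # Replace the wrapper with the real program; the affinity is inherited.
    os.execvp(argv[1], argv[1:])


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

It would be launched with something like "mpirun --oversubscribe -np <N> python3 bind_wrapper.py ./app <args>", where ./app stands in for the real executable. If the compute and I/O processes are not paired by consecutive local rank, cpus_for_local_rank would need to change accordingly.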