Hi,

One of our users has noticed that binding is disabled in 2.0.0 when
--oversubscribe is passed, which is hurting their performance, most likely
because processes migrate between sockets. This appears to be caused by
commit 294793c (PR#1228).

They need --oversubscribe because, for some reason, the developers decided
to run two processes for each MPI task (a compute process and an I/O worker
process, I think). Since the second process in each pair is mostly idle,
there's (almost) no harm in launching two processes per core, and it's
better than leaving half the cores idle most of the time. In previous
versions they bound each pair to a core and let the hyper-threads argue over
which of the two processes to run, since this gave the best performance.

I tried creating a rankfile that binds each process to its own hardware
thread, but mpirun refuses to launch more processes than there are cores
(even if all of these processes are on the first socket because of the
binding) unless --oversubscribe is passed, which in turn disables the
binding. Is there a way of bypassing the disable-binding-if-oversubscribing
check introduced by that commit? Or can anyone think of a better way of
running this program?

Alternatively, they could leave it with no binding at the mpirun level and
do the binding in a wrapper.
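
For the wrapper option, something along these lines is roughly what I have
in mind. It is only a sketch under several assumptions: the script name
bind_wrapper.py and its helper functions are hypothetical, it is
Linux-only (it reads the sysfs topology files and calls
os.sched_setaffinity), it assumes consecutive local ranks form a
compute/I/O pair, and it relies on Open MPI exporting
OMPI_COMM_WORLD_LOCAL_RANK to each launched process. Cores are ordered by
lowest logical CPU id, which may not match socket order on every machine.

```python
#!/usr/bin/env python3
"""Hypothetical bind_wrapper.py: pin each pair of local ranks to one core."""
import glob
import os
import sys


def parse_cpu_list(text):
    """Parse a Linux cpulist string such as "0,16" or "0-3" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def physical_cores():
    """Return one frozenset of hardware-thread ids per core, sorted by lowest id."""
    cores = set()
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    for path in glob.glob(pattern):
        with open(path) as f:
            cores.add(frozenset(parse_cpu_list(f.read())))
    return sorted(cores, key=min)


def cpus_for_local_rank(local_rank, cores, procs_per_core=2):
    """Map a local rank to the hardware threads of the core its pair shares.

    Assumption: local ranks 0 and 1 are one pair, 2 and 3 the next, and so on.
    """
    core = cores[(local_rank // procs_per_core) % len(cores)]
    return set(core)


def main(argv):
    # Fall back to rank 0 if the variable is missing (e.g. run outside mpirun).
    local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    # Bind this process to both hardware threads of its core, then exec the
    # real program; the affinity mask is inherited across exec.
    os.sched_setaffinity(0, cpus_for_local_rank(local_rank, physical_cores()))
    os.execvp(argv[1], argv[1:])


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

It would be launched as something like
"mpirun --oversubscribe -np N python3 bind_wrapper.py ./app", where ./app
stands in for the real executable. Binding the pair to the whole core
rather than one hardware thread each mirrors what they did in previous
versions.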

Thanks,
Ben



_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
