Ralph Castain wrote:
Okay, just wanted to ensure everyone was working from the same base code.
Terry, Brad: you might want to look this proposed change over.
Something doesn't quite look right to me, but I haven't really walked
through the code to check it.
At first blush I don't really get the usage of orte_odls_globals.bound
in your patch. It would seem to me that the insertion of that
conditional would prevent the check it surrounds from being done when the
process has not been bound prior to startup, which is a common case.
--td
On Apr 9, 2010, at 9:33 AM, Terry Dontje wrote:
Nadia Derbey wrote:
On Fri, 2010-04-09 at 08:41 -0600, Ralph Castain wrote:
Just to check: is this with the latest trunk? Brad and Terry have been making
changes to this section of code, including modifying the PROCESS_IS_BOUND
test...
Well, it was on the v1.5. But I just checked: looks like
1. the call to OPAL_PAFFINITY_PROCESS_IS_BOUND is still there in
odls_default_fork_local_proc()
2. OPAL_PAFFINITY_PROCESS_IS_BOUND() is defined the same way
But, I'll give it a try with the latest trunk.
Regards,
Nadia
The changes I've made do not touch OPAL_PAFFINITY_PROCESS_IS_BOUND
at all. Also, I am only touching code related to the "bind-to-core"
option, so I really doubt that my changes are causing issues here.
--td
On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote:
Hi,
I am facing a problem with a test that runs fine on some nodes, and
fails on others.
I have a heterogeneous cluster, with 3 types of nodes:
1) Single socket, 4 cores
2) 2 sockets, 4 cores per socket
3) 2 sockets, 6 cores per socket
I am using:
. salloc to allocate the nodes,
. mpirun binding/mapping options "-bind-to-socket -bysocket"
# salloc -N 1 mpirun -n 4 -bind-to-socket -bysocket sleep 900
This command fails if the allocated node is of type #1 (single socket/4
cpus).
BTW, in that case orte_show_help is referencing a tag
("could-not-bind-to-socket") that does not exist in
help-odls-default.txt.
The command succeeds, however, when run on nodes of type #2 or #3.
I think a "bind to socket" should not return an error on a single socket
machine, but rather be a noop.
The problem comes from the test
OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
called in odls_default_fork_local_proc() after the binding to the
processor's socket has been done:
========
<snip>
OPAL_PAFFINITY_CPU_ZERO(mask);
for (n=0; n < orte_default_num_cores_per_socket; n++) {
    <snip>
    OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
}
/* if we did not bind it anywhere, then that is an error */
OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
if (!bound) {
    orte_show_help("help-odls-default.txt",
                   "odls-default:could-not-bind-to-socket", true);
    ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
}
========
OPAL_PAFFINITY_PROCESS_IS_BOUND() will return true if there are bits set in
the mask *AND* the number of bits set is less than the number of cpus
on the machine. Thus on a single socket, 4 cores machine, where the mask
covers all 4 cpus, the test will fail. On the other kinds of machines it
will succeed.
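
To make that behaviour concrete, here is a minimal sketch in plain C that models the test as described above. It is not the actual macro: the mask type `cpu_mask_t` and the functions `count_set_cpus()` and `model_is_bound()` are hypothetical names used only for illustration, and the mask is assumed to fit in one unsigned long.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the affinity mask: one bit per cpu. */
typedef unsigned long cpu_mask_t;

/* Count the bits set in the mask among the first num_cpus cpus. */
static size_t count_set_cpus(cpu_mask_t mask, size_t num_cpus)
{
    size_t count = 0;

    /* Assumption of this sketch: all cpus fit in a single word. */
    assert(num_cpus <= sizeof(cpu_mask_t) * CHAR_BIT);

    for (size_t i = 0; i < num_cpus; i++) {
        if (mask & ((cpu_mask_t)1 << i)) {
            count++;
        }
    }
    return count;
}

/* "Bound" as described above: some bits set, but fewer than all cpus. */
static bool model_is_bound(cpu_mask_t mask, size_t num_cpus)
{
    size_t set = count_set_cpus(mask, num_cpus);
    return set > 0 && set < num_cpus;
}
```

With a mask of 0xF, `model_is_bound()` returns false for 4 cpus (node type #1) and true for 8 cpus (node type #2), which matches the failure and success cases reported here.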
Again, I think the problem could be solved by changing the algorithm,
and assuming that ORTE_BIND_TO_SOCKET, on a single socket machine =
noop.
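
A rough sketch of that first idea, under stated assumptions: the enum `bind_result_t` and the function `classify_socket_binding()` are hypothetical and are not ORTE symbols, and the socket and cpu counts are passed in directly rather than queried from the machine. A mask covering every cpu is treated as a no-op only when there is a single socket.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome codes for this sketch; not ORTE constants. */
typedef enum { BIND_APPLIED, BIND_NOOP, BIND_ERROR } bind_result_t;

/*
 * Classify the result of building a socket mask.
 *   cpus_in_mask: number of cpus set in the mask
 *   num_cpus:     number of cpus on the machine
 *   num_sockets:  number of sockets on the machine
 */
static bind_result_t classify_socket_binding(size_t cpus_in_mask,
                                             size_t num_cpus,
                                             size_t num_sockets)
{
    assert(cpus_in_mask <= num_cpus);

    if (cpus_in_mask == 0) {
        return BIND_ERROR;      /* nothing was set: still an error */
    }
    if (cpus_in_mask < num_cpus) {
        return BIND_APPLIED;    /* a real restriction was applied */
    }
    /* The mask covers the whole machine. */
    return (num_sockets == 1) ? BIND_NOOP : BIND_ERROR;
}
```

On a single socket, 4 cores node this yields BIND_NOOP instead of a fatal error, while the 2-socket nodes still report BIND_APPLIED.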
Another solution could be to call the test
OPAL_PAFFINITY_PROCESS_IS_BOUND() at the end of the loop only if we are
bound (orte_odls_globals.bound). Actually that is the only case where I
see a justification for this test (see attached patch).
And maybe both solutions could be combined.
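
The second alternative could look roughly like the sketch below. This is not the attached patch: `binding_is_error()` is a hypothetical function, and its first parameter stands in for orte_odls_globals.bound, assumed here to mean that the process was already bound before launch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if the binding should be reported as an error.
 *   was_bound_before_launch: stand-in for orte_odls_globals.bound
 *   cpus_in_mask:            number of cpus set in the mask
 *   num_cpus:                number of cpus on the machine
 */
static bool binding_is_error(bool was_bound_before_launch,
                             size_t cpus_in_mask,
                             size_t num_cpus)
{
    assert(cpus_in_mask <= num_cpus);

    if (!was_bound_before_launch) {
        return false;   /* the check is skipped entirely */
    }
    /* Same condition as the existing test: some bits, but not all. */
    bool bound = cpus_in_mask > 0 && cpus_in_mask < num_cpus;
    return !bound;
}
```

With the flag unset, the single socket, 4 cores case no longer fails. Note that in this sketch the check then never runs, so an empty mask would also pass unnoticed.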
Regards,
Nadia
--
Nadia Derbey <nadia.der...@bull.net>
<001_fix_process_binding_test.patch>
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.650.633.7054
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com <mailto:terry.don...@oracle.com>