Okay, just wanted to ensure everyone was working from the same base code. Terry, Brad: you might want to look this proposed change over. Something doesn't quite look right to me, but I haven't really walked through the code to check it.
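
For anyone who wants to see the failure mode without digging into the
paffinity code, here is a small standalone toy that mimics the IS_BOUND
semantics Nadia describes below. This is my own re-implementation for
illustration only -- it is not the actual OPAL_PAFFINITY_PROCESS_IS_BOUND
macro, and the helper name is made up:

========
#include <stdio.h>

/* Toy version of the "is bound" test as Nadia describes it: report
 * "bound" only when at least one cpu is selected AND fewer cpus are
 * selected than exist on the node. Hypothetical helper, not the real
 * OPAL macro. */
static int toy_is_bound(unsigned mask, int num_cpus)
{
    int nbits = 0, i;
    for (i = 0; i < num_cpus; i++) {
        if (mask & (1u << i)) {
            nbits++;
        }
    }
    return (nbits > 0) && (nbits < num_cpus);
}

int main(void)
{
    /* Type 1 node: 1 socket x 4 cores. Binding to the only socket
     * sets all 4 bits, so the test reports "not bound" -> error. */
    printf("1 socket  x 4 cores, bound to socket 0: %d\n",
           toy_is_bound(0x0Fu, 4));

    /* Type 2 node: 2 sockets x 4 cores. Binding to socket 0 sets
     * 4 of the 8 bits, so the test reports "bound" -> ok. */
    printf("2 sockets x 4 cores, bound to socket 0: %d\n",
           toy_is_bound(0x0Fu, 8));
    return 0;
}
========

If that matches the real macro's behavior, then -bind-to-socket on a
single-socket node will always trip the error path, regardless of whether
the binding itself succeeded.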
On Apr 9, 2010, at 9:33 AM, Terry Dontje wrote:

> Nadia Derbey wrote:
>>
>> On Fri, 2010-04-09 at 08:41 -0600, Ralph Castain wrote:
>>
>>> Just to check: is this with the latest trunk? Brad and Terry have
>>> been making changes to this section of code, including modifying
>>> the PROCESS_IS_BOUND test...
>>>
>>
>> Well, it was on the v1.5. But I just checked: it looks like
>> 1. the call to OPAL_PAFFINITY_PROCESS_IS_BOUND is still there in
>>    odls_default_fork_local_proc()
>> 2. OPAL_PAFFINITY_PROCESS_IS_BOUND() is defined the same way
>>
>> But I'll give it a try with the latest trunk.
>>
>> Regards,
>> Nadia
>>
> The changes I've made do not touch OPAL_PAFFINITY_PROCESS_IS_BOUND at
> all. Also, I am only touching code related to the "bind-to-core"
> option, so I really doubt my changes are causing the issues here.
>
> --td
>
>>> On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote:
>>>
>>>> Hi,
>>>>
>>>> I am facing a problem with a test that runs fine on some nodes and
>>>> fails on others.
>>>>
>>>> I have a heterogeneous cluster with 3 types of nodes:
>>>> 1) Single socket, 4 cores
>>>> 2) 2 sockets, 4 cores per socket
>>>> 3) 2 sockets, 6 cores per socket
>>>>
>>>> I am using:
>>>> . salloc to allocate the nodes,
>>>> . the mpirun binding/mapping options "-bind-to-socket -bysocket"
>>>>
>>>> # salloc -N 1 mpirun -n 4 -bind-to-socket -bysocket sleep 900
>>>>
>>>> This command fails if the allocated node is of type #1 (single
>>>> socket / 4 cores). BTW, in that case orte_show_help references a
>>>> tag ("could-not-bind-to-socket") that does not exist in
>>>> help-odls-default.txt.
>>>>
>>>> It succeeds when run on nodes of type #2 or #3. I think a "bind to
>>>> socket" should not return an error on a single-socket machine, but
>>>> rather be a noop.
>>>>
>>>> The problem comes from the test
>>>>     OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
>>>> called in odls_default_fork_local_proc() after the process has been
>>>> bound to its socket:
>>>> ========
>>>> <snip>
>>>>     OPAL_PAFFINITY_CPU_ZERO(mask);
>>>>     for (n=0; n < orte_default_num_cores_per_socket; n++) {
>>>>         <snip>
>>>>         OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
>>>>     }
>>>>     /* if we did not bind it anywhere, then that is an error */
>>>>     OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
>>>>     if (!bound) {
>>>>         orte_show_help("help-odls-default.txt",
>>>>                        "odls-default:could-not-bind-to-socket", true);
>>>>         ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
>>>>     }
>>>> ========
>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND() will return true if there are
>>>> bits set in the mask *AND* the number of bits set is less than the
>>>> number of cpus on the machine. Thus on a single-socket, 4-core
>>>> machine the test will fail, while on the other kinds of machines
>>>> it will succeed.
>>>>
>>>> Again, I think the problem could be solved by changing the
>>>> algorithm and treating ORTE_BIND_TO_SOCKET on a single-socket
>>>> machine as a noop.
>>>>
>>>> Another solution could be to call OPAL_PAFFINITY_PROCESS_IS_BOUND()
>>>> at the end of the loop only if we are bound
>>>> (orte_odls_globals.bound). Actually that is the only case where I
>>>> see a justification for this test (see attached patch).
>>>>
>>>> And maybe both solutions could be mixed.
>>>>
>>>> Regards,
>>>> Nadia
>>>>
>>>> --
>>>> Nadia Derbey <nadia.der...@bull.net>
>>>> <001_fix_process_binding_test.patch>
>>>
>
> --
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.650.633.7054
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.don...@oracle.com
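
P.S. I haven't applied Nadia's attached patch, so I don't know exactly
what it does, but from her description the second suggestion would amount
to something like the fragment below: only sanity-check the computed mask
when we are in fact operating under a binding (orte_odls_globals.bound),
and skip the error path otherwise. Treat this as an illustration of the
idea, not the actual patch:

========
    /* Sketch only -- not Nadia's actual patch. Only check the mask
     * when the daemon/proc is actually bound; on a single-socket node
     * where the socket mask covers every cpu, skip the check instead
     * of erroring out. */
    if (orte_odls_globals.bound) {
        OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
        if (!bound) {
            orte_show_help("help-odls-default.txt",
                           "odls-default:could-not-bind-to-socket", true);
            ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
        }
    }
========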