Okay, just wanted to ensure everyone was working from the same base code.

Terry, Brad: you might want to look this proposed change over. Something 
doesn't quite look right to me, but I haven't really walked through the code to 
check it.


On Apr 9, 2010, at 9:33 AM, Terry Dontje wrote:

> Nadia Derbey wrote:
>> 
>> On Fri, 2010-04-09 at 08:41 -0600, Ralph Castain wrote:
>>   
>>> Just to check: is this with the latest trunk? Brad and Terry have been 
>>> making changes to this section of code, including modifying the 
>>> PROCESS_IS_BOUND test...
>>> 
>>> 
>>>     
>> 
>> Well, it was on the v1.5. But I just checked: looks like
>>   1. the call to OPAL_PAFFINITY_PROCESS_IS_BOUND is still there in
>>      odls_default_fork_local_proc()
>>   2. OPAL_PAFFINITY_PROCESS_IS_BOUND() is defined the same way
>> 
>> But, I'll give it a try with the latest trunk.
>> 
>> Regards,
>> Nadia
>> 
>>   
> The changes I've made do not touch OPAL_PAFFINITY_PROCESS_IS_BOUND at all.
> Also, I am only touching code related to the "bind-to-core" option, so I
> really doubt that my changes are causing issues here.
> 
> --td
>>> On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote:
>>> 
>>>     
>>>> Hi,
>>>> 
>>>> I am facing a problem with a test that runs fine on some nodes, and
>>>> fails on others.
>>>> 
>>>> I have a heterogeneous cluster, with 3 types of nodes:
>>>> 1) Single socket, 4 cores
>>>> 2) 2 sockets, 4 cores per socket
>>>> 3) 2 sockets, 6 cores per socket
>>>> 
>>>> I am using:
>>>> . salloc to allocate the nodes,
>>>> . mpirun binding/mapping options "-bind-to-socket -bysocket"
>>>> 
>>>> # salloc -N 1 mpirun -n 4 -bind-to-socket -bysocket sleep 900
>>>> 
>>>> This command fails if the allocated node is of type #1 (single socket/4
>>>> cpus).
>>>> BTW, in that case orte_show_help is referencing a tag
>>>> ("could-not-bind-to-socket") that does not exist in
>>>> help-odls-default.txt.
>>>> 
>>>> It succeeds, however, when run on nodes of type #2 or #3.
>>>> I think a "bind to socket" should not return an error on a single-socket
>>>> machine, but rather be a no-op.
>>>> 
>>>> The problem comes from the test
>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
>>>> called in odls_default_fork_local_proc() after the binding to the
>>>> processor's socket has been done:
>>>> ========
>>>>    <snip>
>>>>    OPAL_PAFFINITY_CPU_ZERO(mask);
>>>>    for (n=0; n < orte_default_num_cores_per_socket; n++) {
>>>>        <snip>
>>>>        OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
>>>>    }
>>>>    /* if we did not bind it anywhere, then that is an error */
>>>>    OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
>>>>    if (!bound) {
>>>>        orte_show_help("help-odls-default.txt",
>>>>                       "odls-default:could-not-bind-to-socket", true);
>>>>        ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
>>>>    }
>>>> ========
>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND() will return true if there are bits set in
>>>> the mask *AND* the number of bits set is less than the number of cpus
>>>> on the machine. Thus on a single-socket, 4-core machine the mask has all
>>>> 4 bits set, which is not less than the 4 cpus, so the test will fail,
>>>> while on the other kinds of machines it will succeed.
>>>> 
>>>> Again, I think the problem could be solved by changing the algorithm
>>>> and treating ORTE_BIND_TO_SOCKET on a single-socket machine as a
>>>> no-op.
>>>> 
>>>> Another solution could be to call the test
>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND() at the end of the loop only if we are
>>>> bound (orte_odls_globals.bound). Actually, that is the only case where I
>>>> see a justification for this test (see attached patch).
>>>> 
>>>> And maybe both solutions could be combined.
>>>> 
>>>> Regards,
>>>> Nadia
>>>> 
>>>> 
>>>> -- 
>>>> Nadia Derbey <nadia.der...@bull.net>
>>>> <001_fix_process_binding_test.patch>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>       
> 
> 
> -- 
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.650.633.7054
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.don...@oracle.com

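For anyone who wants to reproduce the failure mode Nadia describes without a cluster at hand, here is a minimal standalone sketch in C. It is not the Open MPI code: process_is_bound(), socket_mask() and bind_to_socket_ok() are hypothetical stand-ins, the 64-bit mask and the contiguous core numbering are assumptions, and the "bound" rule is taken only from the description quoted above (some bits set, but fewer than the number of cpus). The last function illustrates the first proposed fix, treating bind-to-socket on a single-socket node as a no-op.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the check described in the thread:
 * "bound" means at least one bit is set in the mask AND fewer bits
 * are set than there are cpus on the machine. */
static bool process_is_bound(uint64_t mask, int num_cpus)
{
    int num_set = 0;
    for (int i = 0; i < 64; i++) {
        if (mask & (UINT64_C(1) << i)) {
            num_set++;
        }
    }
    return num_set > 0 && num_set < num_cpus;
}

/* Build a mask covering every core of one socket. Assumes cores are
 * numbered contiguously per socket and that everything fits in 64 bits. */
static uint64_t socket_mask(int socket, int cores_per_socket)
{
    assert(socket >= 0 && cores_per_socket > 0);
    assert((socket + 1) * cores_per_socket <= 64);

    uint64_t mask = 0;
    for (int n = 0; n < cores_per_socket; n++) {
        mask |= UINT64_C(1) << (socket * cores_per_socket + n);
    }
    return mask;
}

/* Hypothetical variant of the first proposal: on a single-socket node,
 * binding to the socket is treated as a no-op and reported as success;
 * otherwise the result of the bound check is returned. */
static bool bind_to_socket_ok(int socket, int num_sockets, int cores_per_socket)
{
    if (num_sockets == 1) {
        return true;
    }
    return process_is_bound(socket_mask(socket, cores_per_socket),
                            num_sockets * cores_per_socket);
}
```

With these definitions, process_is_bound(socket_mask(0, 4), 4) is false for the type #1 node (4 bits set out of 4 cpus), while the same call with 8 or 12 cpus is true for the type #2 and #3 nodes. bind_to_socket_ok(0, 1, 4) returns true, which is the behaviour the no-op proposal asks for.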