I'm afraid Jeff is on vacation until Dec 2nd, Paul, so a response will be delayed.


On Nov 22, 2013, at 10:19 AM, Paul Kapinos <kapi...@rz.rwth-aachen.de> wrote:

> Hi Jeff,
> 
> On 06/19/13 15:26, Jeff Squyres (jsquyres) wrote:
> ...
>>> II.
>>> In the 1.7.x series, the 'carto' framework has been deleted:
>>> http://www.open-mpi.org/community/lists/announce/2013/04/0053.php
>>>> - Removed maffinity, paffinity, and carto frameworks (and associated
>>>>   MCA params).
>>> 
>>> Is there some replacement for this? Or, would Open MPI detect the NUMA 
>>> structure of nodes automatically?
>> 
>> Yes.  OMPI uses hwloc internally now to figure this stuff out.
>> 
>>> Background: Currently we're using the 'carto' framework on our kinda 
>>> special 'Bull BCS' nodes. Each such node consists of 4 boards, each with 
>>> its own IB card, that together form a shared-memory system. Clearly, 
>>> communication should go over the nearest IB interface - for this we use 
>>> 'carto' now.
>> 
>> It should do this automatically in the 1.7 series.
>> 
>> Hmm; I see there isn't any verbose output about which devices it picks, 
>> though. :-(  Try this patch, and run with --mca btl_base_verbose 100 and see 
>> if you see appropriate devices being mapped to appropriate processes:
>> 
>> Index: mca/btl/openib/btl_openib_component.c
>> ===================================================================
>> --- mca/btl/openib/btl_openib_component.c    (revision 28652)
>> +++ mca/btl/openib/btl_openib_component.c    (working copy)
>> @@ -2712,6 +2712,8 @@
>>                  mca_btl_openib_component.ib_num_btls <
>>                  mca_btl_openib_component.ib_max_btls); i++) {
>>          if (distance != dev_sorted[i].distance) {
>> +            BTL_VERBOSE(("openib: skipping device %s; it's too far away",
>> +                         ibv_get_device_name(dev_sorted[i].ib_dev)));
>>              break;
>>          }
> 
> Well, I've tried this patch on the actual 1.7.3 (where the code has moved by 
> some 12 lines - beginning at 2700).
> !! - no "skipping device" output! The same holds when starting main 
> processes with -bind-to-socket. What I see is
> >[cluster.rz.RWTH-Aachen.DE:43670] btl:usnic: found: device mlx4_1, port 1
> >[cluster.rz.RWTH-Aachen.DE:43670] btl:usnic: this is not a usnic-capable device
> >[cluster.rz.RWTH-Aachen.DE:43670] btl:usnic: found: device mlx4_0, port 1
> >[cluster.rz.RWTH-Aachen.DE:43670] btl:usnic: this is not a usnic-capable device
> .. one message block per process. It seems that the processes see both IB 
> cards in the special nodes(*) but neither was disabled, or at least the 
> verbosity patch did not work.
> 
> Well, is there any progress on this front? Or, can I activate more 
> verbosity / what did I do wrong with the patch? (see attached file)
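
For reference, the condition under which the patched message can appear is sketched below as standalone C. This is only an illustration of the loop shown in the diff, not the actual Open MPI code: struct dev_entry and count_nearest() are hypothetical names, and it assumes that the reference distance is the one of the first (nearest) entry in the sorted list. Under that assumption, the message is printed only if a device is reported at a larger distance than the nearest one; if both HCAs are reported at the same distance, nothing is skipped and nothing is printed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for one entry of dev_sorted[] in the patch. */
struct dev_entry {
    const char *name;  /* device name, e.g. "mlx4_0" */
    int distance;      /* reported distance from the process to the device */
};

/*
 * Walk a list that is already sorted by ascending distance and count the
 * devices that share the distance of the first entry. The walk stops at
 * the first farther device (this is where the patched message fires) or
 * once max_btls devices have been accepted (no message in that case).
 * If first_skipped is non-NULL, it receives the name of the device that
 * ended the walk, or NULL if no device was skipped.
 */
size_t count_nearest(const struct dev_entry *devs, size_t n, size_t max_btls,
                     const char **first_skipped)
{
    size_t used = 0;

    assert(devs != NULL || n == 0);
    if (first_skipped != NULL) {
        *first_skipped = NULL;
    }
    if (n == 0) {
        return 0;
    }

    /* Assumption: the nearest device defines the reference distance. */
    int distance = devs[0].distance;

    for (size_t i = 0; i < n && used < max_btls; i++) {
        if (distance != devs[i].distance) {
            fprintf(stderr, "skipping device %s; it's too far away\n",
                    devs[i].name);
            if (first_skipped != NULL) {
                *first_skipped = devs[i].name;
            }
            break;
        }
        used++;
    }
    return used;
}
```

With two devices at equal distance the function accepts both and stays silent; with one near and one far device it accepts the first and reports the second. So a silent run does not by itself show that the patch was misapplied.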
> 
> Best!
> Paul Kapinos
> 
> 
> *) the nodes used for testing are also Bull BCS nodes but consisting of just 
> two boards instead of 4
> -- 
> Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
> RWTH Aachen University, Center for Computing and Communication
> Seffenter Weg 23,  D 52074  Aachen (Germany)
> Tel: +49 241/80-24915
> <btl_openib_component.c>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
