I'm afraid Jeff is on vacation until Dec 2nd, Paul, so a response will be delayed.

On Nov 22, 2013, at 10:19 AM, Paul Kapinos <kapi...@rz.rwth-aachen.de> wrote:

> Hi Jeff,
>
> On 06/19/13 15:26, Jeff Squyres (jsquyres) wrote:
> ...
>>> II.
>>> In the 1.7.x series, the 'carto' framework has been deleted:
>>> http://www.open-mpi.org/community/lists/announce/2013/04/0053.php
>>>> - Removed maffinity, paffinity, and carto frameworks (and associated
>>>> MCA params).
>>>
>>> Is there some replacement for this? Or, would Open MPI detect the NUMA
>>> structure of nodes automatically?
>>
>> Yes. OMPI uses hwloc internally now to figure this stuff out.
>>
>>> Background: Currently we're using the 'carto' framework on our kinda
>>> special 'Bull BCS' nodes. Each such node consists of 4 boards, each with
>>> its own IB card, that together form a shared memory system. Clearly,
>>> communication should go over the nearest IB interface - for this we use
>>> 'carto' now.
>>
>> It should do this automatically in the 1.7 series.
>>
>> Hmm; I see there isn't any verbose output about which devices it picks,
>> though. :-( Try this patch, and run with --mca btl_base_verbose 100 and see
>> if you see appropriate devices being mapped to appropriate processes:
>>
>> Index: mca/btl/openib/btl_openib_component.c
>> ===================================================================
>> --- mca/btl/openib/btl_openib_component.c (revision 28652)
>> +++ mca/btl/openib/btl_openib_component.c (working copy)
>> @@ -2712,6 +2712,8 @@
>>               mca_btl_openib_component.ib_num_btls <
>>               mca_btl_openib_component.ib_max_btls); i++) {
>>          if (distance != dev_sorted[i].distance) {
>> +            BTL_VERBOSE(("openib: skipping device %s; it's too far away",
>> +                         ibv_get_device_name(dev_sorted[i].ib_dev)));
>>              break;
>>          }
>
> Well, I've tried this patch on the current 1.7.3 (where the code has moved
> by some 12 lines and now begins at line 2700).
> There is no "skipping device" output at all! This is also the case when
> starting main processes with -bind-to-socket.
>
> What I see is
> >[cluster.rz.RWTH-Aachen.DE:43670] btl:usnic: found: device mlx4_1, port 1
> >[cluster.rz.RWTH-Aachen.DE:43670] btl:usnic: this is not a usnic-capable device
> >[cluster.rz.RWTH-Aachen.DE:43670] btl:usnic: found: device mlx4_0, port 1
> >[cluster.rz.RWTH-Aachen.DE:43670] btl:usnic: this is not a usnic-capable device
> .. one message block per process. It seems that the processes see both IB
> cards in the special nodes (*) but neither was disabled, or at least the
> verbosity patch did not work.
>
> Well, is there any progress on this front? Or can I activate more
> verbosity, or did I do something wrong with the patch? (See attached file.)
>
> Best!
> Paul Kapinos
>
>
> *) The nodes used for testing are also Bull BCS nodes, but consisting of
> just two boards instead of 4.
> --
> Dipl.-Inform. Paul Kapinos - High Performance Computing,
> RWTH Aachen University, Center for Computing and Communication
> Seffenter Weg 23, D 52074 Aachen (Germany)
> Tel: +49 241/80-24915
> <btl_openib_component.c>
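
For readers following the thread, here is a minimal standalone sketch of the selection logic that the quoted patch hooks into: devices are ordered by distance, the nearest ones are kept, and the loop stops at the first device that is farther away. This is not the Open MPI source. The struct dev_entry type, the select_nearest() function, and the plain printf() standing in for BTL_VERBOSE are hypothetical names made up for illustration; only the comparison against dev_sorted[i].distance and the message text come from the patch above.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for one entry of the sorted device list. */
struct dev_entry {
    const char *name;   /* device name, e.g. "mlx4_0" */
    double distance;    /* distance from the process to the device */
};

/* qsort comparator: ascending by distance. */
static int cmp_distance(const void *a, const void *b)
{
    const struct dev_entry *x = a;
    const struct dev_entry *y = b;
    return (x->distance > y->distance) - (x->distance < y->distance);
}

/*
 * Sort the devices by distance and count those that share the minimum
 * distance. When the first farther device is met, optionally report it
 * and stop, mirroring the "if (distance != ...) { ...; break; }" in the
 * quoted patch. Returns the number of devices kept.
 */
static int select_nearest(struct dev_entry *devs, int n, int verbose)
{
    int kept = 0;

    if (n <= 0) {
        return 0;
    }
    qsort(devs, (size_t)n, sizeof(devs[0]), cmp_distance);

    double distance = devs[0].distance;
    for (int i = 0; i < n; i++) {
        if (distance != devs[i].distance) {
            if (verbose) {
                printf("skipping device %s; it's too far away\n",
                       devs[i].name);
            }
            break;
        }
        kept++;
    }
    return kept;
}
```

In this sketch the "skipping device" message is printed only when some device reports a larger distance than the nearest one. If every device reports the same distance, all of them are kept and nothing is printed. Whether that is what happens on the BCS nodes is not something this thread establishes.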