Attached is the output of mpirun with some of my debugging printfs.

Don
________________________________________
From: Barrett, Brian <[email protected]>
Sent: Wednesday, November 13, 2019 2:05 PM
To: Don Fry; Hefty, Sean; Byrne, John (Labs); [email protected]
Subject: Re: [ofiwg] noob questions
That likely means that something failed in initializing the OFI provider. Without seeing the debugging output John mentioned, it's really hard to say *why* it failed to initialize. There are many reasons, including not being able to conform to a bunch of assumptions that Open MPI makes about its providers.

Brian

-----Original Message-----
From: Don Fry <[email protected]>
Date: Wednesday, November 13, 2019 at 2:01 PM
To: "Barrett, Brian" <[email protected]>, "Hefty, Sean" <[email protected]>, "Byrne, John (Labs)" <[email protected]>, "[email protected]" <[email protected]>
Subject: Re: [ofiwg] noob questions

When I tried --mca pml cm, it complained that "PML cm cannot be selected". Maybe I needed to enable cm when I configured openmpi? I didn't specifically enable or disable it. It could also be that my getinfo routine doesn't have a capability set properly.

My latest command line was:

    mpirun --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include "lf;ofi_rxm" ./mpi_latency

(where lf is my provider)

Thanks for the pointers, I will do some more debugging on my end.

Don
________________________________________
From: Barrett, Brian <[email protected]>
Sent: Wednesday, November 13, 2019 12:53 PM
To: Hefty, Sean; Byrne, John (Labs); Don Fry; [email protected]
Subject: Re: [ofiwg] noob questions

You can force Open MPI to use libfabric as its transport by adding "-mca pml cm -mca mtl ofi" to the mpirun command line.

Brian

-----Original Message-----
From: ofiwg <[email protected]> on behalf of "Hefty, Sean" <[email protected]>
Date: Wednesday, November 13, 2019 at 12:52 PM
To: "Byrne, John (Labs)" <[email protected]>, Don Fry <[email protected]>, "[email protected]" <[email protected]>
Subject: Re: [ofiwg] noob questions

My guess is that OpenMPI has an internal socket transport that it is using. You likely need to force MPI to use libfabric, but I don't know enough about OMPI to do that.
Jeff (copied) likely knows the answer here, but you may need to create him a new meme for his assistance.

- Sean

> -----Original Message-----
> From: ofiwg <[email protected]> On Behalf Of Byrne, John (Labs)
> Sent: Wednesday, November 13, 2019 11:26 AM
> To: Don Fry <[email protected]>; [email protected]
> Subject: Re: [ofiwg] noob questions
>
> You only mention the dgram and msg types, and the mtl_ofi component wants rdm. If you
> don't support rdm, I would have expected your getinfo routine to return error -61. You
> can try using the ofi_rxm provider with your provider to add rdm support, replacing
> verbs in "--mca mtl_ofi_provider_include verbs;ofi_rxm" with your provider.
>
> openmpi transport selection is complex. Adding insane levels of verbosity can help you
> understand what is happening. I tend to use: --mca mtl_base_verbose 100 --mca
> btl_base_verbose 100 --mca pml_base_verbose 100
>
> John Byrne
>
> From: ofiwg [mailto:[email protected]] On Behalf Of Don Fry
> Sent: Wednesday, November 13, 2019 10:54 AM
> To: [email protected]
> Subject: [ofiwg] noob questions
>
> I have written a libfabric provider for our hardware and it passes all the fabtests I
> expect it to (dgram and msg). I am trying to run some MPI tests using libfabric under
> openmpi (4.0.2). When I run a simple ping-pong test using mpirun, it sends and receives
> the messages using the tcp/ip protocol. It does call my fi_getinfo routine, but
> doesn't use my provider send/receive routines. I have rebuilt the libfabric library
> disabling sockets, then again --disable-tcp, then --disable-udp, and fi_info reports
> fewer and fewer providers until it only lists my provider, but each time I run the mpi
> test it still uses the ip protocol to exchange messages.
>
> When I configured openmpi I specified --with-libfabric=/usr/local/ and the libfabric
> library is being loaded and executed.
>
> I am probably doing something obviously wrong, but I don't know enough about MPI or
> maybe libfabric, so I need some help. If this is the wrong list, redirect me.
>
> Any suggestions?
>
> Don

_______________________________________________
ofiwg mailing list
[email protected]
https://lists.openfabrics.org/mailman/listinfo/ofiwg
[test@rh3 ~]$ mpirun --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include "lf;ofi_rxm" --mca mtl_base_verbose 100 --mca btl_base_verbose 100 --mca pml_base_verbose 100 ./mpi_latency 2>&1 | tee -a launch.txt
[rh3:19714] mca: base: components_register: registering framework btl components
[rh3:19714] mca: base: components_register: found loaded component self
[rh3:19714] mca: base: components_register: component self register function successful
[rh3:19714] mca: base: components_register: found loaded component sm
[rh3:19714] mca: base: components_register: found loaded component tcp
[rh3:19714] mca: base: components_register: component tcp register function successful
[rh3:19714] mca: base: components_register: found loaded component vader
[rh3:19714] mca: base: components_register: component vader register function successful
[rh3:19714] mca: base: components_open: opening btl components
[rh3:19714] mca: base: components_open: found loaded component self
[rh3:19714] mca: base: components_open: component self open function successful
[rh3:19714] mca: base: components_open: found loaded component tcp
[rh3:19714] mca: base: components_open: component tcp open function successful
[rh3:19714] mca: base: components_open: found loaded component vader
[rh3:19714] mca: base: components_open: component vader open function successful
[rh3:19714] select: initializing btl component self
[rh3:19714] select: init of component self returned success
[rh3:19714] select: initializing btl component tcp
[rh3:19714] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[rh3:19714] btl: tcp: Found match: 127.0.0.1 (lo)
[rh3:19714] btl:tcp: Attempting to bind to AF_INET port 1024
[rh3:19714] btl:tcp: Successfully bound to AF_INET port 1024
[rh3:19714] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[rh3:19714] btl:tcp: examining interface me1
[rh3:19714] btl:tcp: using ipv6 interface me1
[rh3:19714] btl:tcp: examining interface lf0
[rh3:19714] btl:tcp: using ipv6 interface lf0
[rh3:19714] select: init of component tcp returned success
[rh3:19714] select: initializing btl component vader
[rh3:19714] select: init of component vader returned failure
[rh3:19714] mca: base: close: component vader closed
[rh3:19714] mca: base: close: unloading component vader
[rh3:19714] mca: base: components_register: registering framework pml components
[rh3:19714] mca: base: components_register: found loaded component cm
[rh3:19714] mca: base: components_register: component cm register function successful
[rh3:19714] mca: base: components_open: opening pml components
[rh3:19714] mca: base: components_open: found loaded component cm
[rh3:19714] mca: base: components_register: registering framework mtl components
[rh3:19714] mca: base: components_register: found loaded component ofi
[rh3:19714] mca: base: components_register: component ofi register function successful
[rh3:19714] mca: base: components_open: opening mtl components
[rh3:19714] mca: base: components_open: found loaded component ofi
[rh3:19714] mca: base: components_open: component ofi open function successful
[rh3:19714] mca: base: components_open: component cm open function successful
[rh3:19714] select: initializing pml component cm
[rh3:19714] mca:base:select: Auto-selecting mtl components
[rh3:19714] mca:base:select:( mtl) Querying component [ofi]
[rh3:19714] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[rh3:19714] mca:base:select:( mtl) Selected component [ofi]
[rh3:19714] select: initializing mtl component ofi
checking info in util_getinfo shm
checking info in util_getinfo lf
checking info in util_getinfo lf
checking info in util_getinfo shm
checking info in util_getinfo lf
checking info in util_getinfo lf
checking info in util_getinfo shm
ofi_alter_info returned *info=0xa38ba0
exiting util_getinfo
checking info in util_getinfo lf
checking info in util_getinfo lf
[rh3:19714] mtl_ofi_component.c:315: mtl:ofi:provider_include = "lf;ofi_rxm"
[rh3:19714] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[rh3:19714] mtl_ofi_component.c:326: mtl:ofi: "shm" not in include list
[rh3:19714] mtl_ofi_component.c:347: mtl:ofi:prov: none
[rh3:19714] mtl_ofi_component.c:541: select_ofi_provider: no provider found
[rh3:19714] select: init returned failure for component ofi
[rh3:19714] select: no component selected
[rh3:19714] select: init returned failure for component cm
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      rh3
  Framework: pml
--------------------------------------------------------------------------
[rh3:19714] PML cm cannot be selected
[rh4:30969] mca: base: components_register: registering framework btl components
[rh4:30969] mca: base: components_register: found loaded component self
[rh4:30969] mca: base: components_register: component self register function successful
[rh4:30969] mca: base: components_register: found loaded component sm
[rh4:30969] mca: base: components_register: found loaded component tcp
[rh4:30969] mca: base: components_register: component tcp register function successful
[rh4:30969] mca: base: components_register: found loaded component vader
[rh4:30969] mca: base: components_register: component vader register function successful
[rh4:30969] mca: base: components_open: opening btl components
[rh4:30969] mca: base: components_open: found loaded component self
[rh4:30969] mca: base: components_open: component self open function successful
[rh4:30969] mca: base: components_open: found loaded component tcp
[rh4:30969] mca: base: components_open: component tcp open function successful
[rh4:30969] mca: base: components_open: found loaded component vader
[rh4:30969] mca: base: components_open: component vader open function successful
[rh4:30969] select: initializing btl component self
[rh4:30969] select: init of component self returned success
[rh4:30969] select: initializing btl component tcp
[rh4:30969] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[rh4:30969] btl: tcp: Found match: 127.0.0.1 (lo)
[rh4:30969] btl:tcp: Attempting to bind to AF_INET port 1024
[rh4:30969] btl:tcp: Successfully bound to AF_INET port 1024
[rh4:30969] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[rh4:30969] btl:tcp: examining interface me1
[rh4:30969] btl:tcp: using ipv6 interface me1
[rh4:30969] btl:tcp: examining interface lf0
[rh4:30969] btl:tcp: using ipv6 interface lf0
[rh4:30969] select: init of component tcp returned success
[rh4:30969] select: initializing btl component vader
[rh4:30969] select: init of component vader returned failure
[rh4:30969] mca: base: close: component vader closed
[rh4:30969] mca: base: close: unloading component vader
[rh4:30969] mca: base: components_register: registering framework pml components
[rh4:30969] mca: base: components_register: found loaded component cm
[rh4:30969] mca: base: components_register: component cm register function successful
[rh4:30969] mca: base: components_open: opening pml components
[rh4:30969] mca: base: components_open: found loaded component cm
[rh4:30969] mca: base: components_register: registering framework mtl components
[rh4:30969] mca: base: components_register: found loaded component ofi
[rh4:30969] mca: base: components_register: component ofi register function successful
[rh4:30969] mca: base: components_open: opening mtl components
[rh4:30969] mca: base: components_open: found loaded component ofi
[rh4:30969] mca: base: components_open: component ofi open function successful
[rh4:30969] mca: base: components_open: component cm open function successful
[rh4:30969] select: initializing pml component cm
[rh4:30969] mca:base:select: Auto-selecting mtl components
[rh4:30969] mca:base:select:( mtl) Querying component [ofi]
[rh4:30969] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[rh4:30969] mca:base:select:( mtl) Selected component [ofi]
[rh4:30969] select: initializing mtl component ofi
[rh4:30969] mtl_ofi_component.c:315: mtl:ofi:provider_include = "lf;ofi_rxm"
[rh4:30969] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[rh4:30969] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[rh4:30969] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[rh4:30969] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[rh4:30969] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[rh4:30969] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[rh4:30969] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[rh4:30969] mtl_ofi_component.c:326: mtl:ofi: "shm" not in include list
[rh4:30969] mtl_ofi_component.c:347: mtl:ofi:prov: none
[rh4:30969] mtl_ofi_component.c:541: select_ofi_provider: no provider found
[rh4:30969] select: init returned failure for component ofi
[rh4:30969] select: no component selected
[rh4:30969] select: init returned failure for component cm
[rh4:30969] PML cm cannot be selected
[rh3:19703] 1 more process has sent help message help-mca-base.txt / find-available:none found
[rh3:19703] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
