Thank you all for your answers, suggestions, and explanations.

Don
________________________________________
From: Jeff Squyres (jsquyres) <[email protected]>
Sent: Wednesday, November 13, 2019 4:48 PM
To: Don Fry
Cc: Hefty, Sean; James Swaro; Barrett, Brian; Byrne, John (Labs); 
[email protected]
Subject: Re: [ofiwg] noob questions

Just to clarify the Open MPI behavior for you...

PML = point-to-point messaging layer. PML plugins affect MPI point-to-point 
calls such as MPI_Send and MPI_Recv.

Open MPI has a few different PMLs. One of them is CM, which is a super thin 
layer of high-level translation glue between Open MPI and back-end APIs that can 
handle network-level matching (such as various Libfabric providers). The glue to 
those back-end APIs is implemented in MTL (matching transport layer) plugins: 
MTLs are the low-level translation glue to the back-end matching-capable network 
API. Open MPI has an OFI MTL that makes Libfabric API calls.

Hence: MPI_Send -> CM PML -> OFI MTL -> Libfabric.

The OFI MTL expects to be able to use Libfabric RDM endpoints (you can see all 
the attributes it asked for in the log).
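To make that concrete, here is a rough standalone sketch (not Open MPI's actual 
code; the exact capabilities and attributes the MTL requests differ by version) 
of the kind of fi_getinfo() query involved: ask Libfabric for an RDM endpoint 
with tagged-message support and see which providers can satisfy it.

    /* Sketch: query Libfabric for an RDM endpoint with tag matching,
     * roughly what the OFI MTL needs.  Not Open MPI's actual hint list. */
    #include <stdio.h>
    #include <rdma/fabric.h>
    #include <rdma/fi_errno.h>

    int main(void)
    {
        struct fi_info *hints = fi_allocinfo();
        struct fi_info *info = NULL;

        hints->ep_attr->type = FI_EP_RDM;             /* reliable datagram endpoint */
        hints->caps = FI_TAGGED | FI_SEND | FI_RECV;  /* tag matching + send/recv */

        int ret = fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info);
        if (ret) {
            /* No provider (native or rxd/rxm-assisted) satisfied the request:
             * roughly the situation that makes the OFI MTL disable itself. */
            fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
        } else {
            for (struct fi_info *p = info; p; p = p->next)
                printf("matching provider: %s\n", p->fabric_attr->prov_name);
            fi_freeinfo(info);
        }
        fi_freeinfo(hints);
        return ret ? 1 : 0;
    }

If a query like that comes back empty on your system, the MTL has nothing to 
work with; the fi_info utility that ships with Libfabric will show you the same 
information without writing any code.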

You can pass CLI options to mpirun to control which PML and MTL plugins are used 
(vs. letting Open MPI auto-select). You can also pass in a variety of run-time 
params to each of those plugins.
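For example (the exact spelling of these options varies a bit across Open MPI 
versions; this is roughly the 3.x/4.x form):

    mpirun --mca pml cm --mca mtl ofi -np 2 ./your_app

forces the CM PML and the OFI MTL rather than letting the selection logic pick, 
and MCA params such as mtl_ofi_provider_include let you restrict which Libfabric 
provider the OFI MTL will consider.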

If you tell Open MPI to only use CM and the OFI MTL, but the OFI MTL fails to 
enable itself because it can't find a provider that has all the requirements 
it's looking for, that will result in CM disabling itself (because it can't 
find an appropriate MTL to use). Then you get the error message you saw: peers 
can't reach each other.

If you let Open MPI auto-select, CM will silently fail, a different PML may be 
selected, and some other network stack will be used. In your case, the OB1 PML 
is likely selected and the TCP BTL is used (which does not use Libfabric at 
all).
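If you want to confirm what actually got picked, cranking up the selection 
verbosity is usually enough, e.g. adding --mca pml_base_verbose 100 --mca 
mtl_base_verbose 100 to the mpirun line will make Open MPI print which PML/MTL 
components were considered and why each one was accepted or rejected (the exact 
output format varies by version).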

I.e., the root of your problem is what others have stated on this thread: your 
Libfabric provider does not match what the OFI MTL is asking for, and things 
go downhill from there. You can use mpirun CLI options (noted elsewhere on 
this thread) to tell the OFI MTL to use RXD and/or RXM with your provider behind 
it. Then the OFI MTL will enable itself, the CM PML will enable itself, and 
you're off to the races.
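For example (treat this as a sketch: "prov" is a placeholder for your actual 
provider's name, and the exact option spellings vary across Open MPI versions):

    mpirun --mca pml cm --mca mtl ofi \
           --mca mtl_ofi_provider_include "prov;ofi_rxm" -np 2 ./your_app

asks the OFI MTL to use RxM layered over your provider's MSG endpoints; 
"prov;ofi_rxd" would layer RxD over its DGRAM endpoints instead. Libfabric's 
FI_PROVIDER environment variable can be used to do similar provider filtering 
on the Libfabric side.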

Now you know what / why. :)

Sent from my phone. No type good.

> On Nov 13, 2019, at 6:17 PM, Don Fry <[email protected]> wrote:
>
> I am still trying to crawl, not run a marathon yet.
>
> Don
>
> ________________________________________
> From: Hefty, Sean <[email protected]>
> Sent: Wednesday, November 13, 2019 3:15 PM
> To: Don Fry; James Swaro; Barrett, Brian; Byrne, John (Labs); 
> [email protected]
> Subject: RE: [ofiwg] noob questions
>
>> It will need ofi_rxd and/or ofi_rxm since it supports both DGRAM and MSG.
>
> RxM will need RMA from the msg ep for MPI.
>
> RxD should be okay with just send and receive.  So I would try to get OMPI 
> running with that, but I wouldn’t expect great performance.
>
> - Sean
_______________________________________________
ofiwg mailing list
[email protected]
https://lists.openfabrics.org/mailman/listinfo/ofiwg
