Sean Hefty wrote:
Well then the rdma-cm needs to know which devices support hw loopback.
Cuz on a T3-only system, no hwloop...
The problem sounds like it's more than just whether 127.0.0.1 is usable. That
check may fix openmpi, but it sounds more like the app needs to know whether the
device can actually support loopback, regardless of what addresses are used. Is
this correct?
What would openmpi do if there were two addresses assigned to the T3 device?
It would use them and might even create two connections.
Does openmpi simply bypass RDMA for all connections on the local machine?
OpenMPI can be run to use hw loopback if it's available. For T3
clusters, OMPI is run in a mode to use shared memory for intra-node
communications.
Basically, I'm not sure that this is *just* an rdma_cm issue. Although it
definitely appears that some sort of change needs to be made to the rdma_cm.
I think the OpenMPI rdmacm code needs to skip 127.0.0.1, in this
particular case. Prior to ofed-1.5.1, however, the bind would fail and
thus OpenMPI would not advertise 127.0.0.1 to its peer. I will work to
get that change done.
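
For illustration only, the kind of check I have in mind might look roughly
like the sketch below. This is not the actual OpenMPI code: the function
names addr_is_loopback() and ipv4_str_is_loopback() are made up for this
example, and treating all of 127.0.0.0/8 (and ::1) as loopback, rather than
only 127.0.0.1, is my assumption.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>

/* Hypothetical helper (not an OpenMPI or librdmacm function).
 * Returns 1 if sa is an IPv4 address in 127.0.0.0/8 or the IPv6
 * loopback address ::1, and 0 otherwise. */
int addr_is_loopback(const struct sockaddr *sa)
{
    if (sa->sa_family == AF_INET) {
        const struct sockaddr_in *sin = (const struct sockaddr_in *)sa;
        uint32_t host = ntohl(sin->sin_addr.s_addr);

        /* The top octet of any 127.0.0.0/8 address is 127. */
        return (host >> 24) == 127;
    }
    if (sa->sa_family == AF_INET6) {
        const struct sockaddr_in6 *sin6 = (const struct sockaddr_in6 *)sa;

        return IN6_IS_ADDR_LOOPBACK(&sin6->sin6_addr) ? 1 : 0;
    }
    return 0;
}

/* Hypothetical convenience wrapper for dotted-quad strings.
 * Returns 1 for loopback, 0 for non-loopback, -1 if s does not parse. */
int ipv4_str_is_loopback(const char *s)
{
    struct sockaddr_in sin = {0};

    sin.sin_family = AF_INET;
    if (inet_pton(AF_INET, s, &sin.sin_addr) != 1)
        return -1;
    return addr_is_loopback((const struct sockaddr *)&sin);
}
```

The caller would simply skip any local address for which
addr_is_loopback() returns 1 before advertising it to the peer.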
But let's also add a device attribute so the rdmacm can know if a device
supports loopback. Clearly, if the rdma-cm allows binds to T3,
loopback connections will fail at connect time.
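
A rough sketch of the idea, just to make it concrete. Everything here is
hypothetical: the flag EXAMPLE_DEV_CAP_HW_LOOPBACK, the struct
example_dev_attr, and example_check_bind() do not exist anywhere today, and
refusing the bind outright is only one possible policy.

```c
#include <stdint.h>

/* Hypothetical capability bit; no such flag exists today. */
#define EXAMPLE_DEV_CAP_HW_LOOPBACK (1u << 0)

/* Hypothetical, stripped-down stand-in for a device attribute struct. */
struct example_dev_attr {
    uint32_t cap_flags;
};

/* Returns 0 if the bind may proceed, -1 if it should be refused because
 * the address is a loopback address and the device cannot do hardware
 * loopback. */
int example_check_bind(const struct example_dev_attr *attr,
                       int addr_is_loopback)
{
    if (addr_is_loopback &&
        !(attr->cap_flags & EXAMPLE_DEV_CAP_HW_LOOPBACK))
        return -1;
    return 0;
}
```

With something like this, a loopback bind on a device without the flag
would fail up front rather than at connect time.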
Hey Roland, are you ok with a device attribute to indicate hw-loopback
support?
Steve.