Thanks for the information -- I have some follow-on comments inline below.

On Wed, Jul 8, 2009 at 2:37 AM, Sean Hefty <sean.he...@intel.com> wrote:
> > We are trying to use OpenMPI 1.3.2 with rdma_cm support on an Infiniband
> > fabric using OFED 1.4.1.  When the MPI jobs get large enough, the event
> > response to rdma_resolve_route becomes RDMA_CM_EVENT_ROUTE_ERROR with a
> > status of ETIMEDOUT.
>
> Yep - you pretty much need to connect out of band with all large MPI jobs
> using made up path data, or enable some sort of PR caching.

I should have mentioned that the fabric is a large torus using LASH routing,
so we need the live SL value to make deadlock-free connections. We are
definitely thinking about PR caching, but that raises issues about how to
manage the lifetime of the cache entries.

> > It seems pretty clear that the SA path record requests are being
> > synchronized and bunching together, in the end exhausting the resources
> > of the subnet manager node so that only the first N are actually
> > received.
>
> In our testing, we discovered that the SA almost never dropped any
> queries.  The problem was that the backlog grew so huge that all requests
> had timed out before they could be acted on.  There's probably something
> that could be done here to avoid storing received MADs for extended
> periods of time.

This is encouraging. I did try testing with 10,000 ms timeouts and still got
the failure with only 800 different processes, so I jumped to the conclusion
that the queries were being dropped. Do you have a guess at a timeout value
that would always succeed?

Also, your testing suggests that the receive queue almost never gets
exhausted. As I understand it, if the queue ends up empty the HCA can drop
packets at great speed. How does the system cope with a potential stream of
requests arriving less than half a microsecond apart? (I should have
mentioned that the fabric is QDR.) I guess this is another way of asking my
question about how to maximize the subnet manager node's ability to accept
requests.
> > The sequence seems to be:
> >
> > a call to librdmacm-1.0.8/src/cma.c's rdma_resolve_route,
> >
> > which translates directly into a kernel call into
> > infiniband/core/cma.c's rdma_resolve_route,
> >
> > which on an IB fabric becomes a call into cma_resolve_ib_route,
> >
> > which leads to a call to cma_query_ib_route,
> >
> > which gets to calling infiniband/core/sa_query.c's ib_sa_path_rec_get,
> > with the callback pointing to cma_query_handler.
> >
> > When cma_query_handler gets a callback with a bad status, it sets the
> > returned event to RDMA_CM_EVENT_ROUTE_ERROR.
> >
> > Nowhere in there do I see any retry attempts.  If the SA path record
> > query packet, or its response packet, gets lost, the timeout eventually
> > happens and we see RDMA_CM_EVENT_ROUTE_ERROR with a status of ETIMEDOUT.
>
> The kernel sa_query module does not issue retries.  All retries are the
> responsibility of the caller.  This gives greater flexibility in how
> timeouts are handled, but has the drawback that all 'retries' are really
> new transactions.
>
> > First question: Did I miss a retry buried somewhere in all of that?
>
> I don't believe so.

Thanks for the confirmation. Several people have told me that the retry is
in there somewhere, and I couldn't find it.

> > Second question: How does somebody come up with a timeout value that
> > makes sense?  Assuming retries are the responsibility of the
> > rdma_resolve_route caller, you would like to have a value that is long
> > enough to avoid false timeouts when a response is eventually going to
> > make it, but not any longer.  This value seems like it would depend on
> > the fabric and the capabilities of the node running the subnet manager,
> > and should be a fabric-specific parameter instead of something chosen at
> > random by each caller of rdma_resolve_route.
>
> The timeout is also dependent on the load hitting the SA.  I don't know
> that a fabric-specific parameter can work.

Maybe I should have come up with a better name.
By fabric-specific, I meant a specific implementation of the fabric,
including the capability of the subnet manager node. How does somebody
writing rdma_cm code come up with a number? That particular program might
not put much of a load on the SA, but could run concurrently with other
jobs that do (or don't). It would be nice to have a way to set up the retry
mechanism so that it works on any system it runs on.

> - Sean
_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general