We managed to get access to a large cluster with 1000+ nodes and 10,000+ cores for testing/benchmarking. I am happy to say that uDAPL successfully scaled out to more than 14,000 cores. However, when running Intel MPI over uDAPL (OFED 1.2.5, mlx4 DDR) we discovered that the uDAPL rdma_cm provider would not scale beyond 256 nodes, so we had to move back to a socket CM provider to set up the QPs. This patch set brings back socket CM (slightly redesigned) with some fixes and cleanup.
For the record, the basic cause of the rdma_cm scaling problems was path record queries. Until there is consensus on an IB path record caching solution that scales and is merged upstream, I am recommending that uDAPL IB consumers needing large scale-out use the socket CM provider (libdaplscm.so) in lieu of rdma_cm (libdaplcma.so). iWARP support will remain via the uDAPL rdma_cm provider. -arlin
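For anyone wanting to try this: provider selection happens through the dat.conf registry file. A minimal sketch of what the entries might look like, modeled on OFED 1.2-era dat.conf lines (the library paths, interface name "ib0", and IA names here are illustrative, not taken from the patch set itself):

```
# Illustrative dat.conf entries -- adjust paths/interface for your install.
# rdma_cm provider (path record queries; scaling limit seen beyond 256 nodes):
OpenIB-cma u1.2 nonthreaded default /usr/lib64/libdaplcma.so dapl.1.2 "ib0 0" ""
# socket CM provider (QP setup over sockets; used for the 14,000-core run):
OpenIB-scm u1.2 nonthreaded default /usr/lib64/libdaplscm.so dapl.1.2 "ib0 0" ""
```

An MPI would then be pointed at the desired IA name (e.g. OpenIB-scm) through its DAPL device/provider setting.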
