How do I ensure that local_sa_cache is enabled? I have tried all the other suggestions but I am still getting the error.

Mahmoud Hanafi
Sr. System Administrator
CSC HPC COE
Bld. 676
2435 Fifth Street
WPAFB, Ohio 45433
(937) 255-1536
Computer Sciences Corporation

Arlin Davis <[EMAIL PROTECTED]>
02/01/2008 05:45 PM

To       "Woodruff, Robert J" <[EMAIL PROTECTED]>
cc       Mahmoud Hanafi/DEF/[EMAIL PROTECTED], [EMAIL PROTECTED], [email protected]
Subject  Re: [ofa-general] ofed1.2.5rc2 and intel mpi error

> This could be related to connection timeouts. We have seen this
> on larger clusters when the local sa cache is not enabled or if the SM
> node is down. I think that the local_sa_cache defaults to not enabled,
> but Arlin can confirm this.
>
> woody
>

That is true, OFED 1.2.5 disables SA caching by default. I would recommend enabling SA caching.

When using rdma_cm to establish end-to-end connections we incur a 3 step process, each step with various tunable knobs: ARP, Path Resolution, and CM req/reply. Any one of these could cause the 4008 timeout error. Here are tunable parameters that may help:

1. ARP:

   The reachable time for ARP cache entries on ib0 can be increased from the default of 30 seconds:

      sysctl -w net.ipv4.neigh.ib0.base_reachable_time=14400

2. PATH RESOLUTION:

   ib_sa.ko provides path record caching: no timer controls, auto refresh with new device notification events from SM/SA, manual refresh control for administrators. Default == SA caching is OFF.

   To enable, add the following to /etc/modprobe.conf:

      options ib_sa paths_per_dest=0x7f

   or

      echo 0x7f > /sys/module/ib_sa/paths_per_dest

   To manually refresh:

      echo 1 > /sys/module/ib_sa/refresh

   To monitor:

      cat /sys/module/ib_sa/lookup_method
         * 0 round robin 1 round robin
      cat /sys/module/ib_sa/paths_per_dest

   You can also increase the uDAPL PR timeout with the following environment variable (if you don't have SA caching):

      export DAPL_CM_ROUTE_TIMEOUT_MS=20000    (default=4000)

3. CM PROTOCOL:

   OFED 1.2.5 provides the following module parameters to increase the IB cm response timeout from the default of 21.

   To increase the timeout, add the following to /etc/modprobe.conf:

      options rdma_cm cma_response_timeout=23
      options ib_cm max_timeout=23

-arlin
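
For the original question of how to confirm that SA caching is on, a minimal shell sketch along the following lines might help. It assumes the sysfs location quoted in this thread (/sys/module/ib_sa/paths_per_dest), and it assumes that an empty or zero value means caching is off, which the thread implies but does not state outright. SA_SYSFS_DIR and sa_cache_status are hypothetical names introduced here for illustration; they are not part of OFED.

```shell
#!/bin/sh
# Sketch only: report whether ib_sa path record caching appears to be enabled.
# SA_SYSFS_DIR is a hypothetical override (handy for testing); the default is
# the sysfs directory quoted in this thread and may differ on other kernels.
SA_SYSFS_DIR="${SA_SYSFS_DIR:-/sys/module/ib_sa}"

sa_cache_status() {
    f="$SA_SYSFS_DIR/paths_per_dest"

    # If the parameter file is missing, the module is probably not loaded
    # or exposes its parameters elsewhere; do not guess.
    if [ ! -r "$f" ]; then
        echo "unknown"
        return 2
    fi

    val=$(cat "$f")

    # Assumption: zero (or empty) means SA caching is off,
    # any other value (e.g. 0x7f or 127) means it is on.
    case "$val" in
        ''|0|0x0|0x00) echo "disabled" ;;
        *)             echo "enabled" ;;
    esac
}
```

After sourcing the file, running sa_cache_status prints enabled, disabled, or unknown. A result of unknown suggests checking that ib_sa is loaded before changing any of the timeouts above.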
