This could be related to connection timeouts. We have seen this
on larger clusters when the local SA cache is not enabled or when the SM
node is down. I think local_sa_cache is disabled by default,
but Arlin can confirm this.

woody


That is true, OFED 1.2.5 disables SA caching by default. I would
recommend enabling SA caching.

When using rdma_cm to establish end-to-end connections, we go through a three-step process, and each step has its own tunable knobs: ARP, path resolution, and the CM req/reply exchange. Any one of these could cause the 4008 timeout error.

Here are tunable parameters that may help:

1. ARP:

The reachable time for ARP cache entries on ib0 can be increased from the default of 30 seconds:

sysctl -w net.ipv4.neigh.ib0.base_reachable_time=14400

2. PATH RESOLUTION:

ib_sa.ko provides path record caching, no timer controls,
auto refresh with new device notification events from SM/SA,
manual refresh control for administrators,
default == SA caching is OFF.
        
To enable: add following to /etc/modprobe.conf -

        options ib_sa paths_per_dest=0x7f
        or
        echo 0x7f > /sys/module/ib_sa/paths_per_dest

To manually refresh:
    echo 1 > /sys/module/ib_sa/refresh

To monitor:
    cat /sys/module/ib_sa/lookup_method
        * 0 round robin
        1 round robin

    cat /sys/module/ib_sa/paths_per_dest
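
The monitoring step above can be wrapped in a small helper. This is only a sketch: the function name sa_cache_status is made up for illustration, and it assumes that a paths_per_dest value of 0 means caching is off and any positive value means it is on, which the module does not document here. The optional directory argument exists only so the helper can be pointed at a different location than the sysfs path shown above.

```shell
# Hypothetical helper: report whether ib_sa path record caching looks enabled.
# $1: directory holding paths_per_dest (defaults to the sysfs path used above).
# Assumption: 0 means caching is off, any positive value means it is on.
sa_cache_status() {
    dir="${1:-/sys/module/ib_sa}"
    file="$dir/paths_per_dest"
    if [ ! -r "$file" ]; then
        echo "unknown"
        return 1
    fi
    value=$(cat "$file")
    # Shell arithmetic accepts both decimal and 0x-prefixed hex values.
    if [ "$((value))" -gt 0 ]; then
        echo "enabled"
    else
        echo "disabled"
    fi
}
```

Running sa_cache_status with no argument on a node reads /sys/module/ib_sa/paths_per_dest and prints enabled, disabled, or unknown.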


You can also increase the uDAPL PR timeout with the following
environment variable (if you don't have SA caching):

export DAPL_CM_ROUTE_TIMEOUT_MS=20000 (default=4000)

3. CM PROTOCOL:

OFED 1.2.5 provides the following module parameters to increase
the IB cm response timeout from default of 21:

To increase timeout: add following to /etc/modprobe.conf -
    options rdma_cm cma_response_timeout=23
    options ib_cm max_timeout=23
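
To get a feel for what these values mean in wall-clock time, here is a rough sketch. It assumes both parameters use the usual InfiniBand timeout encoding of 4.096 us * 2^n; that is my reading rather than something stated in the module parameter descriptions quoted here, and the function name cm_timeout_ms is made up for illustration.

```shell
# Hypothetical helper: convert an IB CM timeout exponent to milliseconds.
# Assumption: the timeout is encoded as 4.096 us * 2^n.
cm_timeout_ms() {
    # 4096 ns * 2^n, then integer-divide down to milliseconds.
    echo $(( 4096 * (1 << $1) / 1000000 ))
}

cm_timeout_ms 21   # default: prints 8589 (about 8.6 seconds)
cm_timeout_ms 23   # value suggested above: prints 34359 (about 34 seconds)
```

Under that assumption, each increment of the exponent doubles the timeout, so going from 21 to 23 makes it four times longer.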


-arlin
