Sean Hefty wrote:
>Currently a DREP is only sent in response to a DREQ if a connection
>has been found matching the DREQ, and it is in the proper state. Once
>a DREP is sent, the local connection moves into timewait. Duplicate
>DREQs received while in this state result in re-sending the DREP.
>
>However, it's likely that the local connection will enter and exit
>timewait before the remote side times out a lost DREP and resends a DREQ.
>There are a couple possible solutions to this. One is to increase how
>long a connection remains in timewait, by multiplying its wait time by
>max_cm_retries. This can greatly increase the timewait state before a QP
>can be re-used when CM messages are not lost.
>
>An alternative is to send a DREP in response to a DREQ, even if a local
>connection is not found, which is what this patch does.
>
Would it be possible to get this fix in rc7? I am consistently seeing
this problem with Intel MPI on a 64 node cluster.

-arlin
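
For readers following the thread, here is a minimal sketch of the behavior described in the quoted patch summary. It is a toy model, not the ib_cm code: the names ToyCM, handle_dreq, timewait_expired, and the always_reply flag are hypothetical and exist only for illustration. It assumes the first DREP is lost and that the local side leaves timewait before the remote side resends its DREQ.

```python
from enum import Enum, auto


class State(Enum):
    ESTABLISHED = auto()
    TIMEWAIT = auto()


class ToyCM:
    """Toy model of DREQ handling (hypothetical, not the ib_cm API)."""

    def __init__(self, always_reply):
        # always_reply=True models the patched behavior: answer a DREQ
        # even when no matching local connection is found.
        self.always_reply = always_reply
        self.conns = {}  # connection id -> State

    def handle_dreq(self, conn_id):
        state = self.conns.get(conn_id)
        if state is State.ESTABLISHED:
            # First DREQ: reply and move the connection into timewait.
            self.conns[conn_id] = State.TIMEWAIT
            return "DREP"
        if state is State.TIMEWAIT:
            # Duplicate DREQ while still in timewait: resend the DREP.
            return "DREP"
        # No matching connection, e.g. it has already left timewait.
        return "DREP" if self.always_reply else None

    def timewait_expired(self, conn_id):
        # The connection is forgotten once timewait ends.
        self.conns.pop(conn_id, None)


def disconnect_with_lost_drep(always_reply):
    """Replay the scenario: DREP lost, timewait ends, DREQ is resent."""
    cm = ToyCM(always_reply)
    cm.conns[1] = State.ESTABLISHED
    first = cm.handle_dreq(1)   # this DREP is assumed lost in transit
    cm.timewait_expired(1)      # local side exits timewait
    retry = cm.handle_dreq(1)   # remote side times out and resends
    return first, retry
```

With always_reply=False the resent DREQ gets no answer, so the remote side keeps retrying until it gives up; with always_reply=True it receives a DREP and can finish the disconnect.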
