> 'retries exceeded' means that the transport retry count was
> exceeded, so most likely your timeout is set too low.

Is there a commonly recommended value for this timeout? I use 18, which corresponds to roughly 1 second.

> Without seeing your code, I couldn't begin to say why you
> don't see a send completion. If you are absolutely positive
> that you post a send and you never see a completion for that
> send, then I guess it is a firmware or hardware problem.

It is very hard to reproduce this error with standalone code. I use HP-MPI and need 8 ranks on at least 4 nodes with 2 cards per node, and only one of our hundred test codes catches this error, in an MPI_Scatterv operation.

--CQ

>  - R.
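
For reference, the QP timeout field is an exponent, not a time: the local ACK timeout is 4.096 usec * 2^timeout, with 0 meaning "wait forever". A minimal sketch of that conversion follows. The function name ack_timeout_usec is hypothetical (it is not part of any verbs library), and the HCA may in practice wait somewhat longer than the nominal value.

```c
/* Nominal InfiniBand local ACK timeout for a given QP timeout exponent.
 * Hypothetical helper for illustration only, not a libibverbs call.
 * Formula: 4.096 usec * 2^exponent. The field is 5 bits wide (0..31),
 * and 0 means "infinite", so both cases are reported as -1.0 here. */
double ack_timeout_usec(unsigned exponent)
{
    if (exponent == 0 || exponent > 31)
        return -1.0;
    return 4.096 * (double)(1UL << exponent);
}
```

With exponent 18 this gives 1073741.824 usec, i.e. about 1.07 seconds, which matches the "1 second" above. Each step up doubles the timeout, so 19 is about 2.1 seconds and 20 about 4.3 seconds.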
