> > Changqing> Is there a common recommended value for this timeout?
> > Changqing> I use 18, which represents 1 second.
>
>18 should be OK I guess, unless you have congestion in your
>fabric, in which case you have other problems anyway.
>
> > Changqing> It is very hard to reproduce this error with standalone
> > Changqing> code. I use HP-Mpi and need 8 ranks, at least 4 nodes
> > Changqing> with 2 cards on each node, and just one of our hundred
> > Changqing> test code can catch this error, and it is on
> > Changqing> MPI_Scatterv Operation.
>
>Unless you can narrow down a way to reproduce this, I don't
>think it's going to be possible for anyone to help debug it.

OK, I forgot to mention: if I use RDMA on both channels, it is hard to
reproduce the hang. If I create an SRQ on one of the channels, the other
channel hangs even on the first RDMA operation. I will write standalone
code for you driver guys to debug.

--CQ

>
> - R.
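
For reference, on why a timeout value of 18 comes out at roughly 1 second: assuming the timeout discussed above is the QP local ACK timeout, the value is an exponent, and the actual wait is 4.096 usec * 2^timeout. A minimal sketch of that arithmetic follows; the helper name ack_timeout_ns is made up for illustration and is not part of any verbs API.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a libibverbs function): convert a local ACK
 * timeout exponent into nanoseconds, using 4.096 usec * 2^timeout.
 * The field is 5 bits wide, and 0 means "wait forever", so 0 and
 * out-of-range values are reported here as 0.
 */
uint64_t ack_timeout_ns(unsigned timeout)
{
    if (timeout == 0 || timeout > 31)
        return 0;
    /* 4.096 usec is 4096 nsec; shifting left multiplies by 2^timeout */
    return UINT64_C(4096) << timeout;
}
```

With timeout = 18 this gives 1073741824 ns, i.e. about 1.07 seconds. Each step up or down doubles or halves the wait, so 17 is about 0.54 seconds and 19 is about 2.15 seconds.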
