>> I understood that Fab checked this issue (by 10 retries of 1 second TO)
>> and found that it didn't help there. Yet another try can be enlarging
>> the TO to be 5 sec and sending less retries
>
> I think some exponential backoff strategy with some randomization
> might be better.

The problem with this is that the layers above IPoIB (namely the network stack
generating ARP requests and expecting ARP responses) don't have visibility
into this backoff strategy, and will give up on an ARP request if the response
doesn't come back in time.  The response could be delayed for a long time if
the SM isn't responding to queries in a timely manner, since IPoIB needs to
resolve the path in order to send the unicast response.  I don't know the
timeout for an ARP response, but I'd be surprised if it were 10 seconds, let
alone whatever you would get with exponential backoff.
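
As a rough illustration of the timing, here is a small Python sketch.  Every
number in it is an assumption chosen for the example (1 second base timeout,
doubling per retry, 5 retries, and a hypothetical 3 second deadline standing
in for the unknown ARP timeout); none of it is taken from the IPoIB driver or
the network stack.

```python
import random


def backoff_schedule(base=1.0, factor=2.0, retries=5, jitter=0.5, rng=None):
    """Return the cumulative time (seconds) at which each attempt times out.

    base, factor, retries and jitter are illustrative values only.
    """
    rng = rng or random.Random()
    elapsed = 0.0
    timeout = base
    schedule = []
    for _ in range(retries):
        # Randomize each timeout by up to +/- `jitter` (as a fraction).
        wait = timeout * (1.0 + rng.uniform(-jitter, jitter))
        elapsed += wait
        schedule.append(elapsed)
        timeout *= factor
    return schedule


def attempts_before_deadline(schedule, deadline):
    """Count the attempts that have fully timed out by `deadline` seconds."""
    return sum(1 for t in schedule if t <= deadline)


if __name__ == "__main__":
    # Without jitter the attempts time out at 1, 3, 7, 15 and 31 seconds.
    plain = backoff_schedule(jitter=0.0)
    print(plain)
    # Hypothetical 3 second upper-layer deadline (assumed, not measured).
    print(attempts_before_deadline(plain, 3.0))
```

With these assumed values only the first two query attempts finish before the
upper layer would give up; the later, longer retries happen after the ARP
request has already been abandoned.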

I initially tried exponential backoff to resolve the problem I was seeing with 
these MPI apps, and it didn't work because of this.  That's when I set out on a 
path to take the SM out of the equation as much as possible.

-Fab
_______________________________________________
ofw mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
