2008/5/17 Dotan Barak <[EMAIL PROTECTED]>:
> Rui Machado wrote:
>>
>> 2008/5/16 Roland Dreier <[EMAIL PROTECTED]>:
>>
>>>
>>> > hmm..... and is there no workaround for this situation? I
>>> > mean, if the server dies, isn't there any possibility that
>>> > the sender/client realizes this? If the timeout is too large this
>>> > can be cumbersome.
>>> >
>>> > I tried reducing the timeout and indeed the client realizes faster
>>> > when the server exits, but another problem arises: without exiting
>>> > the server, on the client side I get the error (retry exceed) when
>>> > polling for a recently posted send - this after some hours.
>>>
>>> There's a tradeoff between detecting real failures faster, and reducing
>>> false errors detected because a response came too slowly.
>>>
>>> Clearly if a response may take an amount of time 'X' to be received
>>> under normal conditions, there's no way to conclude that the remote side
>>> has failed without waiting at least 'X'.
>>>
>>>
>>
>> I understand. So there's no real difference between the two
>> situations, real server failure or just a load problem that takes more
>> time?
>>
>
> From the sender QP point of view, they are the same (ack/nack wasn't sent
> during a specific period of time)
>>
>> Something like a different error or a SIGPIPE :) ?
>>
>> I will describe my situation, maybe it helps (bear with me as I'm
>> starting with Infiniband and so on).
>> I have a client and a server. The client posts RDMA calls one at a
>> time (post, poll, post...). So the server is just there.
>> If I try to start something like 16 clients on 1 machine, after a few
>> hours I will get an error on some client programs (retry excess) with
>> a timeout of 14. If I increase the timeout to 32, I don't see that
>> error, but if I stop the server, the clients take a lot of time to
>> acknowledge that, which is also not wanted.
>> That's why I asked if there is a 'good value'. If I have such a load
>> between 2 nodes, I always have to risk that if the server dies the
>> client will take a long time to see it. That's not nice!
>>
>
> Did you try to increase the retry_count too?
> (and not only the timeout).

But that wouldn't change my scenario, since the overall time is given by
timeout * retry count, right?

> By the way, which RDMA operation do you execute, READ or WRITE?
>> READ.
>> Thanks for the help and quick answers,
>>
>
> You are always welcome ..

Great :)

Cheers,
Rui
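
A rough sketch of the arithmetic behind the "timeout * retry count" question, for anyone reading this thread later. It assumes the usual InfiniBand RC semantics: the QP `timeout` attribute is an exponent rather than a duration (nominal local ACK timeout = 4.096 usec * 2^timeout, with 0 meaning wait indefinitely), and `retry_cnt` is the number of retries after the first attempt. The function names are hypothetical and are not part of libibverbs. An HCA may wait longer than the nominal value, so treat the result as an estimate of the lower bound, not a guarantee.

```python
# Nominal figures only; real hardware may wait longer than these values.
BASE_SECONDS = 4.096e-6  # 4.096 microseconds, the base unit of the ACK timeout


def ack_timeout_seconds(timeout):
    """Nominal local ACK timeout for the 5-bit QP `timeout` attribute."""
    if not 0 <= timeout <= 31:
        raise ValueError("timeout is a 5-bit field (0..31)")
    if timeout == 0:
        return float("inf")  # 0 means: wait indefinitely for the ACK
    return BASE_SECONDS * (2 ** timeout)


def detection_time_seconds(timeout, retry_cnt):
    """Hypothetical helper: nominal time until a retry-exceeded completion.

    Assumes one initial attempt plus `retry_cnt` retries, each of which
    waits one full ACK timeout.
    """
    if not 0 <= retry_cnt <= 7:
        raise ValueError("retry_cnt is a 3-bit field (0..7)")
    return ack_timeout_seconds(timeout) * (retry_cnt + 1)


if __name__ == "__main__":
    for timeout in (14, 18, 20):
        per_try = ack_timeout_seconds(timeout)
        total = detection_time_seconds(timeout, retry_cnt=7)
        print(f"timeout={timeout}: {per_try:.3f} s per attempt, "
              f"about {total:.1f} s with retry_cnt=7")
```

Because the timeout is an exponent, each step up doubles the wait per attempt, while the retry count only scales it linearly. Under these assumptions a timeout of 14 is about 67 ms per attempt, and the largest value the field can hold is 31.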
