2008/5/16 Dotan Barak <[EMAIL PROTECTED]>:
> Rui Machado wrote:
>>
>> 2008/5/17 Dotan Barak <[EMAIL PROTECTED]>:
>>>
>>> Rui Machado wrote:
>>>>
>>>> 2008/5/16 Roland Dreier <[EMAIL PROTECTED]>:
>>>>>
>>>>> > hmm..... and is there no workaround for this situation? I
>>>>> > mean, if the server dies isn't there any possibility that
>>>>> > the sender/client realizes this. If the timeout is too large this
>>>>> > can be cumbersome.
>>>>> >
>>>>> > I tried reducing the timeout and indeed the client realizes faster
>>>>> > when the server exits but another problem arises: Without exiting
>>>>> > the server,
>>>>> > on the client side I get the error (retry exceed) when polling for a
>>>>> > recently posted send - this after some hours.
>>>>>
>>>>> There's a tradeoff between detecting real failures faster, and reducing
>>>>> false errors detected because a response came too slowly.
>>>>>
>>>>> Clearly if a response may take an amount of time 'X' to be received
>>>>> under normal conditions, there's no way to conclude that the remote
>>>>> side has failed without waiting at least 'X'.
>>>>>
>>>>
>>>> I understand. So there's no real difference between the two
>>>> situations, real server failure or just a load problem that takes more
>>>> time?
>>>>
>>>
>>> From the sender QP point of view, they are the same (ack/nack wasn't sent
>>> during a specific period of time)
>>>
>>>>
>>>> Something like a different error or a SIGPIPE :) ?
>>>>
>>>> I will describe my situation, maybe it helps (bear with me as I'm
>>>> starting with Infiniband and so on)
>>>> I have a client and a server. The client posts RDMA calls one at a
>>>> time (post, poll, post...). So the server is just there.
>>>> If I try to start something like 16 clients on 1 machine, after a few
>>>> hours I will get an error on some client programs (retry excess) with
>>>> a timeout of 14.
>>>> If I increase the timeout to 32, I don't see that
>>>> error but if I stop the server, the clients take a lot of time to
>>>> acknowledge that, which is also not wanted.
>>>> That's why I asked if there is a 'good value'. If I have such a load
>>>> between 2 nodes, I always have to risk that if the server dies the
>>>> client will take a long time to see it. That's not nice!
>>>>
>>>
>>> Did you try to increase the retry_count too?
>>> (and not only the timeout).
>>>
>
> Yes.
>>
>> But that wouldn't change my scenario since the overall time is given
>> by the timeout * retry count right?
>>
>>>
>>> By the way, which RDMA operation do you execute, READ or WRITE?
>>>
>>
>> READ.
>>
>
> Can you replace it with a write (from the other side)?
> READ has "higher price" than a WRITE.
>
Can you please briefly explain why READ has this higher price?

> Anyway, you should get the mentioned behavior anyway..
>
> When the sender gets the error, what is the status of the receiver QP?
> (did you try to execute ibv_query_qp and get its status?)
>

I tried to get the qp state right after the error and it is 6 (which I
believe is IBV_QPS_ERR).
Why do you ask?

Thanks
Rui
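
For reference, the question above about timeout * retry count can be put
into numbers. The sketch below assumes the InfiniBand convention that the
QP `timeout` attribute is a 5-bit exponent encoding a nominal local ACK
timeout of 4.096 us * 2^timeout (0 meaning "wait forever"), and that the
retry-exceeded error surfaces after the first attempt plus `retry_cnt`
retries. The function names are made up for illustration, and real HCAs
may wait longer than the nominal value, so treat the result as a rough
estimate only.

```c
#include <assert.h>
#include <stdint.h>

/* Nominal local ACK timeout in microseconds for the QP `timeout`
 * attribute: 4.096 us * 2^timeout. A value of 0 means "wait forever"
 * and is reported here as 0.0. (Hypothetical helper, not a verbs call.) */
double ack_timeout_us(uint8_t timeout)
{
    assert(timeout < 32);       /* assumed 5-bit field: 0..31 */
    if (timeout == 0)
        return 0.0;
    return 4.096 * (double)(1ULL << timeout);
}

/* Rough estimate of the time before "retry exceeded" is reported:
 * the initial attempt plus retry_cnt retries, each waiting one
 * nominal timeout. (Hypothetical helper, assumption-based.) */
double detect_time_us(uint8_t timeout, uint8_t retry_cnt)
{
    assert(retry_cnt <= 7);     /* assumed 3-bit field: 0..7 */
    return ack_timeout_us(timeout) * (double)(retry_cnt + 1);
}
```

Under these assumptions, a timeout of 14 gives about 67 ms per attempt,
and with a retry count of 7 roughly half a second before the error is
reported. Each step of the timeout value doubles that figure, which is
why the exponent matters far more than the retry count.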
