On 13 Apr 2007 07:37:04 -0400 Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> On Fri, 2007-04-13 at 00:17, Michael S. Tsirkin wrote:
> > > Quoting Ira Weiny <[EMAIL PROTECTED]>:
> > Subject: Re: [ofa-general] Re: multicast join failed for...
> > >
> > > On Thu, 12 Apr 2007 20:16:32 +0300
> > > "Michael S. Tsirkin" <[EMAIL PROTECTED]> wrote:
> > > >
> > > > The job will continue running though, and when you diagnose the problem
> > > > and disconnect the bad node, rate will be back to high.
> > > > So what's the problem?
> > >
> > > Performance impact between the time it happens and diagnosing the problem.
> > > Yes, disabling the node is a better solution, however, the current
> > > behavior is not bad for us.
> >
> > Hal, here we have a use case that I think shows that the right thing
> > is by default to make joins succeed. Convinced?
>
> Didn't Ira say that "the current behavior is not bad for us" ? The
> current behavior is default 4x SDR rate which makes slower joins fail.
>
> Are you saying change the default rate to 1x SDR ? I've been concerned
> about masking performance issues when doing this as we've discussed
> several times before.
>

Indeed I said "NOT" bad. We do NOT want the performance to come down. If this
happens silently on a Friday night the cluster could run all weekend at a
reduced rate.

I am thinking that a check on the node's link is a good idea. It would also be
able to better diagnose the problem.

Thanks,
Ira
