On Thu, 12 Apr 2007 07:21:55 +0300 "Michael S. Tsirkin" <[EMAIL PROTECTED]> wrote:
> > Quoting Ira Weiny <[EMAIL PROTECTED]>:
> > Subject: Re: [ofa-general] Re: multicast join failed for...
> >
> > On 11 Apr 2007 17:45:54 -0400
> > Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> >
> > > On Wed, 2007-04-11 at 15:47, Michael S. Tsirkin wrote:
> > >
> > > > - previously we had some client failing join
> > > > which is worse.
> > >
> > > Maybe not. Maybe that's what the admin wants (to keep the higher rate
> > > rather than degrade the group due to some link issue).
> > >
> >
> > Indeed, on a big cluster it would be better to have a few nodes dropped out
> > than to limit the speed of the entire cluster.
>
> Why are you joining these nodes then?
> Anyway, could always be an option.
>

We have seen a specific example where a node's 4X link comes up at 1X. In
this case we would want the join to fail. Basically, a single hardware error
isolated to one node should not affect the other 1150 nodes, which could very
well be running a user's job.

Certainly, on a heterogeneous network we would want different behavior, but
we don't operate any of our clusters like that.

After reading today's posts I think it should be an option. If someone has a
mixture of link rates, they can configure it. I am not sure what the default
should be, though. I know we would want the join to fail, but I understand
the argument for allowing it to work.

Ira
