Re: [ofa-general] Re: multicast join failed for...

Ira Weiny Fri, 13 Apr 2007 09:08:45 -0700

On 13 Apr 2007 07:37:04 -0400
Hal Rosenstock <[EMAIL PROTECTED]> wrote:


> On Fri, 2007-04-13 at 00:17, Michael S. Tsirkin wrote:
>
> > Quoting Ira Weiny <[EMAIL PROTECTED]>:
> > Subject: Re: [ofa-general] Re: multicast join failed for...
> > > 
> > > On Thu, 12 Apr 2007 20:16:32 +0300
> > > "Michael S. Tsirkin" <[EMAIL PROTECTED]> wrote:
> > > > 
> > > > The job will continue running though, and when you diagnose the problem
> > > > and disconnect the bad node, rate will be back to high.
> > > > So what's the problem?
> > > 
> > > Performance impact between the time it happens and diagnosing the problem.
> > > > Yes, disabling the node is a better solution, however, the current
> > > > behavior is not bad for us.
> > 
> > Hal, here we have a use case that I think shows that the right thing
> > is by default to make joins succeed. Convinced?
> 
> Didn't Ira say that "the current behavior is not bad for us" ? The
> current behavior is default 4x SDR rate which makes slower joins fail.
> 
> Are you saying change the default rate to 1x SDR ? I've been concerned
> about masking performance issues when doing this as we've discussed
> several times before.
> 

Indeed I said "NOT" bad.  We do NOT want the performance to come down.  If this
happens silently on a Friday night the cluster could run all weekend at a
reduced rate.

I am thinking that a check on the node's link is a good idea.  Such a check
would also make the problem easier to diagnose.
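
As a rough illustration of the kind of link check I mean, here is a minimal
sketch.  It assumes the port rate is exposed under
/sys/class/infiniband/<device>/ports/<port>/rate as text such as
"10 Gb/sec (4X)"; the helper names and the 10 Gb/sec (4x SDR) threshold are
hypothetical, not something OpenSM or the stack provides.

```python
import re
from pathlib import Path

# Assumption: the local HCA exposes its port rate in sysfs in a form
# like "10 Gb/sec (4X)" or "2.5 Gb/sec (1X)".
SYSFS_IB = Path("/sys/class/infiniband")
RATE_RE = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)\s*Gb/sec")


def parse_rate_gbps(text):
    """Return the leading rate in Gb/sec from a sysfs-style rate string."""
    match = RATE_RE.match(text)
    if match is None:
        raise ValueError("unrecognized rate string: %r" % text)
    return float(match.group(1))


def link_is_degraded(rate_text, expected_gbps=10.0):
    """True if the reported rate is below the expected one (default 4x SDR)."""
    return parse_rate_gbps(rate_text) < expected_gbps


def degraded_ports(expected_gbps=10.0, root=SYSFS_IB):
    """List (device, port, rate string) for every local port below the expected rate."""
    bad = []
    for rate_file in sorted(root.glob("*/ports/*/rate")):
        text = rate_file.read_text()
        if link_is_degraded(text, expected_gbps):
            device = rate_file.parent.parent.parent.name
            port = rate_file.parent.name
            bad.append((device, port, text.strip()))
    return bad


if __name__ == "__main__":
    for device, port, rate in degraded_ports():
        print("%s port %s is running at %s" % (device, port, rate))
```

Run periodically on each node (from cron, say), something like this would flag
a link that came up at 1x before the cluster spends a weekend at reduced rate.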

Thanks,
Ira
