On Fri, 2007-04-13 at 00:41, Michael S. Tsirkin wrote:
> > Quoting Sean Hefty <[EMAIL PROTECTED]>:
> > Subject: RE: [ofa-general] Re: multicast join failed for...
> >
> > >> > The job will continue running though, and when you diagnose the problem
> > >> > and disconnect the bad node, rate will be back to high.
> > >> > So what's the problem?
> >
> > What would bring the rate back up?
>
> When the node is diagnosed and disconnected, SM will bring the rate back up.

I would say that the SM could (rather than will) bring the rate back up.
This increases the implementation complexity but would be warranted
if/when a dynamic rate option is supported.

> > Halting all multicast traffic across the subnet to handle a flaky node
>
> Not halting, that would be broken. We are slowing the traffic down to avoid
> congestion at this link.
>
> And you don't know it's "flaky" - it's just a heterogeneous network. Policy can
> be forced by SM option but I don't think we should assume homogeneous networks
> by default.

Homogeneous subnets are not assumed. What is assumed is the most common
use case (4x SDR or greater equipment). The issue occurs when there is a
slower node attempting to join.

-- Hal

> > wanting
> > to join some multicast group doesn't seem like a good solution.
>
> As I said, there are tens of ways a bad node can hurt performance,
> and we don't/can't handle them. Why focus on ipoib? It's
> the only way to connect to node on some fabrics, it
> really must be up at all times.
>
> > Plus it looks
> > like we'd have to repeat this later to bring the rate back up.
>
> So? It should all be automatic.
> You see a problem in the network, diagnose it, replace the bad node,
> performance comes back up. That's the way to do it.
