Thanks Steve, I will give it a try and let you know the results. Actually, I used 64 virtual machines for the experiments; I don't have 64 real physical nodes so far. :)
Javen

2010/2/23 Steven Dake <[email protected]>:

> On Mon, 2010-02-22 at 18:37 +0800, Javen Wu wrote:
> > Hi Steve,
> >
> > We tried to set up a large-scale corosync cluster, but found that when
> > the node count is > 32, the cluster is not stable: a node joining may
> > break the existing cluster and cause it to be re-configured. In my
> > recent test, the break/re-config happened when nodes 46, 51, 59, 60
> > and 62 tried to join the cluster. And sometimes "service corosync
> > stop" drives the node to 100% CPU usage and it cannot recover; I have
> > to restart the node.
> >
> > BTW, I'm curious about the token loss timeout setting. My
> > understanding is that the token loss timeout is there to detect node
> > loss or a network partition in the ring. But if the Totem protocol
> > depends on passing a token around the ring, shouldn't the scale of
> > the cluster be relevant to the setting of the token loss timeout, and
> > might it impact the failure detection time?
>
> The larger the ring, the longer it takes to circulate a token.
> Generally this time is about (time to process and transmit the
> token * nodecount) + (time to send a multicast message * window).
>
> I have found on a 32 node cluster the token rotation time is around
> 2-3 msec.
>
> More likely the problems you run into at scale have to do with the
> membership protocol. Try increasing join and consensus. Try to use a
> consensus that is 5*token for such a large ring, and if that doesn't
> work, try consensus=10*token.
>
> Your feedback on this approach is appreciated. Most people don't have
> 64 node clusters to test with.
>
> Re shutdown and spinning, we are aware of shutdown problems in the
> current code, and trunk has some rework in this area. I'd suggest
> giving that a try.
>
> Regards
> -steve
>
> > I will do more investigation and testing to see how corosync scales
> > to a 64-node cluster.
> >
> > Thanks
> > Javen
> >
> > 2010/1/13 Steven Dake <[email protected]>:
> > Untested at this time.
> >
> > Feel free to try and report your experiences.
> >
> > I have tested 48 nodes on physical hardware and things work quite
> > well with a 1 sec token timeout and a 5 second consensus timeout.
> >
> > Regards
> > -steve
> >
> > On Tue, 2010-01-12 at 12:10 +0800, Javen Wu wrote:
> > > Hi Folks,
> > >
> > > I just realized that Corosync has a limitation of 32 nodes as a
> > > maximum. Is it possible to extend the limitation to support 64
> > > nodes? Any technical barrier?
> > >
> > > thanks
> > > --
> > > Javen Wu
> > >
> > > _______________________________________________
> > > Openais mailing list
> > > [email protected]
> > > https://lists.linux-foundation.org/mailman/listinfo/openais
> >
> > --
> > Javen Wu

--
Javen Wu
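[Editor's note: the timeout tuning discussed above lives in the totem section of /etc/corosync/corosync.conf. The sketch below uses the 1 sec token timeout from Steve's 48-node test and consensus = 5*token as he suggests; the join value is purely illustrative, since the thread does not give a number.]

```
totem {
    version: 2

    # Token loss timeout in milliseconds (1 sec, as in the 48-node test)
    token: 1000

    # Consensus timeout; 5*token as suggested for a large ring.
    # If the ring still breaks on join, try 10*token (10000).
    consensus: 5000

    # Join timeout in milliseconds; this value is a guess, tune upward
    # as needed for large memberships
    join: 100
}
```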
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
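[Editor's note: Steve's token rotation estimate can be turned into a quick back-of-the-envelope calculation. The per-hop and multicast constants below are illustrative guesses, not measurements from the thread; only the formula itself comes from his reply.]

```python
def token_rotation_ms(per_node_token_ms: float, node_count: int,
                      mcast_send_ms: float, window: int) -> float:
    """Estimate token rotation time per Steve's formula:
    (time to process/transmit token * nodecount)
      + (time to send multicast message * window)."""
    return per_node_token_ms * node_count + mcast_send_ms * window

# With ~0.08 ms per hop and a small multicast cost, a 32-node ring
# lands in the 2-3 msec range Steve reports (constants are guesses).
print(token_rotation_ms(0.08, 32, 0.01, 50))
```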
