Thanks Steve,

I will give it a try and let you know the results. Actually, I used 64
virtual machines for the experiments; I don't have 64 real physical nodes
so far. :)

Javen

2010/2/23 Steven Dake <[email protected]>

> On Mon, 2010-02-22 at 18:37 +0800, Javen Wu wrote:
> > Hi Steve,
> >
> > We tried to set up a large-scale corosync cluster, but found that when
> > the node count is > 32, the cluster is not stable: a node joining may
> > break the existing cluster and force a re-configuration. In my recent
> > test, the break/re-config happened when nodes 46, 51, 59, 60 and 62
> > tried to join the cluster. And sometimes "service corosync stop" drives
> > the node to 100% CPU usage and it cannot recover; I have to restart the
> > node.
> >
> > BTW, I am curious about the token loss timeout setting. My
> > understanding is that the token loss timeout detects node loss or a
> > network partition in the ring. But if the Totem protocol depends on
> > passing a token around the ring, shouldn't the cluster's scale be
> > relevant to the token loss timeout setting, and might it affect the
> > failure detection time?
> >
>
> The larger the ring, the longer it takes to circulate a token.
> Generally this time is about (time to process and transmit the
> token) * nodecount + (time to send a multicast message) * window.
>
> I have found on a 32 node cluster the token rotation time is around 2-3
> msec.
>
> More likely the problems you run into at scale have to do with the
> membership protocol.  Try increasing join and consensus.  Try a
> consensus that is 5*token for such a large ring, and if that doesn't
> work, try consensus = 10*token.
>
> Your feedback on this approach is appreciated.  Most people don't have
> 64 node clusters to test with.
>
> Re shutdown and spinning, we are aware of shutdown problems in the
> current code, and trunk has some rework in this area.  I'd suggest
> giving that a try.
>
> Regards
> -steve
>
> > I will do more investigation and testing to see how corosync scales
> > to a 64-node cluster.
> >
> > Thanks
> > Javen
> >
> >
> >
> >
> >
> > 2010/1/13 Steven Dake <[email protected]>
> >         Untested at this time.
> >
> >         Feel free to try and report your experiences.
> >
> >         I have tested 48 nodes on physical hardware and things work
> >         quite well
> >         with a 1 sec token timeout and 5 second consensus timeout.
> >
> >         Regards
> >         -steve
> >
> >
> >         On Tue, 2010-01-12 at 12:10 +0800, Javen Wu wrote:
> >         > Hi Folks,
> >         >
> >         > I just realized that Corosync has a limitation of 32 nodes
> >         maximum.
> >         > Is it possible to extend the limitation to support 64
> >         nodes? Any
> >         > technical barrier?
> >         >
> >         > thanks
> >         > --
> >         > Javen Wu
> >
> >         > _______________________________________________
> >         > Openais mailing list
> >         > [email protected]
> >         > https://lists.linux-foundation.org/mailman/listinfo/openais
> >
> >
> >
> >
> > --
> > Javen Wu
>
>
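Putting Steve's suggestions together, a starting point for the totem
section of corosync.conf might look like the sketch below. The token and
consensus values come straight from this thread (1 s token, 5*token
consensus); the join value and the exact syntax are assumptions based on
corosync's documented defaults, so please double-check them against your
version:

```
totem {
    version: 2

    # 1 second token timeout, as tested on 48 physical nodes (in ms)
    token: 1000

    # consensus = 5*token for a large ring; if membership is still
    # unstable, try 10*token (10000)
    consensus: 5000

    # join timeout (ms); Steve suggests increasing this for large
    # rings -- this particular value is an assumption, tune and observe
    join: 100
}
```

If consensus = 10*token also fails to stabilize the ring, it may be worth
raising token itself and scaling consensus with it.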


-- 
Javen Wu
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
