On Thu, 2009-07-02 at 09:27 -0500, David Teigland wrote:
> On Thu, Jul 02, 2009 at 01:15:18PM +0200, Jan Friesse wrote:
> > David Teigland wrote:
> > > On Wed, Jul 01, 2009 at 01:46:03PM -0500, David Teigland wrote:
> > >> other nodes should immediately recognize it has
> > >> previously failed and process a complete failure for it.
> > >
> > > i.e. the full equivalent to what apps (using any api's) would see if the
> > > node had failed via normal token timeout.
> >
> > More or less agree, but does this patch fixed problem for you or not?
>
> I haven't tried the patch, but based on the description and a quick look at
> the patch, I don't think it helps.  Think more broadly about what's happening
> here, don't focus on one particular effect.
>
> 1. nodes 1,2,3,4: are cluster members
> 2. nodes 1,2,3,4: are using services A,B,C,D
> 3. node4: ifdown eth0, kill corosync
> 4. node4: ifup eth0, start corosync
> 5. node4: do not start/use any services
> 6. nodes 1,2,3: never see node4 removed from membership
> 7. nodes 1,2,3: services A,B,C,D never see node4 removed/fail
>
Individual services have to protect against those sorts of restarts.  The
only other mechanism would be to break wire compatibility within Totem.

This patch resolves the cpg case which is what the original bug was
filed against.

Regards
-steve

> Dave
>
> _______________________________________________
> Openais mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/openais
