10.01.2014, 05:29, "Andrew Beekhof" <and...@beekhof.net>:
> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>
>> 08.01.2014, 06:22, "Andrew Beekhof" <and...@beekhof.net>:
>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>
>>>> Hi, ALL.
>>>>
>>>> I'm still trying to cope with the fact that after a fence the node
>>>> hangs in "pending".
>>>
>>> Please define "pending". Where did you see this?
>>
>> In crm_mon:
>> ......
>> Node dev-cluster2-node2 (172793105): pending
>> ......
>>
>> The experiment went like this:
>> Four nodes in the cluster.
>> On one of them, kill corosync or pacemakerd (signal 4, 6 or 11).
>> After that, the remaining nodes keep rebooting it, under various
>> pretexts: "softly whistling", "flying low", "not a cluster member!" ...
>> Then "Too many failures ...." fell out in the log.
>> All this time the status in crm_mon is "pending".
>> Depending on the direction of the wind it changed to "UNCLEAN".
>> Much time has passed and I cannot describe the behaviour accurately
>> any more...
>>
>> Now I am in the following state:
>> I tried to locate the problem and came here with this:
>> I set a big value in the property stonith-timeout="600s"
>> and got the following behaviour:
>> 1. pkill -4 corosync
>> 2. The node with the DC calls my fence agent "sshbykey".
>> 3. It reboots the victim and waits until it comes back to life.
>
> Hmmm.... what version of pacemaker?
> This sounds like a timing issue that we fixed a while back
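The stonith-timeout mentioned above is an ordinary cluster property; a
minimal sketch of setting and checking it, assuming the crm shell is
installed (pcs has an equivalent: "pcs property set stonith-timeout=600s"):

    # Set a 10-minute stonith timeout as a cluster-wide property
    crm configure property stonith-timeout=600s

    # Verify the value the cluster actually picked up
    crm configure show | grep stonith-timeout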
It was version 1.1.11 from December 3. Now trying a full update and retest.

>> Once the script makes sure that the victim has rebooted and is
>> reachable again via ssh, it exits with 0.
>> All commands are logged on both the victim and the killer - all right.
>> 4. A little later, the status of the victim node in crm_mon changes to
>> online.
>> 5. BUT... not one resource starts! Despite the fact that
>> "crm_simulate -sL" shows the correct resource to start:
>> * Start pingCheck:3 (dev-cluster2-node2)
>> 6. In this state we spend the next 600 seconds.
>> After this timeout expires, another node (not the DC) decides to
>> kill our victim again.
>> All commands are again logged on both the victim and the killer - all
>> documented :)
>> 7. NOW all resources start in the right sequence.
>>
>> I am almost happy, but I do not like it: two reboots and 10 minutes of
>> waiting ;)
>> And if something happens on another node, this behaviour is
>> superimposed on the old one, and no resources start until the last
>> node has been rebooted twice.
>>
>> I tried to understand this behaviour.
>> As I understand it:
>> 1. Ultimately, ./lib/fencing/st_client.c calls
>> internal_stonith_action_execute().
>> 2. It forks and opens a pipe to the child.
>> 3. It asynchronously calls mainloop_child_add with
>> stonith_action_async_done as the callback.
>> 4. It adds a timeout with g_timeout_add, after which TERM and KILL
>> signals are sent.
>>
>> If all goes right, stonith_action_async_done is called and the timeout
>> is removed. For some reason this does not happen. I sit and think ....
>>
>>>> At this time, there are constant re-elections.
>>>> Also, I noticed a difference in the way pacemaker starts.
>>>> At a normal startup:
>>>> * corosync
>>>> * pacemakerd
>>>> * attrd
>>>> * pengine
>>>> * lrmd
>>>> * crmd
>>>> * cib
>>>>
>>>> At a startup that hangs:
>>>> * corosync
>>>> * pacemakerd
>>>> * attrd
>>>> * pengine
>>>> * crmd
>>>> * lrmd
>>>> * cib
>>>
>>> Are you referring to the order of the daemons here?
>>> The cib should not be at the bottom in either case.
>>>
>>>> Who knows who runs lrmd?
>>>
>>> Pacemakerd.
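For reference, a minimal sketch of a fence agent following the pattern
the "sshbykey" script is described to use above (reboot over ssh, poll
until sshd answers again, then exit 0). The key path, the retry budget
and taking the victim as $1 are assumptions, not the actual script:

    #!/bin/sh
    # Hypothetical sketch, not the real "sshbykey" agent: reboot the
    # victim over ssh, then wait until it accepts ssh connections again.
    victim=$1

    # Ask the victim to reboot; the connection drops, so ignore the result.
    ssh -i /root/.ssh/fence_key "root@$victim" reboot || true

    # Poll for up to ~10 minutes until sshd answers again.
    for try in $(seq 1 120); do
        sleep 5
        if ssh -i /root/.ssh/fence_key -o ConnectTimeout=5 \
               "root@$victim" true 2>/dev/null; then
            exit 0   # victim verified back up: report the fence as successful
        fi
    done
    exit 1           # victim never came back: report failure

Exiting 0 only once the node is reachable again matches step 3 above and
explains why the agent itself can take minutes to return.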
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org