On Thu, Oct 7, 2010 at 11:48 AM, Keisuke MORI <keisuke.mori...@gmail.com> wrote:
> Andrew,
>
> 2010/9/23 Andrew Beekhof <and...@beekhof.net>:
>> Pushed as:
>>   http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
>>
>> Not sure about applying to 1.0 though, it's a dramatic change in behavior.
>
> I would like to backport this to 1.0.
> Would you agree with this?

I would prefer not to, but if it is important to you then I will agree.

>
> Without this the failed node was not fenced when it ought to be and
> failed to continue the service.
> I would also think that it would be good to have the same behavior
> between 1.0 and 1.1 in such a critical condition to support both
> versions better.
>
> Thanks,
> Keisuke MORI
>
>>
>> On Wed, Sep 22, 2010 at 11:18 AM, <renayama19661...@ybb.ne.jp> wrote:
>>> Hi Andrew,
>>>
>>> Thank you for comment.
>>>
>>>> A long time ago in a galaxy far away, some messaging layers used to
>>>> lose quite a few actions, including stops.
>>>> About the same time, we decided that fencing because a stop action was
>>>> lost wasn't a good idea.
>>>>
>>>> The rationale was that if the operation eventually completed, it would
>>>> end up in the CIB anyway.
>>>> And even if it didn't, the PE would continue to try the operation
>>>> again until the whole node fell over at which point it would get shot
>>>> anyway.
>>>
>>> Sorry...
>>> I did not know the fact that there was such an argument in old days.
>>>
>>>
>>>> Now, having said that, things have improved since then and perhaps,
>>>> in the interest of speeding up recovery in these situations, it is time
>>>> to stop treating stop operations differently.
>>>> Would you agree?
>>>
>>> That means, you change it in the case of "Action Lost" of the stop this
>>> time to carry out stonith?
>>> If my recognition is right, I agree too.
>>>
>>> if(timer->action->type != action_type_rsc) {
>>>     send_update = FALSE;
>>> } else if(safe_str_eq(task, "cancel")) {
>>>     /* we don't need to update the CIB with these */
>>>     send_update = FALSE;
>>> }
>>> ---> delete "else if(safe_str_eq(task, "stop")){..}" ?
>>>
>>> if(send_update) {
>>>     /* cib_action_update(timer->action, LRM_OP_PENDING, EXECRA_STATUS_UNKNOWN); */
>>>     cib_action_update(timer->action, LRM_OP_TIMEOUT, EXECRA_UNKNOWN_ERROR);
>>> }
>>>
>>> Best Regards,
>>> Hideo Yamauchi.
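[Editor's sketch] The decision discussed in the snippet above can be modelled in isolation as follows. This is only an illustration, not the crmd source: the enum, the function name should_record_lost_action and the legacy_skip_stop flag are hypothetical, and it assumes that the "stop" branch proposed for deletion simply suppressed the CIB update, which the thread implies but does not show.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the action types in the quoted snippet. */
enum action_type { ACTION_TYPE_RSC, ACTION_TYPE_PSEUDO, ACTION_TYPE_CRM };

/*
 * Decide whether a lost (timed-out) action should be recorded in the CIB
 * as a failed operation.
 *
 * legacy_skip_stop models the assumed older behaviour, in which a lost
 * "stop" was not recorded; pass false to treat "stop" like any other
 * resource operation, as proposed in the thread.
 */
static bool should_record_lost_action(enum action_type type,
                                      const char *task,
                                      bool legacy_skip_stop)
{
    assert(task != NULL);

    if (type != ACTION_TYPE_RSC) {
        return false;           /* only resource actions are recorded */
    }
    if (strcmp(task, "cancel") == 0) {
        return false;           /* cancels never need a CIB update */
    }
    if (legacy_skip_stop && strcmp(task, "stop") == 0) {
        return false;           /* assumed old special case for stop */
    }
    return true;                /* record as a timed-out operation */
}
```

With legacy_skip_stop set to false, a lost stop is recorded as a timeout like any other resource operation. The policy engine then sees a failed stop and can apply the configured recovery for it, which is typically fencing when stonith is enabled.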
>>>
>>> --- Andrew Beekhof <and...@beekhof.net> wrote:
>>>
>>>> On Tue, Sep 21, 2010 at 8:59 AM, <renayama19661...@ybb.ne.jp> wrote:
>>>> > Hi,
>>>> >
>>>> > Node was in state that the load was very high, and we confirmed monitor
>>>> > movement of Pacemaker.
>>>> > Action Lost occurred in stop movement after the error of the monitor
>>>> > occurred.
>>>> >
>>>> > Sep  8 20:02:22 cgl54 crmd: [3507]: ERROR: print_elem: Aborting transition, action lost: [Action 9]: In-flight (id: prmApPostgreSQLDB1_stop_0, loc: cgl49, priority: 0)
>>>> > Sep  8 20:02:22 cgl54 crmd: [3507]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost
>>>> >
>>>> >
>>>> > For the load of the node, We think that the stop movement did not go
>>>> > well.
>>>> > But cannot nodes execute stonith.
>>>>
>>>> A long time ago in a galaxy far away, some messaging layers used to
>>>> lose quite a few actions, including stops.
>>>> About the same time, we decided that fencing because a stop action was
>>>> lost wasn't a good idea.
>>>>
>>>> The rationale was that if the operation eventually completed, it would
>>>> end up in the CIB anyway.
>>>> And even if it didn't, the PE would continue to try the operation
>>>> again until the whole node fell over at which point it would get shot
>>>> anyway.
>>>>
>>>> Now, having said that, things have improved since then and perhaps,
>>>> in the interest of speeding up recovery in these situations, it is time
>>>> to stop treating stop operations differently.
>>>> Would you agree?
>
> --
> Keisuke MORI

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker