Hi Holger,

On Thu, Feb 24, 2011 at 04:28:49PM +0100, Holger Teutsch wrote:
> Hi Dejan,
>
> On Thu, 2011-02-24 at 11:08 +0100, Dejan Muhamedagic wrote:
> > Hi Holger,
> >
> > On Wed, Feb 23, 2011 at 06:03:21PM +0100, Holger Teutsch wrote:
> > > Hi Dejan,
> > >
> > > On Wed, 2011-02-23 at 11:54 +0100, Dejan Muhamedagic wrote:
> > > > Hi Holger,
> > > >
> > > > On Tue, Feb 22, 2011 at 06:25:37PM +0100, Holger Teutsch wrote:
> > > > > Hi,
> > > > > I resubmit the db2 agent for inclusion into the project. Besides
> > > > > fixing
> ....
>
> > > > > @@ -417,8 +445,12 @@
> > > > >  ocf_log err "Possible split brain ! Manual
> > > > > intervention required."
> > > > >  ocf_log err "If this DB is outdated use \"db2 start
> > > > > hadr on db $db as standby\""
> > > > >  ocf_log err "If this DB is the surviving primary use
> > > > > \"db2 start hadr on db $db as primary by force\""
> > > > > - # should we return OCF_ERR_INSTALLED instead ?
> > > > > - # might be a timing problem
> > > > > +
> > > > > + # might be a timing problem because "First active
> > > > > log" is delayed
> > > > > + # sleep long so we won't end up in a high speed
> > > > > retry loop
> > > > > + # lrmd will kill us eventually on timeout
> > > > > + # on the next start attempt we might succeed when
> > > > > FAL was advanced
> > > > > + sleep 36000
> > > >
> > > > Perhaps you should still remove this sleep. If there's nothing
> > > > that can be done without administrator intervention, then better
> > > > exit soon and let the cluster try to recover whichever way it can
> > > > (depending also on how it is configured).
> > > >
> > > >
> > > Yes, but we can end up in a "high speed" restart loop. Instead of
> > > putting in some random sleep I felt that relying on the administrator's
> > > timeout choice is better.
> >
> > Well, the RA should always tell the truth and in this case it
> > gives an impression that there was a timeout even though there
> > wasn't one.
> > What is it actually that should or shouldn't happen
> > at this point? Does it want to say: "I cannot be started anymore
> > on this node"? Is that just a temporary condition? BTW, even if
> > it gets into a restart loop, that cannot make things any worse,
> > right? I can't say really, but somehow doing an artificial
> > timeout doesn't look right.
>
> The scenario is:
> "I am a lonesome Primary and could not connect to my Standby during
> startup within HADR_TIMEOUT seconds"
>
> So multiple causes, multiple possible resolutions...
>
> -> no sleep, return the truth: generic error

OK. Good. Will apply this and previous changes to the new git
repository.

Cheers,

Dejan

> Regards
> Holger
>
>
> ------------------ reference -----------------------
> --- a/db2 Wed Feb 23 18:24:59 2011 +0100
> +++ b/db2 Thu Feb 24 16:15:55 2011 +0100
> @@ -446,11 +446,11 @@
>  ocf_log err "If this DB is outdated use \"db2 start hadr on
> db $db as standby\""
>  ocf_log err "If this DB is the surviving primary use \"db2
> start hadr on db $db as primary by force\""
>
> + # might be the Standby is not yet there
>  # might be a timing problem because "First active log" is
> delayed
> - # sleep long so we won't end up in a high speed retry loop
> - # lrmd will kill us eventually on timeout
> - # on the next start attempt we might succeed when FAL was
> advanced
> - sleep 36000
> + # on the next start attempt we might succeed when FAL was
> advanced
> + # might be manual intervention is required
> + # ... so let pacemaker give it another try and we will
> succeed then
>  return $OCF_ERR_GENERIC
>  ;;
>
>
>
> _______________________________________________________
> Linux-HA-Dev: [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/
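
For illustration only, the behaviour agreed on in this thread (log the condition and return a generic error right away, rather than sleeping until lrmd times the operation out) could be sketched roughly as below. This is not the db2 agent's actual code: try_start_hadr_primary, db2_start_primary_sketch and the HADR_START_RESULT variable are hypothetical names invented for this sketch, and ocf_log plus the two return codes are stubbed locally so the snippet runs without ocf-shellfuncs.

```shell
# Stand-ins for what ocf-shellfuncs normally provides (assumption: the
# real agent sources that file instead of defining these itself).
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
ocf_log() { level=$1; shift; echo "$level: $*" >&2; }

# Hypothetical stub for the HADR start attempt. It succeeds only when
# HADR_START_RESULT=ok; any other value simulates a primary that could
# not reach its standby during startup.
try_start_hadr_primary() {
    [ "${HADR_START_RESULT:-no_standby}" = "ok" ]
}

# Sketch of the start path for a primary: report the failure truthfully
# and immediately, and leave the retry policy to the cluster manager.
db2_start_primary_sketch() {
    db=$1
    if try_start_hadr_primary "$db"; then
        return $OCF_SUCCESS
    fi
    ocf_log err "Possible split brain ! Manual intervention required."
    ocf_log err "If this DB is outdated use \"db2 start hadr on db $db as standby\""
    ocf_log err "If this DB is the surviving primary use \"db2 start hadr on db $db as primary by force\""
    # No artificial sleep here: a long sleep would make the failure look
    # like a timeout even though nothing actually timed out.
    return $OCF_ERR_GENERIC
}
```

Calling db2_start_primary_sketch with the default settings logs the three messages and returns $OCF_ERR_GENERIC; how quickly the cluster retries after that is then a matter of cluster configuration rather than of the agent.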
