Hi Dejan,

On Thu, 2011-02-24 at 11:08 +0100, Dejan Muhamedagic wrote:
> Hi Holger,
>
> On Wed, Feb 23, 2011 at 06:03:21PM +0100, Holger Teutsch wrote:
> > Hi Dejan,
> >
> > On Wed, 2011-02-23 at 11:54 +0100, Dejan Muhamedagic wrote:
> > > Hi Holger,
> > >
> > > On Tue, Feb 22, 2011 at 06:25:37PM +0100, Holger Teutsch wrote:
> > > > Hi,
> > > > I resubmit the db2 agent for inclusion into the project. Besides fixing ....
> > > > @@ -417,8 +445,12 @@
> > > > ocf_log err "Possible split brain ! Manual intervention required."
> > > > ocf_log err "If this DB is outdated use \"db2 start hadr on db $db as standby\""
> > > > ocf_log err "If this DB is the surviving primary use \"db2 start hadr on db $db as primary by force\""
> > > > - # should we return OCF_ERR_INSTALLED instead ?
> > > > - # might be a timing problem
> > > > +
> > > > + # might be a timing problem because "First active log" is delayed
> > > > + # sleep long so we won't end up in a high speed retry loop
> > > > + # lrmd will kill us eventually on timeout
> > > > + # on the next start attempt we might succeed when FAL was advanced
> > > > + sleep 36000
> > >
> > > Perhaps you should still remove this sleep. If there's nothing
> > > that can be done without administrator intervention, then better
> > > exit soon and let the cluster try to recover whichever way it can
> > > (depending also on how it is configured).
> > >
> >
> > Yes, but we can end up in a "high speed" restart loop. Instead of
> > putting in some random sleep I felt that relying on the administrator's
> > timeout choice is better.
>
> Well, the RA should always tell the truth and in this case it
> gives an impression that there was a timeout even though there
> wasn't one. What is it actually that should or shouldn't happen
> at this point? Does it want to say: "I cannot be started anymore
> on this node"? Is that just a temporary condition? BTW, even if
> it gets into a restart loop, that cannot make things any worse,
> right? I can't say really, but somehow doing an artificial
> timeout doesn't look right.

The scenario is:

"I am a lonesome Primary and could not connect to my Standby during
startup within HADR_TIMEOUT seconds."

So there are multiple possible causes and multiple possible resolutions.

Conclusion: no sleep; the agent returns the truth, a generic error
(OCF_ERR_GENERIC).
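
To illustrate the "fail fast, tell the truth" behaviour, here is a minimal
standalone sketch. It is not the agent code: log_err and hadr_start_primary
are hypothetical stand-ins (the real agent uses ocf_log and the db2 CLI),
and the failure is hard-wired so the error path can be seen. The numeric
values follow the usual OCF convention (0 = success, 1 = generic error).

```shell
#!/bin/sh
# Sketch only, under the assumptions stated above.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Hypothetical stand-in for ocf_log: write the message to stderr.
log_err() { echo "ERROR: $*" >&2; }

# Hypothetical stand-in for starting HADR as primary. It always fails here,
# simulating "Standby not reachable within HADR_TIMEOUT seconds".
hadr_start_primary() { return 1; }

db2_start_sketch() {
    db=$1
    if hadr_start_primary "$db"; then
        return $OCF_SUCCESS
    fi
    log_err "Possible split brain ! Manual intervention required."
    log_err "If this DB is outdated use \"db2 start hadr on db $db as standby\""
    # No sleep here: report the failure immediately and leave the retry
    # policy to the cluster manager instead of faking a timeout.
    return $OCF_ERR_GENERIC
}
```

Calling db2_start_sketch with any database name returns 1 at once, so the
cluster sees a real start failure rather than an artificial timeout.
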
Regards
Holger
------------------ reference -----------------------
--- a/db2 Wed Feb 23 18:24:59 2011 +0100
+++ b/db2 Thu Feb 24 16:15:55 2011 +0100
@@ -446,11 +446,11 @@
ocf_log err "If this DB is outdated use \"db2 start hadr on db $db as standby\""
ocf_log err "If this DB is the surviving primary use \"db2 start hadr on db $db as primary by force\""
+ # might be the Standby is not yet there
# might be a timing problem because "First active log" is delayed
- # sleep long so we won't end up in a high speed retry loop
- # lrmd will kill us eventually on timeout
- # on the next start attempt we might succeed when FAL was advanced
- sleep 36000
+ # on the next start attempt we might succeed when FAL was advanced
+ # might be manual intervention is required
+ # ... so let pacemaker give it another try and we will succeed then
return $OCF_ERR_GENERIC
;;
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
