Hi,

On Thu, Mar 18, 2010 at 02:15:07PM +0100, Tom Tux wrote:
> Hi Dejan
>
> hb_report -V says:
> cluster-glue: 1.0.2 (b75bd738dc09263a578accc69342de2cb2eb8db6)
Yes, unfortunately that one is buggy.

> I've opened a case with Novell. They will fix this problem by updating
> to the newest cluster-glue release.
>
> Could it be that I have another configuration issue in my
> cluster config? I think with the following setting, the resource
> should be monitored:
>
> ...
> ...
> primitive MySQL_MonitorAgent_Resource lsb:mysql-monitor-agent \
>         meta migration-threshold="3" \
>         op monitor interval="10s" timeout="20s" on-fail="restart"
> op_defaults $id="op_defaults-options" \
>         on-fail="restart" \
>         enabled="true"
> property $id="cib-bootstrap-options" \
>         expected-quorum-votes="2" \
>         dc-version="1.0.6-c48e3360eb18c53fd68bb7e7dbe39279ccbc0354" \
>         cluster-infrastructure="openais" \
>         stonith-enabled="true" \
>         no-quorum-policy="ignore" \
>         stonith-action="reboot" \
>         last-lrm-refresh="1268838090"
> ...
> ...
>
> And when I look at the last-run time with "crm_mon -fort1", it shows me:
>
> MySQL_Server_Resource: migration-threshold=3
>     + (32) stop: last-rc-change='Wed Mar 17 10:49:55 2010'
>       last-run='Wed Mar 17 10:49:55 2010' exec-time=5060ms queue-time=0ms
>       rc=0 (ok)
>     + (40) start: last-rc-change='Wed Mar 17 11:09:06 2010'
>       last-run='Wed Mar 17 11:09:06 2010' exec-time=4080ms queue-time=0ms
>       rc=0 (ok)
>     + (41) monitor: interval=20000ms last-rc-change='Wed Mar 17
>       11:09:10 2010' last-run='Wed Mar 17 11:09:10 2010' exec-time=20ms
>       queue-time=0ms rc=0 (ok)
>
> And the results above are from yesterday....

The configuration looks fine to me.

Cheers,

Dejan

> Thanks for your help.
> Kind regards,
> Tom
>
>
> 2010/3/18 Dejan Muhamedagic <deja...@fastmail.fm>:
> > Hi,
> >
> > On Wed, Mar 17, 2010 at 12:38:47PM +0100, Tom Tux wrote:
> >> Hi Dejan
> >>
> >> Thanks for your answer.
> >>
> >> I'm using this cluster with the packages from the HAE (High
> >> Availability Extension) repository of SLES11. Is it therefore
> >> possible to upgrade cluster-glue from source?
> >
> > Yes, though I don't think that any SLE11 version has this bug.
> > When was your version released? What does hb_report -V say?
> >
> >> I think the better way is to wait for updates in the HAE repository
> >> from Novell. Or do you have experience upgrading cluster-glue from
> >> source (even if it is installed with zypper/rpm)?
> >>
> >> Do you know when the HAE repository will be updated?
> >
> > Can't say. Best would be if you talk to Novell about the issue.
> >
> > Cheers,
> >
> > Dejan
> >
> >> Thanks a lot.
> >> Tom
> >>
> >>
> >> 2010/3/17 Dejan Muhamedagic <deja...@fastmail.fm>:
> >> > Hi,
> >> >
> >> > On Wed, Mar 17, 2010 at 10:57:16AM +0100, Tom Tux wrote:
> >> >> Hi Dominik
> >> >>
> >> >> The problem is that the cluster does not run the monitor action
> >> >> every 20s. The last time it ran the action was at 09:21, and it
> >> >> is now 10:37:
> >> >
> >> > There was a serious bug in some cluster-glue packages. What
> >> > you're experiencing sounds like that. I can't say which
> >> > packages (probably sth like 1.0.1, they were never released). At
> >> > any rate, I'd suggest upgrading to cluster-glue 1.0.3.
> >> >
> >> > Thanks,
> >> >
> >> > Dejan
> >> >
> >> >> MySQL_MonitorAgent_Resource: migration-threshold=3
> >> >>     + (479) stop: last-rc-change='Wed Mar 17 09:21:28 2010'
> >> >>       last-run='Wed Mar 17 09:21:28 2010' exec-time=3010ms queue-time=0ms
> >> >>       rc=0 (ok)
> >> >>     + (480) start: last-rc-change='Wed Mar 17 09:21:31 2010'
> >> >>       last-run='Wed Mar 17 09:21:31 2010' exec-time=3010ms queue-time=0ms
> >> >>       rc=0 (ok)
> >> >>     + (481) monitor: interval=10000ms last-rc-change='Wed Mar 17
> >> >>       09:21:34 2010' last-run='Wed Mar 17 09:21:34 2010' exec-time=20ms
> >> >>       queue-time=0ms rc=0 (ok)
> >> >>
> >> >> If I restart the whole cluster, then the new return code (exit 99 or
> >> >> exit 4) will be seen by the cluster monitor.
> >> >>
> >> >>
> >> >> 2010/3/17 Dominik Klein <d...@in-telegence.net>:
> >> >> > Hi Tom
> >> >> >
> >> >> > Have a look at the logs and see whether the monitor op really returns
> >> >> > 99 (grep for the resource id). If so, I'm not sure what the cluster
> >> >> > does with rc=99. As far as I know, rc=4 would be status=failed
> >> >> > (unknown, actually).
> >> >> >
> >> >> > Regards
> >> >> > Dominik
> >> >> >
> >> >> > Tom Tux wrote:
> >> >> >> Thanks for your hint.
> >> >> >>
> >> >> >> I've configured an LSB resource like this (with migration-threshold):
> >> >> >>
> >> >> >> primitive MySQL_MonitorAgent_Resource lsb:mysql-monitor-agent \
> >> >> >>         meta target-role="Started" migration-threshold="3" \
> >> >> >>         op monitor interval="10s" timeout="20s" on-fail="restart"
> >> >> >>
> >> >> >> I have now modified the init script "/etc/init.d/mysql-monitor-agent"
> >> >> >> to exit with a return code not equal to "0" (for example, exit 99)
> >> >> >> when the monitor operation queries the status. But the cluster does
> >> >> >> not recognise a failed monitor action. Why this behaviour? For the
> >> >> >> cluster, everything seems ok.
> >> >> >>
> >> >> >> node1:/ # showcores.sh MySQL_MonitorAgent_Resource
> >> >> >> Resource                     Score     Node   Stickiness  #Fail  Migration-Threshold
> >> >> >> MySQL_MonitorAgent_Resource  -1000000  node1  100         0      3
> >> >> >> MySQL_MonitorAgent_Resource  100       node2  100         0      3
> >> >> >>
> >> >> >> I also saw that the "last-run" entry (crm_mon -fort1) for this
> >> >> >> resource is not up to date. It seems to me that the monitor action
> >> >> >> does not occur every 10 seconds. Why? Any hints on this behaviour?
> >> >> >>
> >> >> >> Thanks a lot.
> >> >> >> Tom
> >> >> >>
> >> >> >>
> >> >> >> 2010/3/16 Dominik Klein <d...@in-telegence.net>:
> >> >> >>> Tom Tux wrote:
> >> >> >>>> Hi
> >> >> >>>>
> >> >> >>>> I have a question about resource monitoring:
> >> >> >>>> I'm monitoring an IP resource every 20 seconds. I have configured
> >> >> >>>> the "on-fail" action with "restart". This works fine: if the
> >> >> >>>> "monitor" operation fails, the resource is restarted.
> >> >> >>>>
> >> >> >>>> But how can I define this resource to migrate to the other node
> >> >> >>>> if it still fails after 10 restarts? Is this possible? How will
> >> >> >>>> the "failcount" interact with this scenario?
> >> >> >>>>
> >> >> >>>> In the documentation I read that the resource "fail_count" will
> >> >> >>>> increase every time the resource restarts. But I can't see this
> >> >> >>>> fail_count.
> >> >> >>> Look at the meta attribute "migration-threshold".
> >> >> >>> > >> >> >>> Regards > >> >> >>> Dominik > >> >> > > >> >> > > >> >> > _______________________________________________ > >> >> > Pacemaker mailing list > >> >> > Pacemaker@oss.clusterlabs.org > >> >> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > >> >> > > >> >> > >> >> _______________________________________________ > >> >> Pacemaker mailing list > >> >> Pacemaker@oss.clusterlabs.org > >> >> http://oss.clusterlabs.org/mailman/listinfo/pacemaker > >> > > >> > _______________________________________________ > >> > Pacemaker mailing list > >> > Pacemaker@oss.clusterlabs.org > >> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > >> > > >> > >> _______________________________________________ > >> Pacemaker mailing list > >> Pacemaker@oss.clusterlabs.org > >> http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > > > _______________________________________________ > > Pacemaker mailing list > > Pacemaker@oss.clusterlabs.org > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > > > _______________________________________________ > Pacemaker mailing list > Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker _______________________________________________ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker