Hi,

On Wed, Mar 17, 2010 at 12:38:47PM +0100, Tom Tux wrote:
> Hi Dejan
>
> Thanks for your answer.
>
> I'm using this cluster with the packages from the HAE
> (HighAvailability-Extension) repository of SLES11. Is it therefore
> possible to upgrade cluster-glue from source?

Yes, though I don't think that any SLE11 version has this bug. When was
your version released? What does hb_report -V say?
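
For reference, you can see what is actually installed with something
like the commands below; I'm assuming the package is simply called
cluster-glue on SLE, so adjust the name if it differs on your system:

    # installed cluster-glue build and the date it was built
    rpm -q cluster-glue
    rpm -qi cluster-glue | grep -i 'build date'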

> I think the better
> way is to wait for updates in the HAE repository from Novell. Or do
> you have experience upgrading cluster-glue from source (even if
> it is installed with zypper/rpm)?
>
> Do you know when the HAE repository will be updated?

Can't say. Best would be if you talk to Novell about the issue.

Cheers,

Dejan

> Thanks a lot.
> Tom
>
>
> 2010/3/17 Dejan Muhamedagic <deja...@fastmail.fm>:
> > Hi,
> >
> > On Wed, Mar 17, 2010 at 10:57:16AM +0100, Tom Tux wrote:
> >> Hi Dominik
> >>
> >> The problem is that the cluster does not run the monitor action every
> >> 20s. The last time it ran the action was at 09:21, and it is now
> >> 10:37:
> >
> > There was a serious bug in some cluster-glue packages. What
> > you're experiencing sounds like that. I can't say which
> > packages (probably sth like 1.0.1, they were never released). At
> > any rate, I'd suggest upgrading to cluster-glue 1.0.3.
> >
> > Thanks,
> >
> > Dejan
> >
> >> MySQL_MonitorAgent_Resource: migration-threshold=3
> >>    + (479) stop: last-rc-change='Wed Mar 17 09:21:28 2010'
> >>      last-run='Wed Mar 17 09:21:28 2010' exec-time=3010ms queue-time=0ms
> >>      rc=0 (ok)
> >>    + (480) start: last-rc-change='Wed Mar 17 09:21:31 2010'
> >>      last-run='Wed Mar 17 09:21:31 2010' exec-time=3010ms queue-time=0ms
> >>      rc=0 (ok)
> >>    + (481) monitor: interval=10000ms last-rc-change='Wed Mar 17 09:21:34 2010'
> >>      last-run='Wed Mar 17 09:21:34 2010' exec-time=20ms queue-time=0ms
> >>      rc=0 (ok)
> >>
> >> If I restart the whole cluster, then the new return code (exit 99 or
> >> exit 4) is seen by the cluster monitor.
> >>
> >>
> >> 2010/3/17 Dominik Klein <d...@in-telegence.net>:
> >> > Hi Tom
> >> >
> >> > Have a look at the logs and see whether the monitor op really returns
> >> > 99 (grep for the resource id). If so, I'm not sure what the cluster
> >> > does with rc=99. As far as I know, rc=4 would be status=failed (unknown
> >> > actually).
> >> >
> >> > Regards
> >> > Dominik
> >> >
> >> > Tom Tux wrote:
> >> >> Thanks for your hint.
> >> >>
> >> >> I've configured an LSB resource like this (with migration-threshold):
> >> >>
> >> >> primitive MySQL_MonitorAgent_Resource lsb:mysql-monitor-agent \
> >> >>         meta target-role="Started" migration-threshold="3" \
> >> >>         op monitor interval="10s" timeout="20s" on-fail="restart"
> >> >>
> >> >> I have now modified the init script "/etc/init.d/mysql-monitor-agent"
> >> >> to exit with a return code other than "0" (for example exit 99) when
> >> >> the monitor operation queries the status. But the cluster does not
> >> >> recognise a failed monitor action. Why this behaviour? For the
> >> >> cluster, everything seems ok.
> >> >>
> >> >> node1:/ # showscores.sh MySQL_MonitorAgent_Resource
> >> >> Resource                     Score     Node   Stickiness  #Fail  Migration-Threshold
> >> >> MySQL_MonitorAgent_Resource  -1000000  node1  100         0      3
> >> >> MySQL_MonitorAgent_Resource  100       node2  100         0      3
> >> >>
> >> >> I also saw that the "last-run" entry (crm_mon -fort1) for this
> >> >> resource is not up to date. It seems to me that the monitor action
> >> >> does not run every 10 seconds. Why? Any hints on this behaviour?
> >> >>
> >> >> Thanks a lot.
> >> >> Tom
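
One aside on the exit-99 test: for the result to mean anything to the
cluster, the script's status action has to return one of the LSB status
codes, and 99 is outside that set. A rough sketch of a conforming status
branch (the pidfile path here is only a placeholder):

    case "$1" in
    status)
        # LSB status exit codes: 0=running, 1=dead but pidfile exists,
        # 2=dead but lockfile exists, 3=not running, 4=status unknown
        pidfile=/var/run/mysql-monitor-agent.pid   # placeholder path
        if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
            exit 0   # running
        else
            exit 3   # not running; if the resource is supposed to be
                     # running, the cluster treats this monitor as failed
        fi
        ;;
    # start/stop/restart branches omitted
    esac

That said, the symptom above (monitor ops simply not being executed) is
the cluster-glue bug, not a matter of return codes.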

> >> >> 2010/3/16 Dominik Klein <d...@in-telegence.net>:
> >> >>> Tom Tux wrote:
> >> >>>> Hi
> >> >>>>
> >> >>>> I have a question about resource monitoring:
> >> >>>> I'm monitoring an IP resource every 20 seconds. I have configured the
> >> >>>> monitor's "on-fail" action with "restart", and this works fine: if
> >> >>>> the "monitor" operation fails, the resource is restarted.
> >> >>>>
> >> >>>> But how can I configure this resource to migrate to the other node if
> >> >>>> it still fails after 10 restarts? Is this possible? How does the
> >> >>>> "failcount" interact with this scenario?
> >> >>>>
> >> >>>> In the documentation I read that the resource "fail_count" will
> >> >>>> increase every time the resource restarts. But I can't see this
> >> >>>> fail_count.
> >> >>> Look at the meta attribute "migration-threshold".
> >> >>>
> >> >>> Regards
> >> >>> Dominik
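
Coming back to the original question (move the IP after ten failed
restarts): migration-threshold is indeed the knob. A minimal sketch in
crm syntax; the resource name, address and timings below are only
placeholder values:

    primitive ClusterIP ocf:heartbeat:IPaddr2 \
            params ip="192.168.100.50" cidr_netmask="24" \
            op monitor interval="20s" timeout="20s" on-fail="restart" \
            meta migration-threshold="10" failure-timeout="300s"

Every failed start or monitor increments the resource's fail-count on
that node (crm_mon -f shows it). Once the fail-count reaches
migration-threshold, the resource is banned from that node and moves to
the other one; it only comes back after the fail-count is cleared (e.g.
crm resource cleanup ClusterIP) or expires via failure-timeout.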