Hi

From Novell Support I got a PTF (Program Temporary Fix) which should
handle this issue.

I think the monitoring is working now, but I'm puzzled by the output of
"crm_mon -t1", which shows the "last-rc-change" and "last-run" of the
monitor operation. I have defined the monitor operation for a certain
resource to run every 10 seconds, but the "last-run" field in the
"crm_mon -t1" output does not change its value. It only changes when the
operation returns a code other than "0" and the failcount is increased.
Is this behaviour correct?
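
In case it helps, this is how I'm checking it at the moment (the
resource name is from my config, and I'm not sure crm_failcount is the
canonical way to read the failcount):

  # one-shot status with operation history and timing details
  crm_mon -fort1

  # read the current failcount of the resource on the local node
  crm_failcount -G -r MySQL_MonitorAgent_Resource
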
Thanks a lot for your help.
Kind regards,
Tom

2010/3/19 Tom Tux <tomtu...@gmail.com>:
> Hi
>
> Thanks a lot for your help.
>
> So now it's Novell's turn... :-)
>
> Regards,
> Tom
>
>
> 2010/3/18 Dejan Muhamedagic <deja...@fastmail.fm>:
>> Hi,
>>
>> On Thu, Mar 18, 2010 at 02:15:07PM +0100, Tom Tux wrote:
>>> Hi Dejan
>>>
>>> hb_report -V says:
>>> cluster-glue: 1.0.2 (b75bd738dc09263a578accc69342de2cb2eb8db6)
>>
>> Yes, unfortunately that one is buggy.
>>
>>> I've opened a case with Novell. They will fix this problem by
>>> updating to the newest cluster-glue release.
>>>
>>> Could it be that I have another configuration issue in my cluster
>>> config? I think that with the following settings the resource should
>>> be monitored:
>>>
>>> ...
>>> ...
>>> primitive MySQL_MonitorAgent_Resource lsb:mysql-monitor-agent \
>>>         meta migration-threshold="3" \
>>>         op monitor interval="10s" timeout="20s" on-fail="restart"
>>> op_defaults $id="op_defaults-options" \
>>>         on-fail="restart" \
>>>         enabled="true"
>>> property $id="cib-bootstrap-options" \
>>>         expected-quorum-votes="2" \
>>>         dc-version="1.0.6-c48e3360eb18c53fd68bb7e7dbe39279ccbc0354" \
>>>         cluster-infrastructure="openais" \
>>>         stonith-enabled="true" \
>>>         no-quorum-policy="ignore" \
>>>         stonith-action="reboot" \
>>>         last-lrm-refresh="1268838090"
>>> ...
>>> ...
>>>
>>>
>>> And when I look at the last-run time with "crm_mon -fort1", it shows me:
>>> MySQL_Server_Resource: migration-threshold=3
>>>    + (32) stop: last-rc-change='Wed Mar 17 10:49:55 2010'
>>> last-run='Wed Mar 17 10:49:55 2010' exec-time=5060ms queue-time=0ms
>>> rc=0 (ok)
>>>    + (40) start: last-rc-change='Wed Mar 17 11:09:06 2010'
>>> last-run='Wed Mar 17 11:09:06 2010' exec-time=4080ms queue-time=0ms
>>> rc=0 (ok)
>>>    + (41) monitor: interval=20000ms last-rc-change='Wed Mar 17
>>> 11:09:10 2010' last-run='Wed Mar 17 11:09:10 2010' exec-time=20ms
>>> queue-time=0ms rc=0 (ok)
>>>
>>> And the results above are from yesterday...
>>
>> The configuration looks fine to me.
>>
>> Cheers,
>>
>> Dejan
>>
>>> Thanks for your help.
>>> Kind regards,
>>> Tom
>>>
>>>
>>>
>>> 2010/3/18 Dejan Muhamedagic <deja...@fastmail.fm>:
>>> > Hi,
>>> >
>>> > On Wed, Mar 17, 2010 at 12:38:47PM +0100, Tom Tux wrote:
>>> >> Hi Dejan
>>> >>
>>> >> Thanks for your answer.
>>> >>
>>> >> I'm using this cluster with the packages from the HAE
>>> >> (High Availability Extension) repository for SLES11. Is it
>>> >> possible to upgrade cluster-glue from source in that setup?
>>> >
>>> > Yes, though I don't think that any SLE11 version has this bug.
>>> > When was your version released? What does hb_report -V say?
>>> >
>>> >> I think the better way is to wait for updates in the HAE
>>> >> repository from Novell. Or do you have experience upgrading
>>> >> cluster-glue from source (even if it was installed with
>>> >> zypper/rpm)?
>>> >>
>>> >> Do you know when the HAE repository will be updated?
>>> >
>>> > Can't say. Best would be if you talk to Novell about the issue.
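>>> >
>>> > In the meantime you can compare what is installed with what the
>>> > repository offers, e.g. something like (assuming the SLES package
>>> > is simply called cluster-glue):
>>> >
>>> >   rpm -q cluster-glue        # installed version
>>> >   zypper info cluster-glue   # version available in the repos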
>>> >
>>> > Cheers,
>>> >
>>> > Dejan
>>> >
>>> >> Thanks a lot.
>>> >> Tom
>>> >>
>>> >>
>>> >> 2010/3/17 Dejan Muhamedagic <deja...@fastmail.fm>:
>>> >> > Hi,
>>> >> >
>>> >> > On Wed, Mar 17, 2010 at 10:57:16AM +0100, Tom Tux wrote:
>>> >> >> Hi Dominik
>>> >> >>
>>> >> >> The problem is that the cluster does not run the monitor action
>>> >> >> every 20s. The last time it ran the action was at 09:21, and it
>>> >> >> is now 10:37:
>>> >> >
>>> >> > There was a serious bug in some cluster-glue packages. What
>>> >> > you're experiencing sounds like that. I can't say which
>>> >> > packages (probably sth like 1.0.1, they were never released). At
>>> >> > any rate, I'd suggest upgrading to cluster-glue 1.0.3.
>>> >> >
>>> >> > Thanks,
>>> >> >
>>> >> > Dejan
>>> >> >
>>> >> >> MySQL_MonitorAgent_Resource: migration-threshold=3
>>> >> >>    + (479) stop: last-rc-change='Wed Mar 17 09:21:28 2010'
>>> >> >> last-run='Wed Mar 17 09:21:28 2010' exec-time=3010ms queue-time=0ms
>>> >> >> rc=0 (ok)
>>> >> >>    + (480) start: last-rc-change='Wed Mar 17 09:21:31 2010'
>>> >> >> last-run='Wed Mar 17 09:21:31 2010' exec-time=3010ms queue-time=0ms
>>> >> >> rc=0 (ok)
>>> >> >>    + (481) monitor: interval=10000ms last-rc-change='Wed Mar 17
>>> >> >> 09:21:34 2010' last-run='Wed Mar 17 09:21:34 2010' exec-time=20ms
>>> >> >> queue-time=0ms rc=0 (ok)
>>> >> >>
>>> >> >> If I restart the whole cluster, then the new return code
>>> >> >> (exit 99 or exit 4) is seen by the cluster monitor.
>>> >> >>
>>> >> >>
>>> >> >> 2010/3/17 Dominik Klein <d...@in-telegence.net>:
>>> >> >> > Hi Tom
>>> >> >> >
>>> >> >> > have a look at the logs and see whether the monitor op really
>>> >> >> > returns 99 (grep for the resource-id). If so, I'm not sure
>>> >> >> > what the cluster does with rc=99. As far as I know, rc=4
>>> >> >> > would be status=failed (unknown actually).
>>> >> >> >
>>> >> >> > Regards
>>> >> >> > Dominik
>>> >> >> >
>>> >> >> > Tom Tux wrote:
>>> >> >> >> Thanks for your hint.
>>> >> >> >>
>>> >> >> >> I've configured an lsb resource like this (with
>>> >> >> >> migration-threshold):
>>> >> >> >>
>>> >> >> >> primitive MySQL_MonitorAgent_Resource lsb:mysql-monitor-agent \
>>> >> >> >>         meta target-role="Started" migration-threshold="3" \
>>> >> >> >>         op monitor interval="10s" timeout="20s" on-fail="restart"
>>> >> >> >>
>>> >> >> >> I have now modified the init script
>>> >> >> >> "/etc/init.d/mysql-monitor-agent" to exit with a return code
>>> >> >> >> other than "0" (for example exit 99) when the monitor
>>> >> >> >> operation queries the status (a sketch of the change is at
>>> >> >> >> the end of this mail). But the cluster does not recognise a
>>> >> >> >> failed monitor action. Why this behaviour? For the cluster,
>>> >> >> >> everything seems ok.
>>> >> >> >>
>>> >> >> >> node1:/ # showcores.sh MySQL_MonitorAgent_Resource
>>> >> >> >> Resource                     Score     Node   Stickiness  #Fail  Migration-Threshold
>>> >> >> >> MySQL_MonitorAgent_Resource  -1000000  node1  100         0      3
>>> >> >> >> MySQL_MonitorAgent_Resource  100       node2  100         0      3
>>> >> >> >>
>>> >> >> >> I also saw that the "last-run" entry (crm_mon -fort1) for
>>> >> >> >> this resource is not up to date. It seems to me that the
>>> >> >> >> monitor action does not occur every 10 seconds. Why? Any
>>> >> >> >> hints for this behaviour?
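>>> >> >> >>
>>> >> >> >> The change I made to the status branch looks roughly like
>>> >> >> >> this (simplified; the real script of course does more):
>>> >> >> >>
>>> >> >> >>   case "$1" in
>>> >> >> >>       status)
>>> >> >> >>           # simulate a failed status check for testing
>>> >> >> >>           exit 99
>>> >> >> >>           ;;
>>> >> >> >>   esac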
>>> >> >> >>
>>> >> >> >> Thanks a lot.
>>> >> >> >> Tom
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> 2010/3/16 Dominik Klein <d...@in-telegence.net>:
>>> >> >> >>> Tom Tux wrote:
>>> >> >> >>>> Hi
>>> >> >> >>>>
>>> >> >> >>>> I have a question about resource monitoring:
>>> >> >> >>>> I'm monitoring an IP resource every 20 seconds. I have
>>> >> >> >>>> configured the "on-fail" action with "restart". This works
>>> >> >> >>>> fine: if the "monitor" operation fails, the resource is
>>> >> >> >>>> restarted.
>>> >> >> >>>>
>>> >> >> >>>> But how can I define this resource to migrate to the other
>>> >> >> >>>> node if it still fails after 10 restarts? Is this possible?
>>> >> >> >>>> How will the "failcount" interact with this scenario?
>>> >> >> >>>>
>>> >> >> >>>> In the documentation I read that the resource "fail_count"
>>> >> >> >>>> will increase every time the resource restarts. But I
>>> >> >> >>>> can't see this fail_count.
>>> >> >> >>> Look at the meta attribute "migration-threshold".
>>> >> >> >>>
>>> >> >> >>> Regards
>>> >> >> >>> Dominik

_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker