On 10/4/07, Andrew W. Nosenko <[EMAIL PROTECTED]> wrote:
> Heartbeat-2.1.2
> If a resource (the test-daemon process) is killed too frequently, then
> heartbeat marks this resource/process as "failed" and doesn't try to
> restart the process or move it to the other node.
>
> If the frequency of killing is low enough, then the 'test-daemon' process
> is restarted on the same node without any problems (heartbeat doesn't try
> to move it to the other node, but that seems like an absolutely different
> story).

indeed - http://linux-ha.org/v2/faq/forced_failover

> Interesting that after falling into this situation ('test-daemon' is
> neither restarted on the 'awn' node nor migrated to the second node
> 'lisiy'), the "victim" 'test-daemon' resource is restarted
> automatically on the first node ('awn') if the second node goes away
> (heartbeat is shut down correctly).
>
> Cluster configured as symmetric, all "stickiness" values are default,

which is why it's not being moved automagically

> The 'test-daemon' process has a 'monitor' operation with the default
> (absent) "on_fail" attribute. If I set "on_fail" to "restart", the
> problem doesn't go away; the result is the same.

right, that's the default behaviour

> The "victim" 'test-daemon'
> process lives under group 'test-group' on the node "awn" (at the time
> of this test).
>
> Some race condition in the resource recovery code?
>
> Logs of the full cycle (from start to stop) and "cibadmin -Q" output
> are attached.

can you attach the following 2 files from awn:
  /var/lib/heartbeat/pengine/pe-warn-304.bz2
  /var/lib/heartbeat/pengine/pe-warn-305.bz2

they contain exactly what the PE was working with at the time

> The point of the last kill (after which 'test-daemon' was not
> restarted) can be found in ha-log.awn, at the line:
>
> Oct 4 14:22:53 awn test-daemon[6759]: Signal #15 (Terminated: 15)
> received. Terminating...
>
> Attached files:
> ha-log.awn -- log from node 'awn' (DC and node where the "victim" process runs)
> ha-log.lisiy -- log from the second node
> cib.xml -- output of 'cibadmin -Q'
>
> 'crm_mon' cut'n'paste follows:
>
> ============
> Last updated: Thu Oct 4 14:23:15 2007
> Current DC: awn (2ac97182-5b64-4edb-a528-ee6d160c326a)
> 2 Nodes configured.
> 2 Resources configured.
> ============
>
> Node: awn (2ac97182-5b64-4edb-a528-ee6d160c326a): online
> Node: lisiy.ua3 (9888b89c-94bb-4505-ab34-f84deced5e9d): online
>
> Resource Group: test-group
>     test-ip (heartbeat::ocf:IPaddr): Started awn
>     test-daemon (awn::ocf:test-daemon.ocf): Started awn FAILED
> Clone Set: test-pingd-clone
>     test-pingd:0 (heartbeat::ocf:pingd): Started awn
>     test-pingd:1 (heartbeat::ocf:pingd): Started lisiy.ua3
>
> Failed actions:
>     test-daemon_monitor_5000 (node=awn, call=17, rc=7): complete
>
> -----[ end of crm_mon screen]-----
>
> PS. Please excuse my English.
>
> --
> Andrew W. Nosenko <[EMAIL PROTECTED]>
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
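
For anyone reading the "Failed actions" entry in the crm_mon output above: rc=7 is the OCF "not running" exit code, i.e. the 5-second monitor found test-daemon stopped after the kill. Below is a rough sketch in Python (not part of heartbeat) that splits such an entry into its fields. The function name, the regular expression, and the assumption that the trailing number in the operation key is the interval in milliseconds are mine, not something taken from the heartbeat sources.

```python
import re

# Subset of the OCF resource agent exit codes; 7 means "not running".
OCF_RC_NAMES = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    7: "OCF_NOT_RUNNING",
}

# Hypothetical pattern for one crm_mon "Failed actions" entry, e.g.
#   test-daemon_monitor_5000 (node=awn, call=17, rc=7): complete
FAILED_ACTION_RE = re.compile(
    r"(?P<resource>\S+)_(?P<op>[a-z]+)_(?P<interval>\d+)\s+"
    r"\(node=(?P<node>[^,]+),\s*call=(?P<call>\d+),\s*rc=(?P<rc>\d+)\):"
    r"\s*(?P<status>\w+)"
)


def parse_failed_action(line):
    """Return the fields of a failed-action entry as a dict, or None."""
    match = FAILED_ACTION_RE.search(line)
    if match is None:
        return None
    rc = int(match.group("rc"))
    return {
        "resource": match.group("resource"),
        "op": match.group("op"),
        # Assumed to be the operation interval in milliseconds.
        "interval_ms": int(match.group("interval")),
        "node": match.group("node"),
        "call": int(match.group("call")),
        "rc": rc,
        # Codes outside the small table above are reported as "unknown".
        "rc_name": OCF_RC_NAMES.get(rc, "unknown"),
        "status": match.group("status"),
    }


if __name__ == "__main__":
    entry = "test-daemon_monitor_5000 (node=awn, call=17, rc=7): complete"
    print(parse_failed_action(entry))
```

Run against the entry from the crm_mon screen, this reports the resource as test-daemon, the operation as monitor, and the return code as OCF_NOT_RUNNING, which matches the resource being shown as FAILED on awn.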
