Hi,

On Thu, Feb 18, 2010 at 08:22:26PM +0530, Jayakrishnan wrote:
> Hello Dejan,
>
> First of all, thank you very much for your reply. I found that one of my
> nodes has a permission problem: the ownership of /var/lib/pengine was
> set to "999:999". I am not sure how! However, I changed it.
>
> Sir, when I pull out the interface cable I get only this log message:
>
> Feb 18 16:55:58 node2 NetworkManager: <info> (eth0): carrier now OFF (device
> state 1)
>
> And the resource IP is not moving anywhere at all. It is still on the
> same machine. I can see that the IP is still assigned to the eth0
> interface via "ip addr show", even though the interface status is down.
> Is this the split-brain? If so, how can I clear it?
With fencing (stonith). Please read some of the documentation available
here: http://clusterlabs.org/wiki/Documentation

> Because of the on-fail=standby in the pgsql part of my cib I am able to
> fail over to the other node when I manually stop the postgres service on
> the active machine. However, even after restarting the postgres service via
> "/etc/init.d/postgresql-8.4 start" I have to run
>
>     crm resource cleanup <pgclone>

Yes, that's necessary.

> to make crm_mon (the cluster) recognize that the service is on. Until then
> it shows up as a failed action.
>
> crm_mon snippet
> --------------------------------------------------------------------
> Last updated: Thu Feb 18 20:17:28 2010
> Stack: Heartbeat
> Current DC: node2 (3952b93e-786c-47d4-8c2f-a882e3d3d105) - partition with
> quorum
> Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
> 2 Nodes configured, unknown expected votes
> 3 Resources configured.
> ============
>
> Node node2 (3952b93e-786c-47d4-8c2f-a882e3d3d105): standby (on-fail)
> Online: [ node1 ]
>
> vir-ip      (ocf::heartbeat:IPaddr2):       Started node1
> slony-fail  (lsb:slony_failover):           Started node1
> Clone Set: pgclone
>     Started: [ node1 ]
>     Stopped: [ pgsql:0 ]
>
> Failed actions:
>     pgsql:0_monitor_15000 (node=node2, call=33, rc=7, status=complete): not
> running
> --------------------------------------------------------------------------------
>
> Is there any way to run crm resource cleanup <resource> periodically?

Why would you want to do that? Do you expect your resources to fail
regularly?

> I don't know if there is any mistake in the pgsql OCF script, sir. I have
> given all the parameters correctly, but it gives a "syntax error" all the
> time when I use it.

Best to report such a case; it's either a configuration problem (did you
read its metadata?) or perhaps a bug in the RA.

Thanks,

Dejan

> I put the same meta attributes as for the current lsb,
> as shown below...
>
> Please help me out... should I reinstall the nodes again?
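To illustrate the fencing suggestion: enabling stonith means defining a
fencing resource for each node and switching stonith-enabled back on. A
minimal sketch in crm shell syntax, assuming IPMI-capable management
boards; the addresses and credentials are placeholders, and external/ipmi
is only one of the available stonith plugins:

```
primitive st-node1 stonith:external/ipmi \
        params hostname="node1" ipaddr="192.168.10.201" \
               userid="admin" passwd="secret" interface="lan"
primitive st-node2 stonith:external/ipmi \
        params hostname="node2" ipaddr="192.168.10.202" \
               userid="admin" passwd="secret" interface="lan"
location l-st-node1 st-node1 -inf: node1
location l-st-node2 st-node2 -inf: node2
property stonith-enabled="true"
```

The location constraints keep each fencing device off the node it is meant
to shoot. As for the periodic-cleanup question: a failure-timeout meta
attribute on a resource makes old failures expire on their own, which is
usually preferable to running crm resource cleanup from cron.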
>
> On Thu, Feb 18, 2010 at 6:50 PM, Dejan Muhamedagic <deja...@fastmail.fm> wrote:
>
> > Hi,
> >
> > On Thu, Feb 18, 2010 at 05:09:09PM +0530, Jayakrishnan wrote:
> > > Sir,
> > >
> > > I have set up a two-node cluster on Ubuntu 9.10. I have added a cluster IP
> > > using ocf:heartbeat:IPaddr2, cloned the lsb script "postgresql-8.4", and
> > > also added a manually created script for Slony database replication.
> > >
> > > Now everything works fine, but I am not able to use the OCF resource
> > > scripts. I mean failover is not taking place, or the resource is not even
> > > taken over. My ha.cf file and cib configuration are attached to this mail.
> > >
> > > My ha.cf file:
> > >
> > > autojoin none
> > > keepalive 2
> > > deadtime 15
> > > warntime 5
> > > initdead 64
> > > udpport 694
> > > bcast eth0
> > > auto_failback off
> > > node node1
> > > node node2
> > > crm respawn
> > > use_logd yes
> > >
> > > My cib.xml configuration in crm (cli) format:
> > >
> > > node $id="3952b93e-786c-47d4-8c2f-a882e3d3d105" node2 \
> > >         attributes standby="off"
> > > node $id="ac87f697-5b44-4720-a8af-12a6f2295930" node1 \
> > >         attributes standby="off"
> > > primitive pgsql lsb:postgresql-8.4 \
> > >         meta target-role="Started" resource-stickness="inherited" \
> > >         op monitor interval="15s" timeout="25s" on-fail="standby"
> > > primitive slony-fail lsb:slony_failover \
> > >         meta target-role="Started"
> > > primitive vir-ip ocf:heartbeat:IPaddr2 \
> > >         params ip="192.168.10.10" nic="eth0" cidr_netmask="24" broadcast="192.168.10.255" \
> > >         op monitor interval="15s" timeout="25s" on-fail="standby" \
> > >         meta target-role="Started"
> > > clone pgclone pgsql \
> > >         meta notify="true" globally-unique="false" interleave="true" target-role="Started"
> > > colocation ip-with-slony inf: slony-fail vir-ip
> > > order slony-b4-ip inf: vir-ip slony-fail
> > > property $id="cib-bootstrap-options" \
> > >         dc-version="1.0.5-3840e6b5a305ccb803d29b468556739e75532d56" \
> > >         cluster-infrastructure="Heartbeat" \
> > >         no-quorum-policy="ignore" \
> > >         stonith-enabled="false" \
> > >         last-lrm-refresh="1266488780"
> > > rsc_defaults $id="rsc-options" \
> > >         resource-stickiness="INFINITY"
> > >
> > > I am assigning the cluster IP (192.168.10.10) on eth0, which has IP
> > > 192.168.10.129 on one machine and 192.168.10.130 on the other.
> > >
> > > When I pull out the eth0 interface cable, failover does not take place.
> >
> > That's split brain. More than a resource failure. Without
> > stonith, you'll have both nodes running all resources.
> >
> > > This is the log message I am getting while I pull out the cable:
> > >
> > > "Feb 18 16:55:58 node2 NetworkManager: <info> (eth0): carrier now OFF
> > > (device state 1)"
> > >
> > > and after a minute or two
> > >
> > > log snippet:
> > > -------------------------------------------------------------------
> > > Feb 18 16:57:37 node2 cib: [21940]: info: cib_stats: Processed 3 operations
> > > (13333.00us average, 0% utilization) in the last 10min
> > > Feb 18 17:02:53 node2 crmd: [21944]: info: crm_timer_popped: PEngine Recheck
> > > Timer (I_PE_CALC) just popped!
> > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: State
> > > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED
> > > origin=crm_timer_popped ]
> > > Feb 18 17:02:53 node2 crmd: [21944]: WARN: do_state_transition: Progressed
> > > to state S_POLICY_ENGINE after C_TIMER_POPPED
> > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: All 2
> > > cluster nodes are eligible to run resources.
> > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_pe_invoke: Query 111:
> > > Requesting the current CIB: S_POLICY_ENGINE
> > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_pe_invoke_callback: Invoking
> > > the PE: ref=pe_calc-dc-1266492773-121, seq=2, quorate=1
> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_config: On loss of
> > > CCM Quorum: Ignore
> > > Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_config: Node scores:
> > > 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> > > Feb 18 17:02:53 node2 pengine: [21982]: info: determine_online_status: Node
> > > node2 is online
> > > Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_rsc_op:
> > > slony-fail_monitor_0 on node2 returned 0 (ok) instead of the expected value:
> > > 7 (not running)
> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_rsc_op: Operation
> > > slony-fail_monitor_0 found resource slony-fail active on node2
> > > Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_rsc_op:
> > > pgsql:0_monitor_0 on node2 returned 0 (ok) instead of the expected value:
> > > 7 (not running)
> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_rsc_op: Operation
> > > pgsql:0_monitor_0 found resource pgsql:0 active on node2
> > > Feb 18 17:02:53 node2 pengine: [21982]: info: determine_online_status: Node
> > > node1 is online
> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: native_print:
> > > vir-ip#011(ocf::heartbeat:IPaddr2):#011Started node2
> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: native_print:
> > > slony-fail#011(lsb:slony_failover):#011Started node2
> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: clone_print: Clone Set:
> > > pgclone
> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: print_list: #011Started: [
> > > node2 node1 ]
> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: RecurringOp: Start
> > > recurring monitor (15s) for pgsql:1 on node1
> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource
> > > vir-ip#011(Started node2)
> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource
> > > slony-fail#011(Started node2)
> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource
> > > pgsql:0#011(Started node2)
> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource
> > > pgsql:1#011(Started node1)
> > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: State
> > > transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
> > > cause=C_IPC_MESSAGE origin=handle_response ]
> > > Feb 18 17:02:53 node2 crmd: [21944]: info: unpack_graph: Unpacked transition
> > > 26: 1 actions in 1 synapses
> > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_te_invoke: Processing graph 26
> > > (ref=pe_calc-dc-1266492773-121) derived from
> > > /var/lib/pengine/pe-input-125.bz2
> > > Feb 18 17:02:53 node2 crmd: [21944]: info: te_rsc_command: Initiating action
> > > 15: monitor pgsql:1_monitor_15000 on node1
> > > Feb 18 17:02:53 node2 pengine: [21982]: ERROR: write_last_sequence: Cannout
> > > open series file /var/lib/pengine/pe-input.last for writing
> >
> > This is probably a permission problem. /var/lib/pengine should be
> > owned by haclient:hacluster.
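For anyone hitting the same problem, the ownership fix Dejan describes is a
one-liner, run as root on the affected node. On Heartbeat/Pacemaker
packages the cluster user is normally hacluster and the group haclient;
verify the names on your distribution before running it:

```
# restore the ownership the policy engine expects on its state directory
chown -R hacluster:haclient /var/lib/pengine
```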
> >
> > > Feb 18 17:02:53 node2 pengine: [21982]: info: process_pe_message: Transition
> > > 26: PEngine Input stored in: /var/lib/pengine/pe-input-125.bz2
> > > Feb 18 17:02:55 node2 crmd: [21944]: info: match_graph_event: Action
> > > pgsql:1_monitor_15000 (15) confirmed on node1 (rc=0)
> > > Feb 18 17:02:55 node2 crmd: [21944]: info: run_graph:
> > > ====================================================
> > > Feb 18 17:02:55 node2 crmd: [21944]: notice: run_graph: Transition 26
> > > (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> > > Source=/var/lib/pengine/pe-input-125.bz2): Complete
> > > Feb 18 17:02:55 node2 crmd: [21944]: info: te_graph_trigger: Transition 26
> > > is now complete
> > > Feb 18 17:02:55 node2 crmd: [21944]: info: notify_crmd: Transition 26
> > > status: done - <null>
> > > Feb 18 17:02:55 node2 crmd: [21944]: info: do_state_transition: State
> > > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> > > cause=C_FSA_INTERNAL origin=notify_crmd ]
> > > Feb 18 17:02:55 node2 crmd: [21944]: info: do_state_transition: Starting
> > > PEngine Recheck Timer
> > > ------------------------------------------------------------------------------
> >
> > I don't see anything in the logs about the IP address resource.
> >
> > > Also I am not able to use the pgsql OCF script, and hence I am using the init
> >
> > Why is that? Something wrong with pgsql? If so, then it should be
> > fixed. It's always much better to use the OCF instead of the LSB RA.
> >
> > Thanks,
> >
> > Dejan
> >
> > > script and cloned it, as I need to run it on both nodes for Slony database
> > > replication.
> > >
> > > I am using the heartbeat and pacemaker debs from the updated Ubuntu karmic
> > > repo. (Heartbeat 2.99)
> > >
> > > Please check my configuration and tell me what I am missing...
> > > --
> > > Regards,
> > >
> > > Jayakrishnan.
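For completeness, a sketch of what switching the clone from the LSB script
to the OCF resource agent might look like. The parameter values are guesses
for a Debian/Ubuntu postgresql-8.4 layout, not a tested configuration;
`crm ra info ocf:heartbeat:pgsql` shows the parameters the agent actually
accepts on your system:

```
primitive pgsql ocf:heartbeat:pgsql \
        params pgctl="/usr/lib/postgresql/8.4/bin/pg_ctl" \
               pgdata="/var/lib/postgresql/8.4/main" \
               psql="/usr/bin/psql" \
        op monitor interval="15s" timeout="25s" on-fail="standby"
clone pgclone pgsql \
        meta globally-unique="false" interleave="true"
```

Unlike the LSB wrapper, the OCF agent's monitor action actually connects to
the database, so failures are detected at the service level rather than the
init-script level.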
> > > Visit: www.jayakrishnan.bravehost.com
> > >
> > > _______________________________________________
> > > Pacemaker mailing list
> > > Pacemaker@oss.clusterlabs.org
> > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> --
> Regards,
>
> Jayakrishnan.
>
> Visit: www.jayakrishnan.bravehost.com