Pushkar, here are the logs from both servers. My read on the situation is that each node thinks the other one holds the resources (both end up logging "other_holds_resources: 1").
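To check that read, here is a minimal sketch in Python that pulls the other_holds_resources lines out of an ha-log excerpt and reports the last value each node logged. It assumes the syslog-style line format seen in these logs; the function names (other_holds_history, last_value_per_host) are made up for illustration and are not part of Heartbeat, and the script only reports the numbers as logged without interpreting what they mean internally.

```python
import re

# Matches lines such as:
#   Jul 27 12:03:53 pfs-srv3 heartbeat: [1430]: info: other_holds_resources: 1
# The pattern is an assumption based on the log excerpts in this message.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+heartbeat:\s+"
    r"\[\d+\]:\s+info:\s+other_holds_resources:\s+(?P<value>\d+)\s*$"
)


def other_holds_history(log_text):
    """Return (timestamp, host, value) for every other_holds_resources line."""
    history = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            history.append((match["ts"], match["host"], int(match["value"])))
    return history


def last_value_per_host(history):
    """Return the most recently logged value for each host, in log order."""
    last = {}
    for _ts, host, value in history:
        last[host] = value
    return last


# A few lines copied from the two logs in this message.
SAMPLE = """\
Jul 27 12:03:43 pfs-srv3 heartbeat: [1430]: info: other_holds_resources: 0
Jul 27 12:03:53 pfs-srv3 heartbeat: [1430]: info: remote resource transition completed.
Jul 27 12:03:53 pfs-srv3 heartbeat: [1430]: info: other_holds_resources: 1
Jul 27 12:03:37 pfs-srv4 heartbeat: [1249]: info: other_holds_resources: 0
Jul 27 12:03:53 pfs-srv4 heartbeat: [1249]: info: other_holds_resources: 1
"""

if __name__ == "__main__":
    for host, value in last_value_per_host(other_holds_history(SAMPLE)).items():
        print(f"{host}: last other_holds_resources = {value}")
```

Running it over the full output of both logs (concatenated into one string) would show whether both nodes finish with the same non-zero value.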
Any help will be appreciated.

Thank you

Igor

========================================================================
r...@pfs-srv3:~# tail -40 /var/log/ha-log
Jul 27 12:03:38 pfs-srv3 heartbeat: [1430]: info: heartbeat: version 3.0.2
Jul 27 12:03:39 pfs-srv3 heartbeat: [1430]: info: Heartbeat generation: 1279723736
Jul 27 12:03:39 pfs-srv3 heartbeat: [1430]: info: glib: UDP Broadcast heartbeat started on port 12694 (12694) interface eth1
Jul 27 12:03:39 pfs-srv3 heartbeat: [1430]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
Jul 27 12:03:39 pfs-srv3 heartbeat: [1430]: info: G_main_add_TriggerHandler: Added signal manual handler
Jul 27 12:03:39 pfs-srv3 heartbeat: [1430]: info: G_main_add_TriggerHandler: Added signal manual handler
Jul 27 12:03:39 pfs-srv3 heartbeat: [1430]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Jul 27 12:03:39 pfs-srv3 heartbeat: [1430]: info: Local status now set to: 'up'
Jul 27 12:03:39 pfs-srv3 heartbeat: [1430]: info: Link pfs-srv3:eth1 up.
Jul 27 12:03:39 pfs-srv3 heartbeat: [1430]: info: Managed write_hostcachedata process 1483 exited with return code 0.
Jul 27 12:03:41 pfs-srv3 heartbeat: [1430]: info: Link pfs-srv4:eth1 up.
Jul 27 12:03:41 pfs-srv3 heartbeat: [1430]: info: Status update for node pfs-srv4: status up
Jul 27 12:03:41 pfs-srv3 heartbeat: [1430]: info: Managed write_hostcachedata process 1486 exited with return code 0.
Jul 27 12:03:42 pfs-srv3 harc[1485]: [1492]: info: Running /etc/ha.d//rc.d/status status
Jul 27 12:03:42 pfs-srv3 heartbeat: [1430]: info: Managed status process 1485 exited with return code 0.
Jul 27 12:03:42 pfs-srv3 heartbeat: [1430]: info: Comm_now_up(): updating status to active
Jul 27 12:03:42 pfs-srv3 heartbeat: [1430]: info: Local status now set to: 'active'
Jul 27 12:03:42 pfs-srv3 heartbeat: [1430]: info: Managed write_hostcachedata process 1498 exited with return code 0.
Jul 27 12:03:43 pfs-srv3 heartbeat: [1430]: info: Managed write_delcachedata process 1499 exited with return code 0.
Jul 27 12:03:43 pfs-srv3 heartbeat: [1430]: info: Status update for node pfs-srv4: status active
Jul 27 12:03:43 pfs-srv3 heartbeat: [1430]: info: AnnounceTakeover(local 0, foreign 1, reason 'HB_R_BOTHSTARTING' (0))
Jul 27 12:03:43 pfs-srv3 heartbeat: [1430]: info: STATE 1 => 3
Jul 27 12:03:43 pfs-srv3 heartbeat: [1430]: info: STATE 3 => 2
Jul 27 12:03:43 pfs-srv3 heartbeat: [1430]: info: other_holds_resources: 0
Jul 27 12:03:43 pfs-srv3 heartbeat: [1430]: info: STATE 2 => 3
Jul 27 12:03:43 pfs-srv3 harc[1500]: [1506]: info: Running /etc/ha.d//rc.d/status status
Jul 27 12:03:43 pfs-srv3 heartbeat: [1430]: info: Managed status process 1500 exited with return code 0.
Jul 27 12:03:53 pfs-srv3 heartbeat: [1430]: info: local resource transition completed.
Jul 27 12:03:53 pfs-srv3 heartbeat: [1430]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (0))
Jul 27 12:03:53 pfs-srv3 heartbeat: [1430]: info: Initial resource acquisition complete (T_RESOURCES(us))
Jul 27 12:03:53 pfs-srv3 heartbeat: [1512]: info: 1 local resources from [/usr/share/heartbeat/ResourceManager listkeys pfs-srv3]
Jul 27 12:03:53 pfs-srv3 heartbeat: [1512]: info: Local Resource acquisition completed.
Jul 27 12:03:53 pfs-srv3 heartbeat: [1512]: info: FIFO message [type resource] written rc=81
Jul 27 12:03:53 pfs-srv3 heartbeat: [1430]: info: other_holds_resources: 0
Jul 27 12:03:53 pfs-srv3 heartbeat: [1430]: info: remote resource transition completed.
Jul 27 12:03:53 pfs-srv3 heartbeat: [1430]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
Jul 27 12:03:53 pfs-srv3 heartbeat: [1430]: info: other_holds_resources: 1
Jul 27 12:03:53 pfs-srv3 heartbeat: [1430]: info: other_holds_resources: 1
Jul 27 12:03:53 pfs-srv3 heartbeat: [1430]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
Jul 27 12:03:53 pfs-srv3 heartbeat: [1430]: info: Managed req_our_resources(ask) process 1512 exited with return code 0.

==================================================================================
r...@pfs-srv4:~# tail -40 /var/log/ha-log
Jul 27 12:03:34 pfs-srv4 heartbeat: [1249]: info: heartbeat: version 3.0.2
Jul 27 12:03:35 pfs-srv4 heartbeat: [1249]: info: Heartbeat generation: 1279723741
Jul 27 12:03:35 pfs-srv4 heartbeat: [1249]: info: glib: UDP Broadcast heartbeat started on port 12694 (12694) interface eth1
Jul 27 12:03:35 pfs-srv4 heartbeat: [1249]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
Jul 27 12:03:35 pfs-srv4 heartbeat: [1249]: info: G_main_add_TriggerHandler: Added signal manual handler
Jul 27 12:03:35 pfs-srv4 heartbeat: [1249]: info: G_main_add_TriggerHandler: Added signal manual handler
Jul 27 12:03:35 pfs-srv4 heartbeat: [1249]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Jul 27 12:03:35 pfs-srv4 heartbeat: [1249]: info: Local status now set to: 'up'
Jul 27 12:03:35 pfs-srv4 heartbeat: [1249]: info: Managed write_hostcachedata process 1284 exited with return code 0.
Jul 27 12:03:36 pfs-srv4 heartbeat: [1249]: info: Link pfs-srv4:eth1 up.
Jul 27 12:03:36 pfs-srv4 heartbeat: [1249]: info: Link pfs-srv3:eth1 up.
Jul 27 12:03:36 pfs-srv4 heartbeat: [1249]: info: Status update for node pfs-srv3: status up
Jul 27 12:03:36 pfs-srv4 heartbeat: [1249]: info: Managed write_hostcachedata process 1286 exited with return code 0.
Jul 27 12:03:36 pfs-srv4 harc[1285]: [1292]: info: Running /etc/ha.d//rc.d/status status
Jul 27 12:03:36 pfs-srv4 heartbeat: [1249]: info: Managed status process 1285 exited with return code 0.
Jul 27 12:03:37 pfs-srv4 heartbeat: [1249]: info: Comm_now_up(): updating status to active
Jul 27 12:03:37 pfs-srv4 heartbeat: [1249]: info: Local status now set to: 'active'
Jul 27 12:03:37 pfs-srv4 heartbeat: [1249]: info: Status update for node pfs-srv3: status active
Jul 27 12:03:37 pfs-srv4 heartbeat: [1249]: info: AnnounceTakeover(local 0, foreign 1, reason 'HB_R_BOTHSTARTING' (0))
Jul 27 12:03:37 pfs-srv4 heartbeat: [1249]: info: STATE 1 => 3
Jul 27 12:03:37 pfs-srv4 heartbeat: [1249]: info: STATE 3 => 2
Jul 27 12:03:37 pfs-srv4 heartbeat: [1249]: info: Managed write_hostcachedata process 1298 exited with return code 0.
Jul 27 12:03:37 pfs-srv4 harc[1297]: [1305]: info: Running /etc/ha.d//rc.d/status status
Jul 27 12:03:37 pfs-srv4 heartbeat: [1249]: info: Managed status process 1297 exited with return code 0.
Jul 27 12:03:37 pfs-srv4 heartbeat: [1249]: info: other_holds_resources: 0
Jul 27 12:03:37 pfs-srv4 heartbeat: [1249]: info: STATE 2 => 3
Jul 27 12:03:37 pfs-srv4 heartbeat: [1249]: info: Managed write_delcachedata process 1299 exited with return code 0.
Jul 27 12:03:53 pfs-srv4 heartbeat: [1249]: info: remote resource transition completed.
Jul 27 12:03:53 pfs-srv4 heartbeat: [1249]: info: other_holds_resources: 1
Jul 27 12:03:53 pfs-srv4 heartbeat: [1249]: info: remote resource transition completed.
Jul 27 12:03:53 pfs-srv4 heartbeat: [1249]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (0))
Jul 27 12:03:53 pfs-srv4 heartbeat: [1249]: info: Initial resource acquisition complete (T_RESOURCES(us))
Jul 27 12:03:53 pfs-srv4 heartbeat: [1249]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(them)' (1))
Jul 27 12:03:53 pfs-srv4 heartbeat: [1249]: info: STATE 3 => 4
Jul 27 12:03:53 pfs-srv4 heartbeat: [1315]: info: No local resources [/usr/share/heartbeat/ResourceManager listkeys pfs-srv4] to acquire.
Jul 27 12:03:53 pfs-srv4 heartbeat: [1315]: info: FIFO message [type resource] written rc=81
Jul 27 12:03:53 pfs-srv4 heartbeat: [1249]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
Jul 27 12:03:53 pfs-srv4 heartbeat: [1249]: info: Managed req_our_resources(ask) process 1315 exited with return code 0.
Jul 27 12:03:53 pfs-srv4 heartbeat: [1249]: info: other_holds_resources: 1
Jul 27 12:03:53 pfs-srv4 heartbeat: [1249]: info: other_holds_resources: 1

On Mon, Jul 26, 2010 at 1:04 PM, Pushkar Pradhan <[email protected]> wrote:
>
> ________________________________
>
> From: [email protected] on behalf of Igor Chudov
> Sent: Mon 7/26/2010 7:16 AM
> To: [email protected]
> Subject: [Linux-HA] Big progress with Heartbeat,but simultaneous reboot
> leaves services unprovided
>
>
>
> I am setting up a two node cluster, using drbd and Heartbeat. I use
> standard packages on Ubuntu Hardy.
>
> The services being provided externally is a NFS and samba share that
> is on top of the DRBD filesystem, and the service IP address.
>
> I am not using corosync at the moment.
>
> At this point, most things work great: the shared services and IP
> address are passed around when servers reboot or are unplugged, etc.
>
> However, I HAVE ONE PROBLEM: if I simultaneously reboot both servers
> by typing reboot in both sessions, and then hitting ENTER in both at
> about the same time, neither of the servers acquires shared services,
> so they remain unprovided.
>
> If, after that, I reboot one of the servers again, then the unrebooted
> one acquires services. What exactly am I doing wrong?
>
> Here is my ha.cf and haresources:
>
> ==>cat ha.cf
> use_logd on
> udpport 12694
> keepalive 1
> warntime 15
> deadtime 20
> debug 1
> initdead 60
> bcast eth1
> node pfs-srv3
> node pfs-srv4
> auto_failback on
> crm off
>
>
> ==>cat haresources
> pfs-srv3 drbddisk::r0 Filesystem::/dev/drbd0::/pfs::ext3 10.1.8.45/24
> nfs-kernel-server smbd
>
>
>
>
> Did you see the logs? Does HA try to start the resources on the preferred
> node? Can you check what is the status reported by HB script
> (/etc/init.d/heartbeat status)?
> Also can you run cl_status with various arguments e.g. nodestatus, hbstatus?
> You can also run the individual resource scripts with the status argument to
> check what it reports (started/stopped)?
> pushkar
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
