On Tue, Aug 10, 2010 at 6:41 PM, David Lang <[email protected]> wrote:
> On Tue, 10 Aug 2010, Igor Chudov wrote:
>
>> Guys, I have a bit of clarification. In an attempt to avoid the timing
>> issues, an hour ago I tried adding a configuration change to
>> /etc/init.d/heartbeat to delay starting it by 2 minutes on one box. So
>> logs with takeover succeeding, and heartbeat shutting down are partly
>> an artifact of this change, as things never worked like that before.
>> You saw this and noticed that it was different from before.
>>
>> I took that out and I am back to the exact situation I always was in
>> (no one takes over). Logs are at the bottom. What I do know from this
>> experiment is that resource acquisition itself is unlikely to be to blame.
>>
>> What I see now is back to what I saw yesterday and prior, and makes no
>> sense to me.
>
> nothing else shows up in the logs? I would expect the boxes to sit like this
> for 40 seconds or so (2x deadtime setting IIRC, but it could be 30 sec +
> deadtime) and then there would be additional log entries.
>
I just checked: the machines have been up since I sent the previous email
(42 minutes ago), and nothing new has been added to the log files.

> As I noted in a prior e-mail, to work around issues where Cisco switches won't
> pass any traffic for 30 seconds after the port becomes live (I think they do
> spanning tree detection) heartbeat sits extra long when it first boots and
> doesn't hear anything, just in case the switch is preventing it from seeing
> another system that's up.
>

There is a crossover cable directly between their eth1 interfaces.
Broadcasting happens on eth1 too (per the configs that I posted; I hope
that I am not wrong).

i

> David Lang
>
>> pfs-srv3:
>>
>>
>> Aug 10 18:04:41 pfs-srv3 logd: [955]: WARN: Core dumps could be lost
>> if multiple dumps occur.
>> Aug 10 18:04:41 pfs-srv3 logd: [955]: WARN: Consider setting
>> non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
>> maximum supportability
>> Aug 10 18:04:41 pfs-srv3 logd: [955]: WARN: Consider setting
>> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
>> supportability
>> Aug 10 18:04:41 pfs-srv3 logd: [955]: info: G_main_add_SignalHandler:
>> Added signal handler for signal 15
>> Aug 10 18:04:41 pfs-srv3 logd: [986]: info: G_main_add_SignalHandler:
>> Added signal handler for signal 15
>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: Enabling logging daemon
>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: logfile and debug
>> file are those specified in logd config file (default /etc/logd.cf)
>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: Version 2 support: off
>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: AUTH: i=1: key =
>> 0x88e6b30, auth=0xb7200034, authname=md5
>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: WARN: Core dumps could be
>> lost if multiple dumps occur.
>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: WARN: Consider setting
>> non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
>> maximum supportability
>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: WARN: Consider setting
>> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
>> supportability
>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: **************************
>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: Configuration
>> validated. Starting heartbeat 3.0.2
>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: Heartbeat Hg
>> Version: node: ed844d11ea2b603f7d01cce1700d6c1fcb404d29
>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: heartbeat: version 3.0.2
>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: Heartbeat
>> generation: 1279723767
>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: glib: UDP Broadcast
>> heartbeat started on port 12694 (12694) interface eth1
>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: glib: UDP Broadcast
>> heartbeat closed on port 12694 interface eth1 - Status: 1
>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info:
>> G_main_add_TriggerHandler: Added signal manual handler
>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info:
>> G_main_add_TriggerHandler: Added signal manual handler
>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info:
>> G_main_add_SignalHandler: Added signal handler for signal 17
>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: Local status now set to:
>> 'up'
>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: Link pfs-srv3:eth1 up.
>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: Managed
>> write_hostcachedata process 1222 exited with return code 0.
>> Aug 10 18:04:44 pfs-srv3 heartbeat: [1180]: info: Link pfs-srv4:eth1 up.
>> Aug 10 18:04:44 pfs-srv3 heartbeat: [1180]: info: Managed
>> write_hostcachedata process 1223 exited with return code 0.
>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Status update for
>> node pfs-srv4: status up
>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Status update for
>> node pfs-srv4: status active
>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Comm_now_up():
>> updating status to active
>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Local status now set
>> to: 'active'
>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Managed
>> write_hostcachedata process 1264 exited with return code 0.
>> Aug 10 18:04:45 pfs-srv3 harc[1263]: [1271]: info: Running
>> /etc/ha.d//rc.d/status status
>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Managed status
>> process 1263 exited with return code 0.
>> Aug 10 18:04:45 pfs-srv3 harc[1276]: [1282]: info: Running
>> /etc/ha.d//rc.d/status status
>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Managed status
>> process 1276 exited with return code 0.
>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Managed
>> write_delcachedata process 1266 exited with return code 0.
>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: other_holds_resources: 0
>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: STATE 1 => 3
>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info: local resource
>> transition completed.
>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info:
>> AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (0))
>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info: Initial resource
>> acquisition complete (T_RESOURCES(us))
>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1441]: info: 1 local resources
>> from [/usr/share/heartbeat/ResourceManager listkeys pfs-srv3]
>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1441]: info: Local Resource
>> acquisition completed.
>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1441]: info: FIFO message [type
>> resource] written rc=81
>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info:
>> AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info: Managed
>> req_our_resources(ask) process 1441 exited with return code 0.
>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: other_holds_resources: 0
>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: remote resource
>> transition completed.
>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info:
>> AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: other_holds_resources: 1
>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: other_holds_resources: 1
>>
>>
>> pfs-srv4:
>>
>>
>> Aug 10 18:04:43 pfs-srv4 logd: [899]: info: logd started with /etc/logd.cf.
>> Aug 10 18:04:43 pfs-srv4 logd: [899]: WARN: Core dumps could be lost
>> if multiple dumps occur.
>> Aug 10 18:04:43 pfs-srv4 logd: [899]: WARN: Consider setting
>> non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
>> maximum supportability
>> Aug 10 18:04:43 pfs-srv4 logd: [899]: WARN: Consider setting
>> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
>> supportability
>> Aug 10 18:04:43 pfs-srv4 logd: [899]: info: G_main_add_SignalHandler:
>> Added signal handler for signal 15
>> Aug 10 18:04:43 pfs-srv4 logd: [909]: info: G_main_add_SignalHandler:
>> Added signal handler for signal 15
>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: info: Enabling logging daemon
>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: info: logfile and debug
>> file are those specified in logd config file (default /etc/logd.cf)
>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: info: Version 2 support: off
>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: info: AUTH: i=1: key =
>> 0x9960ac8, auth=0xb7147034, authname=md5
>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: WARN: Core dumps could be
>> lost if multiple dumps occur.
>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: WARN: Consider setting
>> non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
>> maximum supportability
>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: WARN: Consider setting
>> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
>> supportability
>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1161]: info: **************************
>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1161]: info: Configuration
>> validated. Starting heartbeat 3.0.2
>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1161]: info: Heartbeat Hg
>> Version: node: ed844d11ea2b603f7d01cce1700d6c1fcb404d29
>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: heartbeat: version 3.0.2
>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: Heartbeat
>> generation: 1279723774
>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: glib: UDP Broadcast
>> heartbeat started on port 12694 (12694) interface eth1
>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: glib: UDP Broadcast
>> heartbeat closed on port 12694 interface eth1 - Status: 1
>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info:
>> G_main_add_TriggerHandler: Added signal manual handler
>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info:
>> G_main_add_TriggerHandler: Added signal manual handler
>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info:
>> G_main_add_SignalHandler: Added signal handler for signal 17
>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: Local status now set to:
>> 'up'
>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: Link pfs-srv4:eth1 up.
>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: Managed
>> write_hostcachedata process 1191 exited with return code 0.
>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Link pfs-srv3:eth1 up.
>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Status update for
>> node pfs-srv3: status up
>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Managed
>> write_hostcachedata process 1193 exited with return code 0.
>> Aug 10 18:04:45 pfs-srv4 harc[1192]: [1199]: info: Running
>> /etc/ha.d//rc.d/status status
>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Managed status
>> process 1192 exited with return code 0.
>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Comm_now_up():
>> updating status to active
>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Local status now set
>> to: 'active'
>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Managed
>> write_hostcachedata process 1204 exited with return code 0.
>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Managed
>> write_delcachedata process 1205 exited with return code 0.
>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: Status update for
>> node pfs-srv3: status active
>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info:
>> AnnounceTakeover(local 0, foreign 1, reason 'HB_R_BOTHSTARTING' (0))
>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: STATE 1 => 3
>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: STATE 3 => 2
>> Aug 10 18:04:46 pfs-srv4 harc[1213]: [1219]: info: Running
>> /etc/ha.d//rc.d/status status
>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: Managed status
>> process 1213 exited with return code 0.
>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: remote resource
>> transition completed.
>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: STATE 2 => 3
>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: other_holds_resources: 1
>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: remote resource
>> transition completed.
>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info:
>> AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (0))
>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: Initial resource
>> acquisition complete (T_RESOURCES(us))
>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info:
>> AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(them)' (1))
>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: STATE 3 => 4
>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: other_holds_resources: 1
>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1298]: info: No local resources
>> [/usr/share/heartbeat/ResourceManager listkeys pfs-srv4] to acquire.
>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1298]: info: FIFO message [type
>> resource] written rc=81
>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info:
>> AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: Managed
>> req_our_resources(ask) process 1298 exited with return code 0.
>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: other_holds_resources: 1
>> _______________________________________________
>> Linux-HA mailing list
>> [email protected]
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems
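
To compare the two nodes' logs side by side, it may help to pull out just the
entries about state and resource ownership. The following is only a rough
sketch, not anything that ships with heartbeat: the names parse_line and
key_events and the keyword list are my own assumptions, and it expects each
log entry on a single line with the mail quote markers and wrapping removed.

```python
import re

# Syslog-style entry as it appears in the logs above, for example:
#   Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: STATE 1 => 3
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<prog>[^\s:]+):\s+"
    r"\[(?P<pid>\d+)\]:\s+"
    r"(?P<level>\w+):\s+"
    r"(?P<msg>.*)$"
)

# Assumed list of message fragments worth comparing between nodes.
KEYWORDS = ("STATE", "other_holds_resources", "AnnounceTakeover")


def parse_line(line):
    """Return a dict of fields for one log entry, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def key_events(lines):
    """Return (timestamp, host, message) for entries mentioning a keyword."""
    events = []
    for line in lines:
        entry = parse_line(line)
        if entry and any(word in entry["msg"] for word in KEYWORDS):
            events.append((entry["ts"], entry["host"], entry["msg"]))
    return events


if __name__ == "__main__":
    sample = [
        "Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: STATE 1 => 3",
        "Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: STATE 3 => 2",
        "Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: other_holds_resources: 1",
    ]
    for ts, host, msg in key_events(sample):
        print(ts, host, msg)
```

Running key_events over both files and sorting the result by timestamp gives a
short interleaved view of what each node believed about resource ownership.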
