On Tue, Aug 10, 2010 at 7:05 PM, David Lang <[email protected]> wrote:
> On Tue, 10 Aug 2010, Igor Chudov wrote:
>
>> On Tue, Aug 10, 2010 at 6:41 PM, David Lang
>> <[email protected]> wrote:
>>> On Tue, 10 Aug 2010, Igor Chudov wrote:
>>>
>>>> Guys, I have a bit of clarification. In an attempt to avoid the timing
>>>> issues, an hour ago I tried adding a configuration change to
>>>> /etc/init.d/heartbeat to delay starting it by 2 minutes on one box. So
>>>> the logs with the takeover succeeding, and heartbeat shutting down, are
>>>> partly an artifact of this change, as things never worked like that
>>>> before. You saw this and noticed that it was different from before.
>>>>
>>>> I took that out and I am back to the exact situation I was always in
>>>> (no one takes over). Logs are at the bottom. What I do know from this
>>>> experiment is that resource acquisition itself is unlikely to be to
>>>> blame.
>>>>
>>>> What I see now is back to what I saw yesterday and prior, and it makes
>>>> no sense to me.
>>>
>>> Nothing else shows up in the logs? I would expect the boxes to sit like
>>> this for 40 seconds or so (2x the deadtime setting IIRC, but it could be
>>> 30 sec + deadtime) and then there would be additional log entries.
>>
>> I just checked: the machines have been up since I sent the previous email
>> (42 minutes), and nothing new was added to the log files.
>
> But if you stop heartbeat on either box, the other becomes active?
Correct.

>>> As I noted in a prior e-mail, to work around issues where Cisco switches
>>> won't pass any traffic for 30 seconds after the port becomes live (I
>>> think they do spanning-tree detection), heartbeat sits extra long when
>>> it first boots and doesn't hear anything, just in case the switch is
>>> preventing it from seeing another system that's up.
>>
>> There is a crossover cable directly between their eth1 interfaces.
>>
>> Broadcasting happens on eth1 too (per the configs that I posted; I hope
>> that I am not wrong).
>
> The startup delay happens even if you have a crossover cable. The issue
> is that the systems can't know what the connectivity is, so they play it
> safe rather than running the risk of causing a split brain due to the
> switch just not passing the traffic soon enough.

But why don't they pick up once the connection is established?

> One thing in your config that I am not familiar with is the line about
> logd.
>
> Hmm, digging around on the wiki, it looks like there is a separate config
> file for it (by default /etc/logd.cf). What does it say? (It could be
> putting the logs we are looking for elsewhere, like in syslog.)

Yes, but I think that it is not essential.

> It looks like in newer versions of heartbeat the explicit parameter
> initdead replaces the delay I was mentioning above. The file you posted
> has this set to 180 seconds, so after heartbeat starts it should sit for
> that long before doing anything else.

Right, but nothing good happens in 40 minutes or more.

> David Lang
>
>>> David Lang
>>>
>>>> pfs-srv3:
>>>>
>>>> Aug 10 18:04:41 pfs-srv3 logd: [955]: WARN: Core dumps could be lost if multiple dumps occur.
>>>> Aug 10 18:04:41 pfs-srv3 logd: [955]: WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability
>>>> Aug 10 18:04:41 pfs-srv3 logd: [955]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
>>>> Aug 10 18:04:41 pfs-srv3 logd: [955]: info: G_main_add_SignalHandler: Added signal handler for signal 15
>>>> Aug 10 18:04:41 pfs-srv3 logd: [986]: info: G_main_add_SignalHandler: Added signal handler for signal 15
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: Enabling logging daemon
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: logfile and debug file are those specified in logd config file (default /etc/logd.cf)
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: Version 2 support: off
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: AUTH: i=1: key = 0x88e6b30, auth=0xb7200034, authname=md5
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: WARN: Core dumps could be lost if multiple dumps occur.
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: **************************
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: Configuration validated. Starting heartbeat 3.0.2
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: Heartbeat Hg Version: node: ed844d11ea2b603f7d01cce1700d6c1fcb404d29
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: heartbeat: version 3.0.2
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: Heartbeat generation: 1279723767
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: glib: UDP Broadcast heartbeat started on port 12694 (12694) interface eth1
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: G_main_add_TriggerHandler: Added signal manual handler
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: G_main_add_TriggerHandler: Added signal manual handler
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: G_main_add_SignalHandler: Added signal handler for signal 17
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: Local status now set to: 'up'
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: Link pfs-srv3:eth1 up.
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: Managed write_hostcachedata process 1222 exited with return code 0.
>>>> Aug 10 18:04:44 pfs-srv3 heartbeat: [1180]: info: Link pfs-srv4:eth1 up.
>>>> Aug 10 18:04:44 pfs-srv3 heartbeat: [1180]: info: Managed write_hostcachedata process 1223 exited with return code 0.
>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Status update for node pfs-srv4: status up
>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Status update for node pfs-srv4: status active
>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Comm_now_up(): updating status to active
>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Local status now set to: 'active'
>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Managed write_hostcachedata process 1264 exited with return code 0.
>>>> Aug 10 18:04:45 pfs-srv3 harc[1263]: [1271]: info: Running /etc/ha.d//rc.d/status status
>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Managed status process 1263 exited with return code 0.
>>>> Aug 10 18:04:45 pfs-srv3 harc[1276]: [1282]: info: Running /etc/ha.d//rc.d/status status
>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Managed status process 1276 exited with return code 0.
>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Managed write_delcachedata process 1266 exited with return code 0.
>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: other_holds_resources: 0
>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: STATE 1 => 3
>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info: local resource transition completed.
>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (0))
>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info: Initial resource acquisition complete (T_RESOURCES(us))
>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1441]: info: 1 local resources from [/usr/share/heartbeat/ResourceManager listkeys pfs-srv3]
>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1441]: info: Local Resource acquisition completed.
>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1441]: info: FIFO message [type resource] written rc=81
>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info: Managed req_our_resources(ask) process 1441 exited with return code 0.
>>>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: other_holds_resources: 0
>>>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: remote resource transition completed.
>>>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
>>>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: other_holds_resources: 1
>>>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: other_holds_resources: 1
>>>>
>>>> pfs-srv4:
>>>>
>>>> Aug 10 18:04:43 pfs-srv4 logd: [899]: info: logd started with /etc/logd.cf.
>>>> Aug 10 18:04:43 pfs-srv4 logd: [899]: WARN: Core dumps could be lost if multiple dumps occur.
>>>> Aug 10 18:04:43 pfs-srv4 logd: [899]: WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability
>>>> Aug 10 18:04:43 pfs-srv4 logd: [899]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
>>>> Aug 10 18:04:43 pfs-srv4 logd: [899]: info: G_main_add_SignalHandler: Added signal handler for signal 15
>>>> Aug 10 18:04:43 pfs-srv4 logd: [909]: info: G_main_add_SignalHandler: Added signal handler for signal 15
>>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: info: Enabling logging daemon
>>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: info: logfile and debug file are those specified in logd config file (default /etc/logd.cf)
>>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: info: Version 2 support: off
>>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: info: AUTH: i=1: key = 0x9960ac8, auth=0xb7147034, authname=md5
>>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: WARN: Core dumps could be lost if multiple dumps occur.
>>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability
>>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1161]: info: **************************
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1161]: info: Configuration validated. Starting heartbeat 3.0.2
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1161]: info: Heartbeat Hg Version: node: ed844d11ea2b603f7d01cce1700d6c1fcb404d29
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: heartbeat: version 3.0.2
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: Heartbeat generation: 1279723774
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: glib: UDP Broadcast heartbeat started on port 12694 (12694) interface eth1
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: G_main_add_TriggerHandler: Added signal manual handler
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: G_main_add_TriggerHandler: Added signal manual handler
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: G_main_add_SignalHandler: Added signal handler for signal 17
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: Local status now set to: 'up'
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: Link pfs-srv4:eth1 up.
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: Managed write_hostcachedata process 1191 exited with return code 0.
>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Link pfs-srv3:eth1 up.
>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Status update for node pfs-srv3: status up
>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Managed write_hostcachedata process 1193 exited with return code 0.
>>>> Aug 10 18:04:45 pfs-srv4 harc[1192]: [1199]: info: Running /etc/ha.d//rc.d/status status
>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Managed status process 1192 exited with return code 0.
>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Comm_now_up(): updating status to active
>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Local status now set to: 'active'
>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Managed write_hostcachedata process 1204 exited with return code 0.
>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Managed write_delcachedata process 1205 exited with return code 0.
>>>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: Status update for node pfs-srv3: status active
>>>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: AnnounceTakeover(local 0, foreign 1, reason 'HB_R_BOTHSTARTING' (0))
>>>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: STATE 1 => 3
>>>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: STATE 3 => 2
>>>> Aug 10 18:04:46 pfs-srv4 harc[1213]: [1219]: info: Running /etc/ha.d//rc.d/status status
>>>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: Managed status process 1213 exited with return code 0.
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: remote resource transition completed.
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: STATE 2 => 3
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: other_holds_resources: 1
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: remote resource transition completed.
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (0))
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: Initial resource acquisition complete (T_RESOURCES(us))
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(them)' (1))
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: STATE 3 => 4
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: other_holds_resources: 1
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1298]: info: No local resources [/usr/share/heartbeat/ResourceManager listkeys pfs-srv4] to acquire.
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1298]: info: FIFO message [type resource] written rc=81
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: Managed req_our_resources(ask) process 1298 exited with return code 0.
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: other_holds_resources: 1

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
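
[Editor's note: for readers following the timing discussion above, the heartbeat v1 ha.cf directives being debated look roughly like the sketch below. The values are reconstructed from the thread and the logs (initdead 180 and udpport 12694 are stated; the rest are common defaults and are assumptions); Igor's actual file may differ.]

```
# /etc/ha.d/ha.cf -- illustrative sketch only, not the poster's real file
udpport 12694        # matches the port shown in both nodes' logs
bcast eth1           # heartbeats broadcast over the crossover link
keepalive 2          # assumption: typical heartbeat interval
deadtime 30          # assumption: the ~30s deadtime David refers to
initdead 180         # stated in the thread: startup grace period,
                     # replaces the old implicit switch-settling delay
node pfs-srv3
node pfs-srv4
```

The logging side discussed above lives in a separate file, /etc/logd.cf, which (per its documentation) can redirect output via directives such as logfile, debugfile, and logfacility, which is why David asks whether the missing entries went to syslog instead.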
