On Tue, 10 Aug 2010, Igor Chudov wrote:

> On Tue, Aug 10, 2010 at 7:05 PM, David Lang <[email protected]> wrote:
>> On Tue, 10 Aug 2010, Igor Chudov wrote:
>>
>>> On Tue, Aug 10, 2010 at 6:41 PM, David Lang <[email protected]> wrote:
>>>> On Tue, 10 Aug 2010, Igor Chudov wrote:
>>>> As I noted in a prior e-mail, to work around issues where Cisco switches
>>>> won't pass any traffic for 30 seconds after the port becomes live (I
>>>> think they do spanning tree detection), heartbeat sits extra long when
>>>> it first boots and doesn't hear anything, just in case the switch is
>>>> preventing it from seeing another system that's up.
>>>>
>>>
>>> There is a crossover cable directly between their eth1 interfaces.
>>>
>>> Broadcasting happens on eth1 too (per the configs that I posted, I hope
>>> that I am not wrong).
>>
>> the startup delay happens even if you have a crossover cable. The issue
>> is that the systems can't know what the connectivity is, so they play it
>> safe rather than running the risk of causing a split brain due to the
>> switch just not passing the traffic soon enough.
>>
>
> But why don't they pick up when the connection is established?
historically it was good enough to just wait a little bit, and it avoided all
sorts of logic for figuring out if you really saw everyone or not.

>> one thing in your config that I am not familiar with is the line about
>> logd
>>
>> hmm, digging around on the wiki, it looks like there is a separate config
>> file for it (by default /etc/logd.cf). what does it say? (it could be
>> putting the logs we are looking for elsewhere, like in syslog)
>
> Yes, but I think that it is not essential.

could you double-check it? the fact that you are seeing _no_ additional logs
is concerning. I'm wondering if the logs are going elsewhere. Every few hours
heartbeat writes some status messages to its log; if you are seeing _nothing_
after the initial startup, it makes me suspicious that the logs are going
elsewhere.

especially since, when you did the delayed start, we saw the resource
messages for the initial boot (when it did a stop on every resource) before
it ever attempted to go active; the fact that we don't see those messages
now is odd.

David Lang

>> it looks like in newer versions of heartbeat the explicit parameter
>> initdead replaces the delay I was mentioning above. the file you posted
>> has this set for 180 seconds, so after heartbeat starts it should sit
>> for that long before doing anything else.
>
> Right, but nothing good happens in 40 minutes or more.
>
> i
>
>> David Lang
>>
>>> i
>>>
>>>> David Lang
>>>>
>>>>> pfs-srv3:
>>>>>
>>>>> Aug 10 18:04:41 pfs-srv3 logd: [955]: WARN: Core dumps could be lost if multiple dumps occur.
>>>>> Aug 10 18:04:41 pfs-srv3 logd: [955]: WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability
>>>>> Aug 10 18:04:41 pfs-srv3 logd: [955]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
>>>>> Aug 10 18:04:41 pfs-srv3 logd: [955]: info: G_main_add_SignalHandler: Added signal handler for signal 15
>>>>> Aug 10 18:04:41 pfs-srv3 logd: [986]: info: G_main_add_SignalHandler: Added signal handler for signal 15
>>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: Enabling logging daemon
>>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: logfile and debug file are those specified in logd config file (default /etc/logd.cf)
>>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: Version 2 support: off
>>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: AUTH: i=1: key = 0x88e6b30, auth=0xb7200034, authname=md5
>>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: WARN: Core dumps could be lost if multiple dumps occur.
>>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability
>>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
>>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: **************************
>>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: Configuration validated. Starting heartbeat 3.0.2
>>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: Heartbeat Hg Version: node: ed844d11ea2b603f7d01cce1700d6c1fcb404d29
>>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: heartbeat: version 3.0.2
>>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: Heartbeat generation: 1279723767
>>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: glib: UDP Broadcast heartbeat started on port 12694 (12694) interface eth1
>>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
>>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: G_main_add_TriggerHandler: Added signal manual handler
>>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: G_main_add_TriggerHandler: Added signal manual handler
>>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: G_main_add_SignalHandler: Added signal handler for signal 17
>>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: Local status now set to: 'up'
>>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: Link pfs-srv3:eth1 up.
>>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: Managed write_hostcachedata process 1222 exited with return code 0.
>>>>> Aug 10 18:04:44 pfs-srv3 heartbeat: [1180]: info: Link pfs-srv4:eth1 up.
>>>>> Aug 10 18:04:44 pfs-srv3 heartbeat: [1180]: info: Managed write_hostcachedata process 1223 exited with return code 0.
>>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Status update for node pfs-srv4: status up
>>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Status update for node pfs-srv4: status active
>>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Comm_now_up(): updating status to active
>>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Local status now set to: 'active'
>>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Managed write_hostcachedata process 1264 exited with return code 0.
>>>>> Aug 10 18:04:45 pfs-srv3 harc[1263]: [1271]: info: Running /etc/ha.d//rc.d/status status
>>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Managed status process 1263 exited with return code 0.
>>>>> Aug 10 18:04:45 pfs-srv3 harc[1276]: [1282]: info: Running /etc/ha.d//rc.d/status status
>>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Managed status process 1276 exited with return code 0.
>>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Managed write_delcachedata process 1266 exited with return code 0.
>>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: other_holds_resources: 0
>>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: STATE 1 => 3
>>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info: local resource transition completed.
>>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (0))
>>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info: Initial resource acquisition complete (T_RESOURCES(us))
>>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1441]: info: 1 local resources from [/usr/share/heartbeat/ResourceManager listkeys pfs-srv3]
>>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1441]: info: Local Resource acquisition completed.
>>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1441]: info: FIFO message [type resource] written rc=81
>>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
>>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info: Managed req_our_resources(ask) process 1441 exited with return code 0.
>>>>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: other_holds_resources: 0
>>>>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: remote resource transition completed.
>>>>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
>>>>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: other_holds_resources: 1
>>>>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: other_holds_resources: 1
>>>>>
>>>>>
>>>>> pfs-srv4:
>>>>>
>>>>> Aug 10 18:04:43 pfs-srv4 logd: [899]: info: logd started with /etc/logd.cf.
>>>>> Aug 10 18:04:43 pfs-srv4 logd: [899]: WARN: Core dumps could be lost if multiple dumps occur.
>>>>> Aug 10 18:04:43 pfs-srv4 logd: [899]: WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability
>>>>> Aug 10 18:04:43 pfs-srv4 logd: [899]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
>>>>> Aug 10 18:04:43 pfs-srv4 logd: [899]: info: G_main_add_SignalHandler: Added signal handler for signal 15
>>>>> Aug 10 18:04:43 pfs-srv4 logd: [909]: info: G_main_add_SignalHandler: Added signal handler for signal 15
>>>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: info: Enabling logging daemon
>>>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: info: logfile and debug file are those specified in logd config file (default /etc/logd.cf)
>>>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: info: Version 2 support: off
>>>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: info: AUTH: i=1: key = 0x9960ac8, auth=0xb7147034, authname=md5
>>>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: WARN: Core dumps could be lost if multiple dumps occur.
>>>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability
>>>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
>>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1161]: info: **************************
>>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1161]: info: Configuration validated. Starting heartbeat 3.0.2
>>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1161]: info: Heartbeat Hg Version: node: ed844d11ea2b603f7d01cce1700d6c1fcb404d29
>>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: heartbeat: version 3.0.2
>>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: Heartbeat generation: 1279723774
>>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: glib: UDP Broadcast heartbeat started on port 12694 (12694) interface eth1
>>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
>>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: G_main_add_TriggerHandler: Added signal manual handler
>>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: G_main_add_TriggerHandler: Added signal manual handler
>>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: G_main_add_SignalHandler: Added signal handler for signal 17
>>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: Local status now set to: 'up'
>>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: Link pfs-srv4:eth1 up.
>>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: Managed write_hostcachedata process 1191 exited with return code 0.
>>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Link pfs-srv3:eth1 up.
>>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Status update for node pfs-srv3: status up
>>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Managed write_hostcachedata process 1193 exited with return code 0.
>>>>> Aug 10 18:04:45 pfs-srv4 harc[1192]: [1199]: info: Running /etc/ha.d//rc.d/status status
>>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Managed status process 1192 exited with return code 0.
>>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Comm_now_up(): updating status to active
>>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Local status now set to: 'active'
>>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Managed write_hostcachedata process 1204 exited with return code 0.
>>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Managed write_delcachedata process 1205 exited with return code 0.
>>>>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: Status update for node pfs-srv3: status active
>>>>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: AnnounceTakeover(local 0, foreign 1, reason 'HB_R_BOTHSTARTING' (0))
>>>>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: STATE 1 => 3
>>>>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: STATE 3 => 2
>>>>> Aug 10 18:04:46 pfs-srv4 harc[1213]: [1219]: info: Running /etc/ha.d//rc.d/status status
>>>>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: Managed status process 1213 exited with return code 0.
>>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: remote resource transition completed.
>>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: STATE 2 => 3
>>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: other_holds_resources: 1
>>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: remote resource transition completed.
>>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (0))
>>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: Initial resource acquisition complete (T_RESOURCES(us))
>>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(them)' (1))
>>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: STATE 3 => 4
>>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: other_holds_resources: 1
>>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1298]: info: No local resources [/usr/share/heartbeat/ResourceManager listkeys pfs-srv4] to acquire.
>>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1298]: info: FIFO message [type resource] written rc=81
>>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
>>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: Managed req_our_resources(ask) process 1298 exited with return code 0.
>>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: other_holds_resources: 1

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
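[Editor's note: the double-check of /etc/logd.cf suggested in the thread could look something like the sketch below. The sample file and its values are hypothetical, only there to make the snippet runnable; on the real nodes you would point it at /etc/logd.cf. `logfile`, `debugfile`, and `logfacility` are standard logd.cf directives; if `logfacility` is set, output may land in syslog rather than in the named files.]

```shell
# Sketch: find out where logd is actually sending heartbeat's log output.
# On the real hosts, set LOGD_CF=/etc/logd.cf; the sample content here
# is an assumption for illustration.
LOGD_CF=/tmp/logd.cf.sample
cat > "$LOGD_CF" <<'EOF'
debugfile /var/log/ha-debug
logfile /var/log/ha-log
EOF

# Show the directives that control the log destination. If logfacility
# appears, also check syslog (e.g. /var/log/messages) for heartbeat lines.
grep -E '^(logfile|debugfile|logfacility)' "$LOGD_CF"
```

If this prints a `logfile` path, that file is where the periodic status messages should be appearing every few hours.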

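[Editor's note: for reference, the startup wait discussed in the thread is controlled by heartbeat's ha.cf timing directives. A minimal illustrative fragment follows; `initdead 180`, port 12694, interface eth1, and the node names come from the thread and its logs, while the `keepalive` and `deadtime` values are assumed for illustration.]

```
keepalive 2      # seconds between heartbeat packets (assumed value)
deadtime 30      # declare a peer dead after this long (assumed value)
initdead 180     # extra-long wait at first boot, as discussed above
udpport 12694    # matches the port shown in the logs
bcast eth1       # broadcast over the crossover link
node pfs-srv3
node pfs-srv4
```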