On Tue, 10 Aug 2010, Igor Chudov wrote:

> On Tue, Aug 10, 2010 at 6:41 PM, David Lang
> <[email protected]> wrote:
>> On Tue, 10 Aug 2010, Igor Chudov wrote:
>>
>>> Guys, I have a bit of clarification. In an attempt to avoid the timing
>>> issues, an hour ago I tried adding a configuration change to
>>> /etc/init.d/heartbeat to delay starting it by 2 minutes on one box. So
>>> logs with takeover succeeding, and heartbeat shutting down are partly
>>> an artifact of this change, as things never worked like that before.
>>> You saw this and noticed that it was different from before.
>>>
>>> I took that out and I am back to the exact situation I always was in
>>> (no one takes over). Logs are at the bottom. What I do know from this
>>> experiment, is that resource acquisition itself is unlikely to blame.
>>>
>>> What I see now is back to what I saw yesterday and prior, and it makes
>>> no sense to me.
>>
>> nothing else shows up in the logs? I would expect the boxes to sit like this
>> for 40 seconds or so (2x the deadtime setting IIRC, but it could be 30 sec +
>> deadtime) and then there would be additional log entries.
>>
>
> I just checked, the machines were up since I sent the previous email
> (42 minutes), nothing new was added to log files.

but if you stop heartbeat on either box, the other becomes active?

>> As I noted in a prior e-mail, to work around issues where Cisco switches
>> won't pass any traffic for 30 seconds after the port becomes live (I think
>> they do spanning tree detection), heartbeat sits extra long when it first
>> boots and doesn't hear anything, just in case the switch is preventing it
>> from seeing another system that's up.
>>
>
> There is a crossover cable directly between their eth1 interfaces.
>
> Broadcasting happens on eth1 too (per configs that I posted, I hope
> that I am not wrong).

the startup delay happens even if you have a crossover cable. The issue is that 
the systems can't know what the connectivity is, so they play it safe rather 
than running the risk of causing a split brain due to the switch just not 
passing the traffic soon enough.


one thing in your config that I am not familiar with is the line about logd

hmm, digging around on the wiki, it looks like there is a separate config file 
for it (by default /etc/logd.cf). What does it say? (it could be putting the 
logs we are looking for elsewhere, like in syslog)

it looks like in newer versions of heartbeat the explicit parameter initdead 
replaces the delay I was mentioning above. The file you posted has this set to 
180 seconds, so after heartbeat starts it should sit for that long before doing 
anything else.

David Lang

>
>> David Lang
>>
>>> pfs-srv3:
>>>
>>>
>>> Aug 10 18:04:41 pfs-srv3 logd: [955]: WARN: Core dumps could be lost
>>> if multiple dumps occur.
>>> Aug 10 18:04:41 pfs-srv3 logd: [955]: WARN: Consider setting
>>> non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
>>> maximum supportability
>>> Aug 10 18:04:41 pfs-srv3 logd: [955]: WARN: Consider setting
>>> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
>>> supportability
>>> Aug 10 18:04:41 pfs-srv3 logd: [955]: info: G_main_add_SignalHandler:
>>> Added signal handler for signal 15
>>> Aug 10 18:04:41 pfs-srv3 logd: [986]: info: G_main_add_SignalHandler:
>>> Added signal handler for signal 15
>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: Enabling logging daemon
>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: logfile and debug
>>> file are those specified in logd config file (default /etc/logd.cf)
>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: Version 2 support: off
>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: AUTH: i=1: key =
>>> 0x88e6b30, auth=0xb7200034, authname=md5
>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: WARN: Core dumps could be
>>> lost if multiple dumps occur.
>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: WARN: Consider setting
>>> non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
>>> maximum supportability
>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: WARN: Consider setting
>>> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
>>> supportability
>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: **************************
>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: Configuration
>>> validated. Starting heartbeat 3.0.2
>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: Heartbeat Hg
>>> Version: node: ed844d11ea2b603f7d01cce1700d6c1fcb404d29
>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: heartbeat: version 3.0.2
>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: Heartbeat
>>> generation: 1279723767
>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: glib: UDP Broadcast
>>> heartbeat started on port 12694 (12694) interface eth1
>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: glib: UDP Broadcast
>>> heartbeat closed on port 12694 interface eth1 - Status: 1
>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info:
>>> G_main_add_TriggerHandler: Added signal manual handler
>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info:
>>> G_main_add_TriggerHandler: Added signal manual handler
>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info:
>>> G_main_add_SignalHandler: Added signal handler for signal 17
>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: Local status now set to: 
>>> 'up'
>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: Link pfs-srv3:eth1 up.
>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: Managed
>>> write_hostcachedata process 1222 exited with return code 0.
>>> Aug 10 18:04:44 pfs-srv3 heartbeat: [1180]: info: Link pfs-srv4:eth1 up.
>>> Aug 10 18:04:44 pfs-srv3 heartbeat: [1180]: info: Managed
>>> write_hostcachedata process 1223 exited with return code 0.
>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Status update for
>>> node pfs-srv4: status up
>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Status update for
>>> node pfs-srv4: status active
>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Comm_now_up():
>>> updating status to active
>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Local status now set
>>> to: 'active'
>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Managed
>>> write_hostcachedata process 1264 exited with return code 0.
>>> Aug 10 18:04:45 pfs-srv3 harc[1263]: [1271]: info: Running
>>> /etc/ha.d//rc.d/status status
>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Managed status
>>> process 1263 exited with return code 0.
>>> Aug 10 18:04:45 pfs-srv3 harc[1276]: [1282]: info: Running
>>> /etc/ha.d//rc.d/status status
>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Managed status
>>> process 1276 exited with return code 0.
>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Managed
>>> write_delcachedata process 1266 exited with return code 0.
>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: other_holds_resources: 0
>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: STATE 1 => 3
>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info: local resource
>>> transition completed.
>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info:
>>> AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (0))
>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info: Initial resource
>>> acquisition complete (T_RESOURCES(us))
>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1441]: info: 1 local resources
>>> from [/usr/share/heartbeat/ResourceManager listkeys pfs-srv3]
>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1441]: info: Local Resource
>>> acquisition completed.
>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1441]: info: FIFO message [type
>>> resource] written rc=81
>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info:
>>> AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info: Managed
>>> req_our_resources(ask) process 1441 exited with return code 0.
>>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: other_holds_resources: 0
>>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: remote resource
>>> transition completed.
>>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info:
>>> AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
>>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: other_holds_resources: 1
>>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: other_holds_resources: 1
>>>
>>>
>>> pfs-srv4:
>>>
>>>
>>> Aug 10 18:04:43 pfs-srv4 logd: [899]: info: logd started with /etc/logd.cf.
>>> Aug 10 18:04:43 pfs-srv4 logd: [899]: WARN: Core dumps could be lost
>>> if multiple dumps occur.
>>> Aug 10 18:04:43 pfs-srv4 logd: [899]: WARN: Consider setting
>>> non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
>>> maximum supportability
>>> Aug 10 18:04:43 pfs-srv4 logd: [899]: WARN: Consider setting
>>> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
>>> supportability
>>> Aug 10 18:04:43 pfs-srv4 logd: [899]: info: G_main_add_SignalHandler:
>>> Added signal handler for signal 15
>>> Aug 10 18:04:43 pfs-srv4 logd: [909]: info: G_main_add_SignalHandler:
>>> Added signal handler for signal 15
>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: info: Enabling logging daemon
>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: info: logfile and debug
>>> file are those specified in logd config file (default /etc/logd.cf)
>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: info: Version 2 support: off
>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: info: AUTH: i=1: key =
>>> 0x9960ac8, auth=0xb7147034, authname=md5
>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: WARN: Core dumps could be
>>> lost if multiple dumps occur.
>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: WARN: Consider setting
>>> non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
>>> maximum supportability
>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: WARN: Consider setting
>>> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
>>> supportability
>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1161]: info: **************************
>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1161]: info: Configuration
>>> validated. Starting heartbeat 3.0.2
>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1161]: info: Heartbeat Hg
>>> Version: node: ed844d11ea2b603f7d01cce1700d6c1fcb404d29
>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: heartbeat: version 3.0.2
>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: Heartbeat
>>> generation: 1279723774
>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: glib: UDP Broadcast
>>> heartbeat started on port 12694 (12694) interface eth1
>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: glib: UDP Broadcast
>>> heartbeat closed on port 12694 interface eth1 - Status: 1
>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info:
>>> G_main_add_TriggerHandler: Added signal manual handler
>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info:
>>> G_main_add_TriggerHandler: Added signal manual handler
>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info:
>>> G_main_add_SignalHandler: Added signal handler for signal 17
>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: Local status now set to: 
>>> 'up'
>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: Link pfs-srv4:eth1 up.
>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: Managed
>>> write_hostcachedata process 1191 exited with return code 0.
>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Link pfs-srv3:eth1 up.
>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Status update for
>>> node pfs-srv3: status up
>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Managed
>>> write_hostcachedata process 1193 exited with return code 0.
>>> Aug 10 18:04:45 pfs-srv4 harc[1192]: [1199]: info: Running
>>> /etc/ha.d//rc.d/status status
>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Managed status
>>> process 1192 exited with return code 0.
>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Comm_now_up():
>>> updating status to active
>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Local status now set
>>> to: 'active'
>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Managed
>>> write_hostcachedata process 1204 exited with return code 0.
>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Managed
>>> write_delcachedata process 1205 exited with return code 0.
>>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: Status update for
>>> node pfs-srv3: status active
>>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info:
>>> AnnounceTakeover(local 0, foreign 1, reason 'HB_R_BOTHSTARTING' (0))
>>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: STATE 1 => 3
>>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: STATE 3 => 2
>>> Aug 10 18:04:46 pfs-srv4 harc[1213]: [1219]: info: Running
>>> /etc/ha.d//rc.d/status status
>>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: Managed status
>>> process 1213 exited with return code 0.
>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: remote resource
>>> transition completed.
>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: STATE 2 => 3
>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: other_holds_resources: 1
>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: remote resource
>>> transition completed.
>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info:
>>> AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (0))
>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: Initial resource
>>> acquisition complete (T_RESOURCES(us))
>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info:
>>> AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(them)' (1))
>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: STATE 3 => 4
>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: other_holds_resources: 1
>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1298]: info: No local resources
>>> [/usr/share/heartbeat/ResourceManager listkeys pfs-srv4] to acquire.
>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1298]: info: FIFO message [type
>>> resource] written rc=81
>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info:
>>> AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: Managed
>>> req_our_resources(ask) process 1298 exited with return code 0.
>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: other_holds_resources: 1
>>> _______________________________________________
>>> Linux-HA mailing list
>>> [email protected]
>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>> See also: http://linux-ha.org/ReportingProblems
>>>
