On Tue, Aug 10, 2010 at 7:05 PM, David Lang
<[email protected]> wrote:
> On Tue, 10 Aug 2010, Igor Chudov wrote:
>
>> On Tue, Aug 10, 2010 at 6:41 PM, David Lang
>> <[email protected]> wrote:
>>> On Tue, 10 Aug 2010, Igor Chudov wrote:
>>>
>>>> Guys, I have a bit of clarification. In an attempt to avoid the timing
>>>> issues, an hour ago I tried adding a configuration change to
>>>> /etc/init.d/heartbeat to delay starting it by 2 minutes on one box. So
>>>> the logs showing takeover succeeding and heartbeat shutting down are
>>>> partly an artifact of this change, as things never worked like that
>>>> before. You saw this and noticed that it was different from before.
>>>>
>>>> I took that out and I am back to the exact situation I always was in
>>>> (no one takes over). Logs are at the bottom. What I do know from this
>>>> experiment is that resource acquisition itself is unlikely to be to
>>>> blame.
>>>>
>>>> What I see now is back to what I saw yesterday and prior, and makes no
>>>> sense to me.
>>>
>>> nothing else shows up in the logs? I would expect the boxes to sit like this
>>> for 40 seconds or so (2x deadtime setting IIRC, but it could be 30 sec +
>>> deadtime) and then there would be additional log entries.
>>>
>>
>> I just checked: the machines have been up since I sent the previous email
>> (42 minutes), and nothing new was added to the log files.
>
> but if you stop heartbeat on either box, the other becomes active?


Correct.


>>> As I noted in a prior e-mail, to work around issues where Cisco switches
>>> won't pass any traffic for 30 seconds after the port becomes live (I think
>>> they do spanning tree detection) heartbeat sits extra long when it first
>>> boots and doesn't hear anything, just in case the switch is preventing it
>>> from seeing another system that's up.
>>>
>>
>> There is a crossover cable directly between their eth1 interfaces.
>>
>> Broadcasting happens on eth1 too (per configs that I posted, I hope
>> that I am not wrong).
>
> the startup delay happens even if you have a crossover cable. The issue is
> that the systems can't know what the connectivity is, so they play it safe
> rather than running the risk of causing a split brain due to the switch just
> not passing the traffic soon enough.
>

But why don't they pick up when the connection is established?

> one thing in your config that I am not familiar with is the line about logd
>
> hmm, digging around on the wiki, it looks like there is a separate config file
> for it (by default /etc/logd.cf). what does it say? (it could be putting the
> logs we are looking for elsewhere, like in syslog)

Yes, but I think that it is not essential.

> it looks like in newer versions of heartbeat the explicit parameter initdead
> replaces the delay I was mentioning above. the file you posted has this set
> for 180 seconds, so after heartbeat starts it should sit for that long before
> doing anything else.

Right, but nothing good happens in 40 minutes or more.
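
For comparing the two nodes, the STATE transitions in the quoted logs
further down can be pulled out with a few lines of Python. This is only a
rough sketch: the function name state_transitions and the regular
expression are my own assumptions, built from the log lines in this thread
rather than from any documented heartbeat log format, and it only matches
entries that sit on a single line.

```python
import re

# Assumed line layout, taken from the log excerpts in this thread:
#   Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: STATE 1 => 3
STATE_RE = re.compile(
    r"^(?:>+\s*)?"                               # optional mail quote markers
    r"(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+"   # syslog timestamp
    r"(?P<host>\S+)\s+heartbeat:\s+\[\d+\]:\s+info:\s+"
    r"STATE\s+(?P<old>\d+)\s+=>\s+(?P<new>\d+)"
)


def state_transitions(lines):
    """Return (timestamp, host, old_state, new_state) for each STATE line."""
    found = []
    for line in lines:
        m = STATE_RE.match(line)
        if m:
            found.append((m["ts"], m["host"], int(m["old"]), int(m["new"])))
    return found
```

Feeding it both log files and sorting the result by timestamp would put
the transitions of pfs-srv3 and pfs-srv4 on one timeline.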

i

> David Lang
>
>> i
>>
>>> David Lang
>>>
>>>> pfs-srv3:
>>>>
>>>>
>>>> Aug 10 18:04:41 pfs-srv3 logd: [955]: WARN: Core dumps could be lost
>>>> if multiple dumps occur.
>>>> Aug 10 18:04:41 pfs-srv3 logd: [955]: WARN: Consider setting
>>>> non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
>>>> maximum supportability
>>>> Aug 10 18:04:41 pfs-srv3 logd: [955]: WARN: Consider setting
>>>> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
>>>> supportability
>>>> Aug 10 18:04:41 pfs-srv3 logd: [955]: info: G_main_add_SignalHandler:
>>>> Added signal handler for signal 15
>>>> Aug 10 18:04:41 pfs-srv3 logd: [986]: info: G_main_add_SignalHandler:
>>>> Added signal handler for signal 15
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: Enabling logging daemon
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: logfile and debug
>>>> file are those specified in logd config file (default /etc/logd.cf)
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: Version 2 support: off
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: AUTH: i=1: key =
>>>> 0x88e6b30, auth=0xb7200034, authname=md5
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: WARN: Core dumps could be
>>>> lost if multiple dumps occur.
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: WARN: Consider setting
>>>> non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
>>>> maximum supportability
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: WARN: Consider setting
>>>> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
>>>> supportability
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: **************************
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: Configuration
>>>> validated. Starting heartbeat 3.0.2
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1179]: info: Heartbeat Hg
>>>> Version: node: ed844d11ea2b603f7d01cce1700d6c1fcb404d29
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: heartbeat: version 3.0.2
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: Heartbeat
>>>> generation: 1279723767
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: glib: UDP Broadcast
>>>> heartbeat started on port 12694 (12694) interface eth1
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: glib: UDP Broadcast
>>>> heartbeat closed on port 12694 interface eth1 - Status: 1
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info:
>>>> G_main_add_TriggerHandler: Added signal manual handler
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info:
>>>> G_main_add_TriggerHandler: Added signal manual handler
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info:
>>>> G_main_add_SignalHandler: Added signal handler for signal 17
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: Local status now set
>>>> to: 'up'
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: Link pfs-srv3:eth1 up.
>>>> Aug 10 18:04:43 pfs-srv3 heartbeat: [1180]: info: Managed
>>>> write_hostcachedata process 1222 exited with return code 0.
>>>> Aug 10 18:04:44 pfs-srv3 heartbeat: [1180]: info: Link pfs-srv4:eth1 up.
>>>> Aug 10 18:04:44 pfs-srv3 heartbeat: [1180]: info: Managed
>>>> write_hostcachedata process 1223 exited with return code 0.
>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Status update for
>>>> node pfs-srv4: status up
>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Status update for
>>>> node pfs-srv4: status active
>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Comm_now_up():
>>>> updating status to active
>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Local status now set
>>>> to: 'active'
>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Managed
>>>> write_hostcachedata process 1264 exited with return code 0.
>>>> Aug 10 18:04:45 pfs-srv3 harc[1263]: [1271]: info: Running
>>>> /etc/ha.d//rc.d/status status
>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Managed status
>>>> process 1263 exited with return code 0.
>>>> Aug 10 18:04:45 pfs-srv3 harc[1276]: [1282]: info: Running
>>>> /etc/ha.d//rc.d/status status
>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Managed status
>>>> process 1276 exited with return code 0.
>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: Managed
>>>> write_delcachedata process 1266 exited with return code 0.
>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: other_holds_resources: 0
>>>> Aug 10 18:04:45 pfs-srv3 heartbeat: [1180]: info: STATE 1 => 3
>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info: local resource
>>>> transition completed.
>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info:
>>>> AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (0))
>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info: Initial resource
>>>> acquisition complete (T_RESOURCES(us))
>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1441]: info: 1 local resources
>>>> from [/usr/share/heartbeat/ResourceManager listkeys pfs-srv3]
>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1441]: info: Local Resource
>>>> acquisition completed.
>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1441]: info: FIFO message [type
>>>> resource] written rc=81
>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info:
>>>> AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
>>>> Aug 10 18:04:55 pfs-srv3 heartbeat: [1180]: info: Managed
>>>> req_our_resources(ask) process 1441 exited with return code 0.
>>>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: other_holds_resources: 0
>>>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: remote resource
>>>> transition completed.
>>>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info:
>>>> AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
>>>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: other_holds_resources: 1
>>>> Aug 10 18:04:56 pfs-srv3 heartbeat: [1180]: info: other_holds_resources: 1
>>>>
>>>>
>>>> pfs-srv4:
>>>>
>>>>
>>>> Aug 10 18:04:43 pfs-srv4 logd: [899]: info: logd started with /etc/logd.cf.
>>>> Aug 10 18:04:43 pfs-srv4 logd: [899]: WARN: Core dumps could be lost
>>>> if multiple dumps occur.
>>>> Aug 10 18:04:43 pfs-srv4 logd: [899]: WARN: Consider setting
>>>> non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
>>>> maximum supportability
>>>> Aug 10 18:04:43 pfs-srv4 logd: [899]: WARN: Consider setting
>>>> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
>>>> supportability
>>>> Aug 10 18:04:43 pfs-srv4 logd: [899]: info: G_main_add_SignalHandler:
>>>> Added signal handler for signal 15
>>>> Aug 10 18:04:43 pfs-srv4 logd: [909]: info: G_main_add_SignalHandler:
>>>> Added signal handler for signal 15
>>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: info: Enabling logging daemon
>>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: info: logfile and debug
>>>> file are those specified in logd config file (default /etc/logd.cf)
>>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: info: Version 2 support: off
>>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: info: AUTH: i=1: key =
>>>> 0x9960ac8, auth=0xb7147034, authname=md5
>>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: WARN: Core dumps could be
>>>> lost if multiple dumps occur.
>>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: WARN: Consider setting
>>>> non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
>>>> maximum supportability
>>>> Aug 10 18:04:43 pfs-srv4 heartbeat: [1161]: WARN: Consider setting
>>>> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
>>>> supportability
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1161]: info: **************************
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1161]: info: Configuration
>>>> validated. Starting heartbeat 3.0.2
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1161]: info: Heartbeat Hg
>>>> Version: node: ed844d11ea2b603f7d01cce1700d6c1fcb404d29
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: heartbeat: version 3.0.2
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: Heartbeat
>>>> generation: 1279723774
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: glib: UDP Broadcast
>>>> heartbeat started on port 12694 (12694) interface eth1
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: glib: UDP Broadcast
>>>> heartbeat closed on port 12694 interface eth1 - Status: 1
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info:
>>>> G_main_add_TriggerHandler: Added signal manual handler
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info:
>>>> G_main_add_TriggerHandler: Added signal manual handler
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info:
>>>> G_main_add_SignalHandler: Added signal handler for signal 17
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: Local status now set
>>>> to: 'up'
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: Link pfs-srv4:eth1 up.
>>>> Aug 10 18:04:44 pfs-srv4 heartbeat: [1162]: info: Managed
>>>> write_hostcachedata process 1191 exited with return code 0.
>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Link pfs-srv3:eth1 up.
>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Status update for
>>>> node pfs-srv3: status up
>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Managed
>>>> write_hostcachedata process 1193 exited with return code 0.
>>>> Aug 10 18:04:45 pfs-srv4 harc[1192]: [1199]: info: Running
>>>> /etc/ha.d//rc.d/status status
>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Managed status
>>>> process 1192 exited with return code 0.
>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Comm_now_up():
>>>> updating status to active
>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Local status now set
>>>> to: 'active'
>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Managed
>>>> write_hostcachedata process 1204 exited with return code 0.
>>>> Aug 10 18:04:45 pfs-srv4 heartbeat: [1162]: info: Managed
>>>> write_delcachedata process 1205 exited with return code 0.
>>>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: Status update for
>>>> node pfs-srv3: status active
>>>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info:
>>>> AnnounceTakeover(local 0, foreign 1, reason 'HB_R_BOTHSTARTING' (0))
>>>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: STATE 1 => 3
>>>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: STATE 3 => 2
>>>> Aug 10 18:04:46 pfs-srv4 harc[1213]: [1219]: info: Running
>>>> /etc/ha.d//rc.d/status status
>>>> Aug 10 18:04:46 pfs-srv4 heartbeat: [1162]: info: Managed status
>>>> process 1213 exited with return code 0.
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: remote resource
>>>> transition completed.
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: STATE 2 => 3
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: other_holds_resources: 1
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: remote resource
>>>> transition completed.
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info:
>>>> AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (0))
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: Initial resource
>>>> acquisition complete (T_RESOURCES(us))
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info:
>>>> AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(them)' (1))
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: STATE 3 => 4
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: other_holds_resources: 1
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1298]: info: No local resources
>>>> [/usr/share/heartbeat/ResourceManager listkeys pfs-srv4] to acquire.
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1298]: info: FIFO message [type
>>>> resource] written rc=81
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info:
>>>> AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: Managed
>>>> req_our_resources(ask) process 1298 exited with return code 0.
>>>> Aug 10 18:04:56 pfs-srv4 heartbeat: [1162]: info: other_holds_resources: 1
>>>> _______________________________________________
>>>> Linux-HA mailing list
>>>> [email protected]
>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>>> See also: http://linux-ha.org/ReportingProblems
>>>>
