On Fri, Jul 22, 2011 at 10:18:18AM +0900, [email protected] wrote:
> Hi Lars,
> 
> Thank you for comment.
> 
> > You get a *node* active.
> > Why do you think this is wrong?
> > Which timing would have been "proper", and why?
> 
> When I examined it before, I changed a source and obtained the following 
> result.
> 
> I synchronized at the time of each node and took log.
> 
> It is 16:44:41 that srv01 node processed F_STATUS message of active.
> ----------------------------------------------------------------
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: ###yamauchi 
> send_cluster_msg() : add_controls ###
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG: Dumping message with 12 
> fields
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[0] : [t=status]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[1] : [st=active]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[2] : [dt=5dc0]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[3] : [protocol=1]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[4] : [src=srv01]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[5] : 
> [(1)srcuuid=0x9f292d8(36 27)]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[6] : [seq=a]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[7] : [hg=4ddb360f]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[8] : [ts=4def2869]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[9] : [ld=0.17 0.07 0.01 
> 2/73 14132]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[10] : [ttl=3]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[11] : [auth=1 
> ee7d14643b83b7e49684cf0d679ee7e6a0ea3aaa]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: ###yamauchi HBDoMsg_T_STATUS 
> RECV : heartbeat_monitor NOCHANGE
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG: Dumping message with 12 
> fields
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[0] : [t=status]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[1] : [st=active]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[2] : [dt=5dc0]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[3] : [protocol=1]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[4] : [src=srv01]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[5] : 
> [(1)srcuuid=0x9f292d8(36 27)]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[6] : [seq=a]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[7] : [hg=4ddb360f]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[8] : [ts=4def2869]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[9] : [ld=0.17 0.07 0.01 
> 2/73 14132]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[10] : [ttl=3]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[11] : [auth=1 
> ee7d14643b83b7e49684cf0d679ee7e6a0ea3aaa]
> Jun  8 16:44:41 srv01 heartbeat: [14110]: info: Local status now set to: 
> 'active'
> ----------------------------------------------------------------
> 
> But, it is 16:47:04 that srv02 node received F_STATUS message.
> ----------------------------------------------------------------
> Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: #### yamauchi ##### 
> T_STATUS
> Jun  8 16:47:04 srv02 heartbeat: [6690]: info: MSG[10] : [ttl=3]
> Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG: Dumping message with 
> 12 fields
> Jun  8 16:47:04 srv02 heartbeat: [6690]: info: MSG[11] : [auth=1 
> 1fef495857b200940cb7fcb61223c85b299a6a99]
> Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[0] : [t=status]
> Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[1] : [st=active]
> Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[2] : [dt=5dc0]
> Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[3] : [protocol=1]
> Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[4] : [src=srv01]
> Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[5] : 
> [(1)srcuuid=0x98dcb20(36 27)]
> Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[6] : [seq=a]
> Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[7] : [hg=4ddb360f]
> Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[8] : [ts=4def2869]
> Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[9] : [ld=0.17 0.07 
> 0.01 2/73 14132]
> Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[10] : [ttl=3]
> Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[11] : [auth=1 
> ee7d14643b83b7e49684cf0d679ee7e6a0ea3aaa]
> Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: #### yamauchi ##### 
> node_callback() call
> Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: notice: Status update: Node 
> srv01 now has status active
> Jun  8 16:47:05 srv02 lha-snmpagent: [6707]: info: node 1: srv02, type: 
> normal, status: active
> ----------------------------------------------------------------
> 
> I think that trap of active should be handled earlier.
> 
> How do you think?

To first rule out the obvious:
Double check the time of your servers.
NTP enabled, and really synchronized?

Well, I guess that's what meant with
> I synchronized at the time of each node and took log.

So, OK.

I expect the "trap" to be send when the agent daemon
processes the message.

The message should be processed when it is delivered to this daemon --
unless the daemon was busy doing other stuff...  it is a single threaded
daemon, after all, maybe it blocked gathering information, sending or
receiving data?

It should be delivered "immediately" when the heartbeat core messaging
layer receives it from the network.

You need to figure out where the observed delay happens.

Do a tcpdump, to find out the timing of the messages on the wire.
Tune up debugging (or add your own) in the message core,
and add debugging to the hbagent.

If you can pin down where the delay happens,
you should be able to find what is causing the delay.

Is the message delayed (or lost and retransmitted?) on the network?
Is it delayed in the messaging core?
Both seem unlikely.

Or was it dispatched to the hbagent in time,
but the hbagent did not notice until two and a half minutes later?
If so, what did the agent do all this time?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Reply via email to