Hi Lars,

Thank you for advice.
I confirm the details of the problem using tools such as tcpdump again.

Please wait....

Best Regards,
Hideo Yamauchi.


--- On Fri, 2011/7/22, Lars Ellenberg <[email protected]> wrote:

> On Fri, Jul 22, 2011 at 10:18:18AM +0900, [email protected] wrote:
> > Hi Lars,
> > 
> > Thank you for comment.
> > 
> > > You get a *node* active.
> > > Why do you think this is wrong?
> > > Which timing would have been "proper", and why?
> > 
> > When I examined it before, I changed a source and obtained the following 
> > result.
> > 
> > I synchronized at the time of each node and took log.
> > 
> > It is 16:44:41 that srv01 node processed F_STATUS message of active.
> > ----------------------------------------------------------------
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: ###yamauchi 
> > send_cluster_msg() : add_controls ###
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG: Dumping message with 
> > 12 fields
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[0] : [t=status]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[1] : [st=active]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[2] : [dt=5dc0]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[3] : [protocol=1]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[4] : [src=srv01]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[5] : 
> > [(1)srcuuid=0x9f292d8(36 27)]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[6] : [seq=a]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[7] : [hg=4ddb360f]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[8] : [ts=4def2869]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[9] : [ld=0.17 0.07 0.01 
> > 2/73 14132]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[10] : [ttl=3]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[11] : [auth=1 
> > ee7d14643b83b7e49684cf0d679ee7e6a0ea3aaa]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: ###yamauchi 
> > HBDoMsg_T_STATUS RECV : heartbeat_monitor NOCHANGE
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG: Dumping message with 
> > 12 fields
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[0] : [t=status]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[1] : [st=active]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[2] : [dt=5dc0]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[3] : [protocol=1]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[4] : [src=srv01]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[5] : 
> > [(1)srcuuid=0x9f292d8(36 27)]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[6] : [seq=a]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[7] : [hg=4ddb360f]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[8] : [ts=4def2869]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[9] : [ld=0.17 0.07 0.01 
> > 2/73 14132]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[10] : [ttl=3]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[11] : [auth=1 
> > ee7d14643b83b7e49684cf0d679ee7e6a0ea3aaa]
> > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: Local status now set to: 
> > 'active'
> > ----------------------------------------------------------------
> > 
> > But, it is 16:47:04 that srv02 node received F_STATUS message.
> > ----------------------------------------------------------------
> > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: #### yamauchi ##### 
> > T_STATUS
> > Jun  8 16:47:04 srv02 heartbeat: [6690]: info: MSG[10] : [ttl=3]
> > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG: Dumping message 
> > with 12 fields
> > Jun  8 16:47:04 srv02 heartbeat: [6690]: info: MSG[11] : [auth=1 
> > 1fef495857b200940cb7fcb61223c85b299a6a99]
> > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[0] : [t=status]
> > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[1] : [st=active]
> > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[2] : [dt=5dc0]
> > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[3] : [protocol=1]
> > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[4] : [src=srv01]
> > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[5] : 
> > [(1)srcuuid=0x98dcb20(36 27)]
> > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[6] : [seq=a]
> > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[7] : [hg=4ddb360f]
> > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[8] : [ts=4def2869]
> > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[9] : [ld=0.17 0.07 
> > 0.01 2/73 14132]
> > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[10] : [ttl=3]
> > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[11] : [auth=1 
> > ee7d14643b83b7e49684cf0d679ee7e6a0ea3aaa]
> > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: #### yamauchi ##### 
> > node_callback() call
> > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: notice: Status update: Node 
> > srv01 now has status active
> > Jun  8 16:47:05 srv02 lha-snmpagent: [6707]: info: node 1: srv02, type: 
> > normal, status: active
> > ----------------------------------------------------------------
> > 
> > I think that trap of active should be handled earlier.
> > 
> > How do you think?
> 
> To first rule out the obvious:
> Double check the time of your servers.
> NTP enabled, and really synchronized?
> 
> Well, I guess that's what meant with
> > I synchronized at the time of each node and took log.
> 
> So, OK.
> 
> I expect the "trap" to be send when the agent daemon
> processes the message.
> 
> The message should be processed when it is delivered to this daemon --
> unless the daemon was busy doing other stuff...  it is a single threaded
> daemon, after all, maybe it blocked gathering information, sending or
> receiving data?
> 
> It should be delivered "immediately" when the heartbeat core messaging
> layer receives it from the network.
> 
> You need to figure out where the observed delay happens.
> 
> Do a tcpdump, to find out the timing of the messages on the wire.
> Tune up debugging (or add your own) in the message core,
> and add debugging to the hbagent.
> 
> If you can pin down where the delay happens,
> you should be able to find what is causing the delay.
> 
> Is the message delayed (or lost and retransmitted?) on the network?
> Is it delayed in the messaging core?
> Both seem unlikely.
> 
> Or was it dispatched to the hbagent in time,
> but the hbagent did not notice until two and a half minutes later?
> If so, what did the agent do all this time?
> 
> -- 
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
> 
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
> 
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Reply via email to