Hi Lars, Thank you for advice. I confirm the details of the problem using tools such as tcpdump again.
Please wait.... Best Regards, Hideo Yamauchi. --- On Fri, 2011/7/22, Lars Ellenberg <[email protected]> wrote: > On Fri, Jul 22, 2011 at 10:18:18AM +0900, [email protected] wrote: > > Hi Lars, > > > > Thank you for comment. > > > > > You get a *node* active. > > > Why do you think this is wrong? > > > Which timing would have been "proper", and why? > > > > When I examined it before, I changed a source and obtained the following > > result. > > > > I synchronized at the time of each node and took log. > > > > It is 16:44:41 that srv01 node processed F_STATUS message of active. > > ---------------------------------------------------------------- > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: ###yamauchi > > send_cluster_msg() : add_controls ### > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG: Dumping message with > > 12 fields > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[0] : [t=status] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[1] : [st=active] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[2] : [dt=5dc0] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[3] : [protocol=1] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[4] : [src=srv01] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[5] : > > [(1)srcuuid=0x9f292d8(36 27)] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[6] : [seq=a] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[7] : [hg=4ddb360f] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[8] : [ts=4def2869] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[9] : [ld=0.17 0.07 0.01 > > 2/73 14132] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[10] : [ttl=3] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[11] : [auth=1 > > ee7d14643b83b7e49684cf0d679ee7e6a0ea3aaa] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: ###yamauchi > > HBDoMsg_T_STATUS RECV : heartbeat_monitor NOCHANGE > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG: Dumping message with > > 12 fields > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[0] : [t=status] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[1] : [st=active] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[2] : [dt=5dc0] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[3] : [protocol=1] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[4] : [src=srv01] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[5] : > > [(1)srcuuid=0x9f292d8(36 27)] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[6] : [seq=a] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[7] : [hg=4ddb360f] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[8] : [ts=4def2869] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[9] : [ld=0.17 0.07 0.01 > > 2/73 14132] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[10] : [ttl=3] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[11] : [auth=1 > > ee7d14643b83b7e49684cf0d679ee7e6a0ea3aaa] > > Jun 8 16:44:41 srv01 heartbeat: [14110]: info: Local status now set to: > > 'active' > > ---------------------------------------------------------------- > > > > But, it is 16:47:04 that srv02 node received F_STATUS message. > > ---------------------------------------------------------------- > > Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: #### yamauchi ##### > > T_STATUS > > Jun 8 16:47:04 srv02 heartbeat: [6690]: info: MSG[10] : [ttl=3] > > Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG: Dumping message > > with 12 fields > > Jun 8 16:47:04 srv02 heartbeat: [6690]: info: MSG[11] : [auth=1 > > 1fef495857b200940cb7fcb61223c85b299a6a99] > > Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[0] : [t=status] > > Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[1] : [st=active] > > Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[2] : [dt=5dc0] > > Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[3] : [protocol=1] > > Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[4] : [src=srv01] > > Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[5] : > > [(1)srcuuid=0x98dcb20(36 27)] > > Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[6] : [seq=a] > > Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[7] : [hg=4ddb360f] > > Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[8] : [ts=4def2869] > > Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[9] : [ld=0.17 0.07 > > 0.01 2/73 14132] > > Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[10] : [ttl=3] > > Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[11] : [auth=1 > > ee7d14643b83b7e49684cf0d679ee7e6a0ea3aaa] > > Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: #### yamauchi ##### > > node_callback() call > > Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: notice: Status update: Node > > srv01 now has status active > > Jun 8 16:47:05 srv02 lha-snmpagent: [6707]: info: node 1: srv02, type: > > normal, status: active > > ---------------------------------------------------------------- > > > > I think that trap of active should be handled earlier. > > > > How do you think? > > To first rule out the obvious: > Double check the time of your servers. > NTP enabled, and really synchronized? > > Well, I guess that's what meant with > > I synchronized at the time of each node and took log. > > So, OK. > > I expect the "trap" to be send when the agent daemon > processes the message. > > The message should be processed when it is delivered to this daemon -- > unless the daemon was busy doing other stuff... it is a single threaded > daemon, after all, maybe it blocked gathering information, sending or > receiving data? > > It should be delivered "immediately" when the heartbeat core messaging > layer receives it from the network. > > You need to figure out where the observed delay happens. > > Do a tcpdump, to find out the timing of the messages on the wire. > Tune up debugging (or add your own) in the message core, > and add debugging to the hbagent. > > If you can pin down where the delay happens, > you should be able to find what is causing the delay. > > Is the message delayed (or lost and retransmitted?) on the network? > Is it delayed in the messaging core? > Both seem unlikely. > > Or was it dispatched to the hbagent in time, > but the hbagent did not notice until two and a half minutes later? > If so, what did the agent do all this time? > > -- > : Lars Ellenberg > : LINBIT | Your Way to High Availability > : DRBD/HA support and consulting http://www.linbit.com > > DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. > _______________________________________________ > Linux-HA mailing list > [email protected] > http://lists.linux-ha.org/mailman/listinfo/linux-ha > See also: http://linux-ha.org/ReportingProblems > _______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
