Hi Lars,

Thank you for your comment.

> You get a *node* active.
> Why do you think this is wrong?
> Which timing would have been "proper", and why?

When I investigated this earlier, I modified the source code and obtained the following result.
I synchronized the clocks of both nodes before collecting the logs.

The srv01 node processed the F_STATUS message for active at 16:44:41.

----------------------------------------------------------------
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: ###yamauchi send_cluster_msg() : add_controls ###
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG: Dumping message with 12 fields
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[0] : [t=status]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[1] : [st=active]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[2] : [dt=5dc0]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[3] : [protocol=1]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[4] : [src=srv01]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[5] : [(1)srcuuid=0x9f292d8(36 27)]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[6] : [seq=a]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[7] : [hg=4ddb360f]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[8] : [ts=4def2869]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[9] : [ld=0.17 0.07 0.01 2/73 14132]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[10] : [ttl=3]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[11] : [auth=1 ee7d14643b83b7e49684cf0d679ee7e6a0ea3aaa]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: ###yamauchi HBDoMsg_T_STATUS RECV : heartbeat_monitor NOCHANGE
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG: Dumping message with 12 fields
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[0] : [t=status]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[1] : [st=active]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[2] : [dt=5dc0]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[3] : [protocol=1]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[4] : [src=srv01]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[5] : [(1)srcuuid=0x9f292d8(36 27)]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[6] : [seq=a]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[7] : [hg=4ddb360f]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[8] : [ts=4def2869]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[9] : [ld=0.17 0.07 0.01 2/73 14132]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[10] : [ttl=3]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: MSG[11] : [auth=1 ee7d14643b83b7e49684cf0d679ee7e6a0ea3aaa]
Jun 8 16:44:41 srv01 heartbeat: [14110]: info: Local status now set to: 'active'
----------------------------------------------------------------

However, the srv02 node did not receive the F_STATUS message until 16:47:04.

----------------------------------------------------------------
Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: #### yamauchi ##### T_STATUS
Jun 8 16:47:04 srv02 heartbeat: [6690]: info: MSG[10] : [ttl=3]
Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG: Dumping message with 12 fields
Jun 8 16:47:04 srv02 heartbeat: [6690]: info: MSG[11] : [auth=1 1fef495857b200940cb7fcb61223c85b299a6a99]
Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[0] : [t=status]
Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[1] : [st=active]
Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[2] : [dt=5dc0]
Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[3] : [protocol=1]
Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[4] : [src=srv01]
Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[5] : [(1)srcuuid=0x98dcb20(36 27)]
Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[6] : [seq=a]
Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[7] : [hg=4ddb360f]
Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[8] : [ts=4def2869]
Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[9] : [ld=0.17 0.07 0.01 2/73 14132]
Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[10] : [ttl=3]
Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[11] : [auth=1 ee7d14643b83b7e49684cf0d679ee7e6a0ea3aaa]
Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: info: #### yamauchi ##### node_callback() call
Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: notice: Status update: Node srv01 now has status active
Jun 8 16:47:05 srv02 lha-snmpagent: [6707]: info: node 1: srv02, type: normal, status: active
----------------------------------------------------------------

I think the active trap should be handled earlier.
What do you think?

Best Regards,
Hideo Yamauchi.

--- On Fri, 2011/7/22, Lars Ellenberg <[email protected]> wrote:

> On Tue, Jul 19, 2011 at 11:04:51AM +0900, [email protected] wrote:
> > Hi All,
> >
> > We are troubled in the face of this problem.
> > Please give advice.
> >
> > * This problem changed the destination of the mailing list to seem to be a
> > problem of the HA.
> >
> > Best Regards,
> > Hideo Yamauchi.
> >
> >
> >
> > --- On Fri, 2011/6/17, [email protected]
> > <[email protected]> wrote:
> >
> > > Hi All,
> > >
> > > I registered this problem in Bugzilla.
> > >
> > > * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2604
> > >
> > > Best Regards,
> > > Hideo Yamauch.
> > >
> > > --- On Wed, 2011/6/15, [email protected]
> > > <[email protected]> wrote:
> > >
> > > > Hi All,
> > > >
> > > > I found a problem with a trap of the SNMP.(from hbagent.)
> > > >
> > > > A trap of active of the node seems to have possibilities to be delayed.
> > > >
> > > > In addition, this problem sometimes occurs and does not always occur.
> > > >
> > > >
> > > > I confirmed it in the next procedure.
> > > >
> > > > Step1) Start a node.
> > > >
> > > > ============
> > > > Last updated: Wed Jun 15 19:23:39 2011
> > > > Stack: Heartbeat
> > > > Current DC: srv02 (afe72fff-b7b4-4663-b845-872df29c635d) - partition
> > > > WITHOUT quorum
> > > > Version: 1.0.11-6e010d6b0d49a6b929d17c0114e9d2d934dc8e04
> > > > 2 Nodes configured, unknown expected votes
> > > > 1 Resources configured.
> > > > ============
> > > >
> > > > Online: [ srv01 srv02 ]
> > > >
> > > > Resource Group: group-1
> > > > prmDummy1 (ocf::heartbeat:Dummy): Started srv01
> > > >
> > > > Migration summary:
> > > > * Node srv02:
> > > > * Node srv01:
> > > >
> > > >
> > > > Step2) Intercept one interface of the Heartbeat communication.
> > > >
> > > > # iptables -A INPUT -i eth1 -s ! 192.168.10.110 -j DROP
> > > > # iptables -A INPUT -i eth1 -s ! 192.168.10.120 -j DROP
> > > >
> > > >
> > > > Step3) The next trap is received in SNMP managers.
> > > >
> > > > (snip)
> > > > Jun 15 19:24:30 snmp-manager snmptrapd[4771]: 2011-06-15 19:24:30
> > > > <UNKNOWN> [UDP: [192.168.40.120]:59010]:
> > > > DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (23014) 0:03:50.14
> > > > SNMPv2-MIB::snmpTrapOID.0 = OID: LINUX-HA-MIB::LHAIFStatusUpdate
> > > > LINUX-HA-MIB::LHANodeName = STRING: srv01
> > > > LINUX-HA-MIB::LHAIFName = STRING: eth1 LINUX-HA-MIB::LHAIFStatus
> > > > = INTEGER: down(2)
> > > > ----> No problem.
> > > > Jun 15 19:24:32 snmp-manager snmptrapd[4771]: 2011-06-15 19:24:32
> > > > <UNKNOWN> [UDP: [192.168.40.110]:44001]:
> > > > DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (23597) 0:03:55.97
> > > > SNMPv2-MIB::snmpTrapOID.0 = OID: LINUX-HA-MIB::LHANodeStatusUpdate
> > > > LINUX-HA-MIB::LHANodeName = STRING: srv02
> > > > LINUX-HA-MIB::LHANodeStatus = INTEGER: active(3)
> > > > ----> The trap of active is improper in this timing.
>
> Why?
>
> > > > Jun 15 19:24:34 snmp-manager snmptrapd[4771]: 2011-06-15 19:24:34
> > > > <UNKNOWN> [UDP: [192.168.40.110]:44001]:
> > > > DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (23803) 0:03:58.03
> > > > SNMPv2-MIB::snmpTrapOID.0 = OID: LINUX-HA-MIB::LHAIFStatusUpdate
> > > > LINUX-HA-MIB::LHANodeName = STRING: srv02
> > > > LINUX-HA-MIB::LHAIFName = STRING: eth1 LINUX-HA-MIB::LHAIFStatus
> > > > = INTEGER: down(2)
> > > > ----> No problem.
> > > > (snip)
> > > >
> > > > Between the traps which interface intercepted, it is strange that the
> > > > active trap of the node comes.
> > > >
> > > > And I think that it is necessary for the active trap to be sent in an
> > > > earlier timing.
> > > >
> > > >
> > > > This problem seems to happen in Heartbeat2.1.4.
> > > >
> > > > I watched some sources, but think that client_lib of Heartbeat has a
> > > > problem somehow or other.
> > > > Transmitted F_STATUS message is late and seems to be handled.
>
> hbagent is no longer in the heartbeat code.
> According to mercurial, it was removed three years ago.
> I doubt it is/was used by many.
> So I fear you won't get much help for this.
>
>
> Still, I don't see "the problem".
> You have two communication channels configured.
> You block one.
> You get a *link* down trap, immediately, probably because sending fails
> locally if you do iptables -j DROP.
>
> > > > Jun 15 19:24:30 snmp-manager snmptrapd[4771]: LHAIFStatusUpdate
> > > > LHANodeName srv01 LHAIFName eth1 LHAIFStatus down(2)
>
>
> You get a *node* active.
> Why do you think this is wrong?
> Which timing would have been "proper", and why?
>
> > > > Jun 15 19:24:32 snmp-manager snmptrapd[4771]:
> > > > LHANodeStatusUpdate LHANodeName srv02 LHANodeStatus active(3)
>
>
> And after timeout, you get the *link* down to the other node.
>
> > > > Jun 15 19:24:34 snmp-manager snmptrapd[4771]: LHAIFStatusUpdate
> > > > LHANodeName srv02 LHAIFName eth1 LHAIFStatus down(2)
>
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
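
P.S. To make the gap in the logs from my reply concrete (16:44:41 on srv01 versus 16:47:04 on srv02), here is a minimal Python sketch that computes the delay from the two syslog lines. The helper name syslog_time is hypothetical, and the year 2011 is an assumption, because syslog timestamps carry no year. It is only an illustration and is not part of Heartbeat or hbagent. It also assumes an English locale for the month name.

```python
from datetime import datetime


def syslog_time(line: str, year: int = 2011) -> datetime:
    """Parse the leading 'Mon D HH:MM:SS' syslog timestamp of a log line.

    The year is an assumption; syslog lines do not record it.
    """
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


# srv01 sets its local status to active.
sent = "Jun 8 16:44:41 srv01 heartbeat: [14110]: info: Local status now set to: 'active'"
# lha-snmpagent on srv02 reports the status update for srv01.
seen = ("Jun 8 16:47:04 srv02 lha-snmpagent: [6707]: notice: "
        "Status update: Node srv01 now has status active")

delay = syslog_time(seen) - syslog_time(sent)
print(f"status message delayed by {delay.total_seconds():.0f} seconds")
```

With the two lines above, the difference is 143 seconds, which is the delay I am asking about.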
