Re: [Linux-HA] The active trap of the SNMP is delayed.

renayama19661014 Thu, 21 Jul 2011 18:18:39 -0700

Hi Lars,

Thank you for your comment.


> You get a *node* active.
> Why do you think this is wrong?
> Which timing would have been "proper", and why?

When I investigated this earlier, I modified the source to add log output and obtained the following result.

I synchronized the clocks of both nodes before collecting the logs.

The srv01 node processed the F_STATUS message with status active at 16:44:41.
----------------------------------------------------------------
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: ###yamauchi send_cluster_msg() : add_controls ###
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG: Dumping message with 12 fields
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[0] : [t=status]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[1] : [st=active]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[2] : [dt=5dc0]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[3] : [protocol=1]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[4] : [src=srv01]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[5] : [(1)srcuuid=0x9f292d8(36 27)]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[6] : [seq=a]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[7] : [hg=4ddb360f]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[8] : [ts=4def2869]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[9] : [ld=0.17 0.07 0.01 2/73 14132]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[10] : [ttl=3]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[11] : [auth=1 ee7d14643b83b7e49684cf0d679ee7e6a0ea3aaa]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: ###yamauchi HBDoMsg_T_STATUS RECV : heartbeat_monitor NOCHANGE
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG: Dumping message with 12 fields
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[0] : [t=status]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[1] : [st=active]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[2] : [dt=5dc0]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[3] : [protocol=1]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[4] : [src=srv01]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[5] : [(1)srcuuid=0x9f292d8(36 27)]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[6] : [seq=a]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[7] : [hg=4ddb360f]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[8] : [ts=4def2869]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[9] : [ld=0.17 0.07 0.01 2/73 14132]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[10] : [ttl=3]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[11] : [auth=1 ee7d14643b83b7e49684cf0d679ee7e6a0ea3aaa]
Jun  8 16:44:41 srv01 heartbeat: [14110]: info: Local status now set to: 'active'
----------------------------------------------------------------
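
As a cross-check, the hexadecimal fields in the dump above can be decoded. The sketch below is only an illustration: it assumes that ts is a Unix timestamp in hexadecimal seconds, that dt is the deadtime in hexadecimal milliseconds, and that the nodes log in JST (UTC+9). These are my assumptions and are not stated in the dump itself. The helper name decode_fields is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fields copied from the srv01 dump above.
FIELDS = {"dt": "5dc0", "seq": "a", "ts": "4def2869"}

# Assumption: the nodes log in JST (UTC+9).
JST = timezone(timedelta(hours=9))


def decode_fields(fields):
    """Decode hex message fields (hypothetical helper, for illustration)."""
    return {
        # Assumption: dt is the deadtime in milliseconds.
        "deadtime_ms": int(fields["dt"], 16),
        "seq": int(fields["seq"], 16),
        # Assumption: ts is seconds since the Unix epoch.
        "sent_at": datetime.fromtimestamp(int(fields["ts"], 16), tz=JST),
    }


decoded = decode_fields(FIELDS)
print(decoded["deadtime_ms"])          # 24000
print(decoded["seq"])                  # 10
print(decoded["sent_at"].isoformat())  # 2011-06-08T16:44:41+09:00
```

Under these assumptions the ts value decodes to 16:44:41, the same second at which srv01 logged the message.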

However, the srv02 node did not receive the F_STATUS message until 16:47:04.
----------------------------------------------------------------
Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: #### yamauchi ##### T_STATUS
Jun  8 16:47:04 srv02 heartbeat: [6690]: info: MSG[10] : [ttl=3]
Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG: Dumping message with 12 fields
Jun  8 16:47:04 srv02 heartbeat: [6690]: info: MSG[11] : [auth=1 1fef495857b200940cb7fcb61223c85b299a6a99]
Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[0] : [t=status]
Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[1] : [st=active]
Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[2] : [dt=5dc0]
Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[3] : [protocol=1]
Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[4] : [src=srv01]
Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[5] : [(1)srcuuid=0x98dcb20(36 27)]
Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[6] : [seq=a]
Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[7] : [hg=4ddb360f]
Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[8] : [ts=4def2869]
Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[9] : [ld=0.17 0.07 0.01 2/73 14132]
Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[10] : [ttl=3]
Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[11] : [auth=1 ee7d14643b83b7e49684cf0d679ee7e6a0ea3aaa]
Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: #### yamauchi ##### node_callback() call
Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: notice: Status update: Node srv01 now has status active
Jun  8 16:47:05 srv02 lha-snmpagent: [6707]: info: node 1: srv02, type: normal, status: active
----------------------------------------------------------------
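
The delay between the two logs can be computed from the syslog timestamps. This is a minimal sketch: the function names syslog_time and delay_seconds are hypothetical, the year 2011 is supplied by me because syslog lines carry no year, and it assumes the clocks of both nodes are synchronized as described above.

```python
from datetime import datetime

# Log lines copied from the two excerpts above (wrapped lines joined).
SRV01_LINE = ("Jun  8 16:44:41 srv01 heartbeat: [14110]: info: "
              "Local status now set to: 'active'")
SRV02_LINE = ("Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: notice: "
              "Status update: Node srv01 now has status active")


def syslog_time(line, year=2011):
    """Parse the leading 'Mon DD HH:MM:SS' stamp; syslog omits the year."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}",
                             "%Y %b %d %H:%M:%S")


def delay_seconds(sent_line, received_line):
    """Seconds elapsed between two syslog lines (hypothetical helper)."""
    delta = syslog_time(received_line) - syslog_time(sent_line)
    return delta.total_seconds()


print(delay_seconds(SRV01_LINE, SRV02_LINE))  # 143.0
```

That is 143 seconds, or 2 minutes 23 seconds, between srv01 setting its status to active and lha-snmpagent on srv02 reporting it.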

I think the active trap should be handled earlier.

What do you think?

Best Regards,
Hideo Yamauchi.


--- On Fri, 2011/7/22, Lars Ellenberg <[email protected]> wrote:

> On Tue, Jul 19, 2011 at 11:04:51AM +0900, [email protected] wrote:
> > Hi All,
> > 
> > We are struggling with this problem.
> > Please give us advice.
> > 
> > * I moved this thread to this mailing list because it seems to be a 
> > problem in HA.
> > 
> > Best Regards,
> > Hideo Yamauchi.
> > 
> > 
> > 
> > --- On Fri, 2011/6/17, [email protected] 
> > <[email protected]> wrote:
> > 
> > > Hi All,
> > > 
> > > I registered this problem in Bugzilla.
> > > 
> > >  * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2604
> > > 
> > > Best Regards,
> > > Hideo Yamauchi.
> > > 
> > > --- On Wed, 2011/6/15, [email protected] 
> > > <[email protected]> wrote:
> > > 
> > > > Hi All,
> > > > 
> > > > I found a problem with an SNMP trap (from hbagent).
> > > >
> > > > The active trap for a node can apparently be delayed.
> > > > 
> > > > The problem is intermittent; it does not occur every time.
> > > > 
> > > > 
> > > > I confirmed it with the following procedure.
> > > > 
> > > > Step1) Start a node.
> > > > 
> > > > ============
> > > > Last updated: Wed Jun 15 19:23:39 2011
> > > > Stack: Heartbeat
> > > > Current DC: srv02 (afe72fff-b7b4-4663-b845-872df29c635d) - partition 
> > > > WITHOUT quorum
> > > > Version: 1.0.11-6e010d6b0d49a6b929d17c0114e9d2d934dc8e04
> > > > 2 Nodes configured, unknown expected votes
> > > > 1 Resources configured.
> > > > ============
> > > > 
> > > > Online: [ srv01 srv02 ]
> > > > 
> > > >  Resource Group: group-1
> > > >      prmDummy1  (ocf::heartbeat:Dummy): Started srv01
> > > > 
> > > > Migration summary:
> > > > * Node srv02: 
> > > > * Node srv01: 
> > > > 
> > > > 
> > > > Step2) Block one interface used for Heartbeat communication.
> > > > 
> > > > # iptables -A INPUT -i eth1 -s ! 192.168.10.110 -j DROP
> > > > # iptables -A INPUT -i eth1 -s ! 192.168.10.120 -j DROP
> > > > 
> > > > 
> > > > Step3) The SNMP manager receives the following traps.
> > > > 
> > > > (snip)
> > > > Jun 15 19:24:30 snmp-manager snmptrapd[4771]: 2011-06-15 19:24:30 
> > > > <UNKNOWN> [UDP: [192.168.40.120]:59010]: 
> > > > DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (23014) 0:03:50.14     
> > > >   SNMPv2-MIB::snmpTrapOID.0 = OID: LINUX-HA-MIB::LHAIFStatusUpdate      
> > > >   LINUX-HA-MIB::LHANodeName = STRING: srv01       
> > > > LINUX-HA-MIB::LHAIFName = STRING: eth1       LINUX-HA-MIB::LHAIFStatus 
> > > > = INTEGER: down(2) 
> > > >    ----> No problem.
> > > > Jun 15 19:24:32 snmp-manager snmptrapd[4771]: 2011-06-15 19:24:32 
> > > > <UNKNOWN> [UDP: [192.168.40.110]:44001]: 
> > > > DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (23597) 0:03:55.97     
> > > >   SNMPv2-MIB::snmpTrapOID.0 = OID: LINUX-HA-MIB::LHANodeStatusUpdate    
> > > >   LINUX-HA-MIB::LHANodeName = STRING: srv02       
> > > > LINUX-HA-MIB::LHANodeStatus = INTEGER: active(3)
> > > >    ----> The timing of this active trap is improper.
> 
> Why?
> 
> > > > Jun 15 19:24:34 snmp-manager snmptrapd[4771]: 2011-06-15 19:24:34 
> > > > <UNKNOWN> [UDP: [192.168.40.110]:44001]: 
> > > > DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (23803) 0:03:58.03     
> > > >   SNMPv2-MIB::snmpTrapOID.0 = OID: LINUX-HA-MIB::LHAIFStatusUpdate      
> > > >   LINUX-HA-MIB::LHANodeName = STRING: srv02       
> > > > LINUX-HA-MIB::LHAIFName = STRING: eth1       LINUX-HA-MIB::LHAIFStatus 
> > > > = INTEGER: down(2) 
> > > >    ----> No problem.
> > > > (snip)
> > > > 
> > > > It is strange that the active trap for the node arrives between the 
> > > > two interface-down traps.
> > > > 
> > > > I also think that the active trap needs to be sent at an earlier 
> > > > point.
> > > > 
> > > > 
> > > > This problem seems to occur in Heartbeat 2.1.4.
> > > > 
> > > > I looked through some of the source, and I suspect that client_lib of 
> > > > Heartbeat has a problem.
> > > > The transmitted F_STATUS message seems to be handled late.
> 
> hbagent is no longer in the heartbeat code.
> According to mercurial, it was removed three years ago.
> I doubt it is/was used by many.
> So I fear you won't get much help for this.
> 
> 
> Still, I don't see "the problem".
> You have two communication channels configured.
> You block one.
> You get a *link* down trap, immediately, probably because sending fails
> locally if you do iptables -j DROP.
> 
> > > > Jun 15 19:24:30 snmp-manager snmptrapd[4771]: LHAIFStatusUpdate 
> > > > LHANodeName srv01 LHAIFName eth1 LHAIFStatus down(2)
> 
> 
> You get a *node* active.
> Why do you think this is wrong?
> Which timing would have been "proper", and why?
> 
> > > > Jun 15 19:24:32 snmp-manager snmptrapd[4771]: 
> > > > LHANodeStatusUpdate LHANodeName srv02 LHANodeStatus active(3)
> 
> 
> And after timeout, you get the *link* down to the other node.
> 
> > > > Jun 15 19:24:34 snmp-manager snmptrapd[4771]: LHAIFStatusUpdate 
> > > > LHANodeName srv02 LHAIFName eth1 LHAIFStatus down(2) 
> 
> 
> -- 
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
> 
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
> 
