Re: [Linux-HA] The active trap of the SNMP is delayed.

renayama19661014 Mon, 25 Jul 2011 21:43:23 -0700

Hi Lars,
Hi All,

A cause to be delayed became clear.


This problem occurs by a timing.

When hbagent receives F_STATUS message while hbagent waits for a reply of the 
api communication, F_STATUS is performed queueing of.

When hbagent caught the event from Heartbeat, this message is handled.
Therefore, it is handled at the time of events such as one down of the 
inter-connect.

Therefore, the active trap of the node is transmitted when inter-connect fell.

/*
 * Read an API message.  All other messages are enqueued to be read later.
 */
static struct ha_msg *
read_api_msg(llc_private_t* pi)
{

        for (;;) {
                struct ha_msg*  msg;
                const char *    type;
                
                pi->chan->ops->waitin(pi->chan);
                if (pi->chan->ch_status  == IPC_DISCONNECT){
                        break;
                }
                if ((msg=msgfromIPC(pi->chan, 0)) == NULL) {
                        ha_api_perror("read_api_msg: "
                                      "Cannot read reply from IPC channel");
                        continue;
                }
                if ((type=ha_msg_value(msg, F_TYPE)) != NULL
                &&      strcmp(type, T_APIRESP) == 0) {
                        return(msg);
                }
                /* Got an unexpected non-api message */
                /* Queue it up for reading later */
                enqueue_msg(pi, msg);
        }
        /*NOTREACHED*/
        return(NULL);
}



I think that the following correction is necessary.
snmp_subagent/hbagent.c
(snip)
                        } else {

                                /* snmp request */
                                snmp_read(&fdset);

                                ret = handle_heartbeat_msg(); ----> read 
queueing msg.!!
                        }
(snip)

Best Regards,
Hideo Yamacuhi.


--- On Fri, 2011/7/22, [email protected] <[email protected]> 
wrote:

> Hi Lars,
> 
> Thank you for advice.
> I confirm the details of the problem using tools such as tcpdump again.
> 
> Please wait....
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> --- On Fri, 2011/7/22, Lars Ellenberg <[email protected]> wrote:
> 
> > On Fri, Jul 22, 2011 at 10:18:18AM +0900, [email protected] wrote:
> > > Hi Lars,
> > > 
> > > Thank you for comment.
> > > 
> > > > You get a *node* active.
> > > > Why do you think this is wrong?
> > > > Which timing would have been "proper", and why?
> > > 
> > > When I examined it before, I changed a source and obtained the following 
> > > result.
> > > 
> > > I synchronized at the time of each node and took log.
> > > 
> > > It is 16:44:41 that srv01 node processed F_STATUS message of active.
> > > ----------------------------------------------------------------
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: ###yamauchi 
> > > send_cluster_msg() : add_controls ###
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG: Dumping message with 
> > > 12 fields
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[0] : [t=status]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[1] : [st=active]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[2] : [dt=5dc0]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[3] : [protocol=1]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[4] : [src=srv01]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[5] : 
> > > [(1)srcuuid=0x9f292d8(36 27)]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[6] : [seq=a]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[7] : [hg=4ddb360f]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[8] : [ts=4def2869]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[9] : [ld=0.17 0.07 
> > > 0.01 2/73 14132]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[10] : [ttl=3]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[11] : [auth=1 
> > > ee7d14643b83b7e49684cf0d679ee7e6a0ea3aaa]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: ###yamauchi 
> > > HBDoMsg_T_STATUS RECV : heartbeat_monitor NOCHANGE
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG: Dumping message with 
> > > 12 fields
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[0] : [t=status]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[1] : [st=active]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[2] : [dt=5dc0]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[3] : [protocol=1]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[4] : [src=srv01]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[5] : 
> > > [(1)srcuuid=0x9f292d8(36 27)]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[6] : [seq=a]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[7] : [hg=4ddb360f]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[8] : [ts=4def2869]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[9] : [ld=0.17 0.07 
> > > 0.01 2/73 14132]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[10] : [ttl=3]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: MSG[11] : [auth=1 
> > > ee7d14643b83b7e49684cf0d679ee7e6a0ea3aaa]
> > > Jun  8 16:44:41 srv01 heartbeat: [14110]: info: Local status now set to: 
> > > 'active'
> > > ----------------------------------------------------------------
> > > 
> > > But, it is 16:47:04 that srv02 node received F_STATUS message.
> > > ----------------------------------------------------------------
> > > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: #### yamauchi ##### 
> > > T_STATUS
> > > Jun  8 16:47:04 srv02 heartbeat: [6690]: info: MSG[10] : [ttl=3]
> > > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG: Dumping message 
> > > with 12 fields
> > > Jun  8 16:47:04 srv02 heartbeat: [6690]: info: MSG[11] : [auth=1 
> > > 1fef495857b200940cb7fcb61223c85b299a6a99]
> > > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[0] : [t=status]
> > > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[1] : [st=active]
> > > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[2] : [dt=5dc0]
> > > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[3] : [protocol=1]
> > > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[4] : [src=srv01]
> > > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[5] : 
> > > [(1)srcuuid=0x98dcb20(36 27)]
> > > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[6] : [seq=a]
> > > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[7] : [hg=4ddb360f]
> > > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[8] : [ts=4def2869]
> > > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[9] : [ld=0.17 0.07 
> > > 0.01 2/73 14132]
> > > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[10] : [ttl=3]
> > > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: MSG[11] : [auth=1 
> > > ee7d14643b83b7e49684cf0d679ee7e6a0ea3aaa]
> > > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: info: #### yamauchi ##### 
> > > node_callback() call
> > > Jun  8 16:47:04 srv02 lha-snmpagent: [6707]: notice: Status update: Node 
> > > srv01 now has status active
> > > Jun  8 16:47:05 srv02 lha-snmpagent: [6707]: info: node 1: srv02, type: 
> > > normal, status: active
> > > ----------------------------------------------------------------
> > > 
> > > I think that trap of active should be handled earlier.
> > > 
> > > How do you think?
> > 
> > To first rule out the obvious:
> > Double check the time of your servers.
> > NTP enabled, and really synchronized?
> > 
> > Well, I guess that's what meant with
> > > I synchronized at the time of each node and took log.
> > 
> > So, OK.
> > 
> > I expect the "trap" to be send when the agent daemon
> > processes the message.
> > 
> > The message should be processed when it is delivered to this daemon --
> > unless the daemon was busy doing other stuff...  it is a single threaded
> > daemon, after all, maybe it blocked gathering information, sending or
> > receiving data?
> > 
> > It should be delivered "immediately" when the heartbeat core messaging
> > layer receives it from the network.
> > 
> > You need to figure out where the observed delay happens.
> > 
> > Do a tcpdump, to find out the timing of the messages on the wire.
> > Tune up debugging (or add your own) in the message core,
> > and add debugging to the hbagent.
> > 
> > If you can pin down where the delay happens,
> > you should be able to find what is causing the delay.
> > 
> > Is the message delayed (or lost and retransmitted?) on the network?
> > Is it delayed in the messaging core?
> > Both seem unlikely.
> > 
> > Or was it dispatched to the hbagent in time,
> > but the hbagent did not notice until two and a half minutes later?
> > If so, what did the agent do all this time?
> > 
> > -- 
> > : Lars Ellenberg
> > : LINBIT | Your Way to High Availability
> > : DRBD/HA support and consulting http://www.linbit.com
> > 
> > DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> > _______________________________________________
> > Linux-HA mailing list
> > [email protected]
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
> > 
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
> 
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] The active trap of the SNMP is delayed.

Reply via email to