FYI: there is code in the heartbeat communication layer which is quite happy to simulate lost packets.
I made it difficult to turn on accidentally. Read the code for details if you're interested.

On 04/30/2012 10:21 PM, renayama19661...@ybb.ne.jp wrote:
> Hi Lars,
>
> We confirmed that this problem occurs with v1 mode of Heartbeat.
> * The problem happens in the same way with v2 mode.
>
> We confirmed the problem with the following procedure.
>
> Step 1) Put a special device that drops Heartbeat communication
>         packets into the network.
>
> Step 2) Messages are retransmitted repeatedly between the nodes.
>
> Step 3) The memory of the master process then increases little by little.
>
>
> -------- Result of the ps command for the master process ----------
> * node1
> (start)
> 32126 ? SLs 0:00 0 182 53989 7128 0.0 heartbeat: master control process
> (One hour later)
> 32126 ? SLs 0:03 0 182 54729 7868 0.0 heartbeat: master control process
> (Two hours later)
> 32126 ? SLs 0:08 0 182 55317 8456 0.0 heartbeat: master control process
> (Four hours later)
> 32126 ? SLs 0:24 0 182 56673 9812 0.0 heartbeat: master control process
>
> * node2
> (start)
> 31928 ? SLs 0:00 0 182 53989 7128 0.0 heartbeat: master control process
> (One hour later)
> 31928 ? SLs 0:02 0 182 54481 7620 0.0 heartbeat: master control process
> (Two hours later)
> 31928 ? SLs 0:08 0 182 55353 8492 0.0 heartbeat: master control process
> (Four hours later)
> 31928 ? SLs 0:23 0 182 56689 9828 0.0 heartbeat: master control process
>
>
> The size of the memory leak seems to vary from node to node with the
> number of retransmissions.
>
> The memory growth disappears when my patch is applied.
>
> A similar fix seems to be necessary in send_reqnodes_msg(),
> but that leak appears to be small.
>
> Best Regards,
> Hideo Yamauchi.
>
>
> --- On Sat, 2012/4/28, renayama19661...@ybb.ne.jp <renayama19661...@ybb.ne.jp> wrote:
>
>> Hi Lars,
>>
>> Thank you for your comments.
>>
>>> Have you actually been able to measure that memory leak you observed,
>>> and you can confirm this patch will fix it?
>>>
>>> Because I don't think this patch has any effect.
>> Yes.
>> I really did measure the leak.
>> I can show the results next week.
>> # Japan is on holiday until Tuesday.
>>
>>> send_rexmit_request() is only used as parameter to
>>> Gmain_timeout_add_full, and it returns FALSE always,
>>> which should cause the respective sourceid to be auto-removed.
>> It seems to be necessary to release the gsource in some way.
>> A similar release seems to be done in lrmd.
>>
>> Best Regards,
>> Hideo Yamauchi.
>>
>>
>> --- On Fri, 2012/4/27, Lars Ellenberg <lars.ellenb...@linbit.com> wrote:
>>
>>> On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661...@ybb.ne.jp wrote:
>>>> Hi All,
>>>>
>>>> We ran a test that assumed a remote cluster environment,
>>>> and we tested packet loss.
>>>>
>>>> The retransmission timer of Heartbeat causes a memory leak.
>>>>
>>>> I am contributing a patch.
>>>> Please review the contents of the patch,
>>>> and please apply it to the Heartbeat repository.
>>> Have you actually been able to measure that memory leak you observed,
>>> and you can confirm this patch will fix it?
>>>
>>> Because I don't think this patch has any effect.
>>>
>>> send_rexmit_request() is only used as parameter to
>>> Gmain_timeout_add_full, and it returns FALSE always,
>>> which should cause the respective sourceid to be auto-removed.
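
For readers following the argument above, here is a minimal sketch of the one-shot rule Lars describes: a timeout callback that returns FALSE is dropped by the dispatcher, so a callback that re-arms itself should leave only one live source at a time. This is a toy model in plain C, not GLib or Heartbeat code; toy_timeout_add, toy_dispatch_once, toy_live_sources and toy_rearming_callback are all hypothetical names. It says nothing about whether the real Gmain wrapper frees everything it allocated per source, which is the point under discussion in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in GLib or Heartbeat. */
#define TOY_MAX_SOURCES 16

typedef bool (*toy_callback)(void *data);

struct toy_source {
    bool in_use;
    toy_callback fn;
    void *data;
};

static struct toy_source toy_sources[TOY_MAX_SOURCES];

/* Register a callback. Returns a 1-based source id, or 0 if the table is full. */
static unsigned toy_timeout_add(toy_callback fn, void *data)
{
    assert(fn != NULL);
    for (unsigned i = 0; i < TOY_MAX_SOURCES; i++) {
        if (!toy_sources[i].in_use) {
            toy_sources[i] = (struct toy_source){ true, fn, data };
            return i + 1;
        }
    }
    return 0;
}

/* Number of sources still registered. */
static unsigned toy_live_sources(void)
{
    unsigned n = 0;
    for (unsigned i = 0; i < TOY_MAX_SOURCES; i++)
        n += toy_sources[i].in_use ? 1u : 0u;
    return n;
}

/* Fire each source that was registered on entry exactly once.
 * A callback that returns false has its source removed. */
static void toy_dispatch_once(void)
{
    bool pending[TOY_MAX_SOURCES];
    for (unsigned i = 0; i < TOY_MAX_SOURCES; i++)
        pending[i] = toy_sources[i].in_use;

    for (unsigned i = 0; i < TOY_MAX_SOURCES; i++) {
        if (pending[i] && !toy_sources[i].fn(toy_sources[i].data))
            toy_sources[i].in_use = false;
    }
}

/* One-shot callback that re-arms itself, then returns false,
 * loosely mirroring a retransmit request that schedules the next one. */
static bool toy_rearming_callback(void *data)
{
    int *fired = data;
    (*fired)++;
    toy_timeout_add(toy_rearming_callback, data);
    return false;
}
```

In this model, registering toy_rearming_callback once and calling toy_dispatch_once() any number of times leaves toy_live_sources() at 1. If the real code behaves the same way at the dispatch level, any growth would have to come from per-source bookkeeping that is not released, which is what the patch targets.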
>>>
>>>
>>>> diff -r 106ca984041b heartbeat/hb_rexmit.c
>>>> --- a/heartbeat/hb_rexmit.c Thu Apr 26 19:28:26 2012 +0900
>>>> +++ b/heartbeat/hb_rexmit.c Thu Apr 26 19:31:44 2012 +0900
>>>> @@ -164,6 +164,8 @@
>>>>      seqno_t seq = (seqno_t) ri->seq;
>>>>      struct node_info* node = ri->node;
>>>>      struct ha_msg* hmsg;
>>>> +    unsigned long sourceid;
>>>> +    gpointer value;
>>>>
>>>>      if (STRNCMP_CONST(node->status, UPSTATUS) != 0&&
>>>>          STRNCMP_CONST(node->status, ACTIVESTATUS) !=0) {
>>>> @@ -196,11 +198,17 @@
>>>>
>>>>      node->track.last_rexmit_req = time_longclock();
>>>>
>>>> -    if (!g_hash_table_remove(rexmit_hash_table, ri)){
>>>> -        cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
>>>> -               "for seq/node(%ld %s)",
>>>> -               __FUNCTION__, ri->seq, ri->node->nodename);
>>>> -        return FALSE;
>>>> +    value = g_hash_table_lookup(rexmit_hash_table, ri);
>>>> +    if ( value != NULL) {
>>>> +        sourceid = (unsigned long) value;
>>>> +        Gmain_timeout_remove(sourceid);
>>>> +
>>>> +        if (!g_hash_table_remove(rexmit_hash_table, ri)){
>>>> +            cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
>>>> +                   "for seq/node(%ld %s)",
>>>> +                   __FUNCTION__, ri->seq, ri->node->nodename);
>>>> +            return FALSE;
>>>> +        }
>>>>      }
>>>>
>>>>      schedule_rexmit_request(node, seq, max_rexmit_delay);
>>>
>>> --
>>> : Lars Ellenberg
>>> : LINBIT | Your Way to High Availability
>>> : DRBD/HA support and consulting http://www.linbit.com
>>>
>>> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
>>> _______________________________________________________
>>> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
>>> Home Page: http://linux-ha.org/

--
Alan Robertson <al...@unix.sh> - @OSSAlanR

"Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce
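
As a rough cross-check of the ps figures quoted earlier in the thread, the growth can be worked out directly. The sketch below assumes that the eighth column of each ps line is the resident set size in KiB (as in `ps v` output); the thread does not say which ps options were used, so treat that as an assumption. The helper names are hypothetical and not part of Heartbeat.

```c
#include <assert.h>

/* Hypothetical helpers for illustration; not part of Heartbeat. */

/* Growth in KiB between two samples of the same process. */
static long rss_growth_kib(long start_kib, long end_kib)
{
    return end_kib - start_kib;
}

/* Average growth per hour over the sampled interval. */
static double rss_growth_per_hour(long start_kib, long end_kib, double hours)
{
    assert(hours > 0.0);
    return (double)rss_growth_kib(start_kib, end_kib) / hours;
}
```

With the node1 samples (7128 at start, 9812 after four hours) this gives 2684 KiB, or about 671 KiB per hour. The node2 samples (7128 to 9828) give 2700 KiB, or about 675 KiB per hour, under the column assumption above.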