Re: [Linux-ha-dev] [Patch] The patch which revises memory leak.
On Wed, May 02, 2012 at 10:43:36AM +0900, renayama19661...@ybb.ne.jp wrote: Hi Lars, And when it passes more than a full day * node1 32126 ?SLs 79:52 0 182 71189 24328 0.1 heartbeat: master control process * node2 31928 ?SLs 77:01 0 182 70869 24008 0.1 heartbeat: master control process Oh, I see. This is a design choice (maybe not even intentional) of the Gmain_* wrappers used throughout the heartbeat code. The real glib g_timeout_add_full(), and most other similar functions, internally do id = g_source_attach(source, ...); g_source_unref(source); return id; Thus in g_main_dispatch, the need_destroy = ! dispatch (...) if (need_destroy) g_source_destroy_internal() in fact ends up destroying it, if dispatch() returns FALSE, as documented: The function is called repeatedly until it returns FALSE, at which point the timeout is automatically destroyed and the function will not be called again. Not so with the heartbeat/glue/libplumbing Gmain_timeout_add_full. It does not g_source_unref(), so we keep the extra reference around until someone explicitly calls Gmain_timeout_remove(). Talk about principle of least surprise :( Changing this behaviour to match glib's, i.e. unref'ing after g_source_attach, would seem like the correct thing to do, but is likely to break other pieces of code in subtle ways, so it may not be the right thing to do at this point. I'm going to take your patch more or less as is. If it does not show up soon, prod me again. Thank you for tracking this down. Best Regards, Hideo Yamauchi. --- On Tue, 2012/5/1, renayama19661...@ybb.ne.jp renayama19661...@ybb.ne.jp wrote: Hi Lars, We confirmed that this problem occurred with v1 mode of Heartbeat. * The problem happens with the v2 mode in the same way. We confirmed a problem in the next procedure. Step 1) Put a special device extinguishing a communication packet of Heartbeat in the network. Step 2) Between nodes, the retransmission of the message is carried out repeatedly. Step 3) Then the memory of the master process increases little by little. As a result of the ps command of the master process -- * node1 (start) 32126 ? SLs 0:00 0 182 53989 7128 0.0 heartbeat: master control process (One hour later) 32126 ? SLs 0:03 0 182 54729 7868 0.0 heartbeat: master control process (Two hour later) 32126 ? SLs 0:08 0 182 55317 8456 0.0 heartbeat: master control process (Four hours later) 32126 ? SLs 0:24 0 182 56673 9812 0.0 heartbeat: master control process * node2 (start) 31928 ? SLs 0:00 0 182 53989 7128 0.0 heartbeat: master control process (One hour later) 31928 ? SLs 0:02 0 182 54481 7620 0.0 heartbeat: master control process (Two hour later) 31928 ? SLs 0:08 0 182 55353 8492 0.0 heartbeat: master control process (Four hours later) 31928 ? SLs 0:23 0 182 56689 9828 0.0 heartbeat: master control process The state of the memory leak seems to vary according to a node with the quantity of the retransmission. The increase of this memory disappears by applying my patch. And the similar correspondence seems to be necessary in send_reqnodes_msg(), but this is like little leak. Best Regards, Hideo Yamauchi. --- On Sat, 2012/4/28, renayama19661...@ybb.ne.jp renayama19661...@ybb.ne.jp wrote: Hi Lars, Thank you for comments. Have you actually been able to measure that memory leak you observed, and you can confirm this patch will fix it? Because I don't think this patch has any effect. Yes. I really measured leak. I can show a result next week. #Japan is a holiday until Tuesday. send_rexmit_request() is only used as paramter to Gmain_timeout_add_full, and it returns FALSE always, which should cause the respective sourceid to be auto-removed. It seems to be necessary to release gsource somehow or other. The similar liberation seems to be carried out in lrmd. Best Regards, Hideo Yamauchi. --- On Fri, 2012/4/27, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661...@ybb.ne.jp wrote: Hi All, We gave test that assumed remote cluster environment. And we tested packet lost. The retransmission timer of Heartbeat causes memory leak. I donate a patch. Please confirm the contents of the patch. And please reflect a patch in a repository of Heartbeat. Have you actually been able to measure that memory leak you observed, and you can confirm this patch will fix it? Because I don't think
Re: [Linux-ha-dev] [Patch] The patch which revises memory leak.
Hi Lars, Thank you for comments. And when it passes more than a full day * node1 32126 ? SLs 79:52 0 182 71189 24328 0.1 heartbeat: master control process * node2 31928 ? SLs 77:01 0 182 70869 24008 0.1 heartbeat: master control process Oh, I see. This is a design choice (maybe not even intentional) of the Gmain_* wrappers used throughout the heartbeat code. The real glib g_timeout_add_full(), and most other similar functions, internally do id = g_source_attach(source, ...); g_source_unref(source); return id; Thus in g_main_dispatch, the need_destroy = ! dispatch (...) if (need_destroy) g_source_destroy_internal() in fact ends up destroying it, if dispatch() returns FALSE, as documented: The function is called repeatedly until it returns FALSE, at which point the timeout is automatically destroyed and the function will not be called again. Not so with the heartbeat/glue/libplumbing Gmain_timeout_add_full. It does not g_source_unref(), so we keep the extra reference around until someone explicitly calls Gmain_timeout_remove(). Talk about principle of least surprise :( Changing this behaviour to match glib's, i.e. unref'ing after g_source_attach, would seem like the correct thing to do, but is likely to break other pieces of code in subtle ways, so it may not be the right thing to do at this point. Thank you for detailed explanation. If you found the method that is appropriate than the correction that I suggested, I approve of it. I'm going to take your patch more or less as is. If it does not show up soon, prod me again. All right. Many Thanks! Hideo Yamauchi. Thank you for tracking this down. Best Regards, Hideo Yamauchi. --- On Tue, 2012/5/1, renayama19661...@ybb.ne.jp renayama19661...@ybb.ne.jp wrote: Hi Lars, We confirmed that this problem occurred with v1 mode of Heartbeat. * The problem happens with the v2 mode in the same way. We confirmed a problem in the next procedure. Step 1) Put a special device extinguishing a communication packet of Heartbeat in the network. Step 2) Between nodes, the retransmission of the message is carried out repeatedly. Step 3) Then the memory of the master process increases little by little. As a result of the ps command of the master process -- * node1 (start) 32126 ? SLs 0:00 0 182 53989 7128 0.0 heartbeat: master control process (One hour later) 32126 ? SLs 0:03 0 182 54729 7868 0.0 heartbeat: master control process (Two hour later) 32126 ? SLs 0:08 0 182 55317 8456 0.0 heartbeat: master control process (Four hours later) 32126 ? SLs 0:24 0 182 56673 9812 0.0 heartbeat: master control process * node2 (start) 31928 ? SLs 0:00 0 182 53989 7128 0.0 heartbeat: master control process (One hour later) 31928 ? SLs 0:02 0 182 54481 7620 0.0 heartbeat: master control process (Two hour later) 31928 ? SLs 0:08 0 182 55353 8492 0.0 heartbeat: master control process (Four hours later) 31928 ? SLs 0:23 0 182 56689 9828 0.0 heartbeat: master control process The state of the memory leak seems to vary according to a node with the quantity of the retransmission. The increase of this memory disappears by applying my patch. And the similar correspondence seems to be necessary in send_reqnodes_msg(), but this is like little leak. Best Regards, Hideo Yamauchi. --- On Sat, 2012/4/28, renayama19661...@ybb.ne.jp renayama19661...@ybb.ne.jp wrote: Hi Lars, Thank you for comments. Have you actually been able to measure that memory leak you observed, and you can confirm this patch will fix it? Because I don't think this patch has any effect. Yes. I really measured leak. I can show a result next week. #Japan is a holiday until Tuesday. send_rexmit_request() is only used as paramter to Gmain_timeout_add_full, and it returns FALSE always, which should cause the respective sourceid to be auto-removed. It seems to be necessary to release gsource somehow or other. The similar liberation seems to be carried out in lrmd. Best Regards, Hideo Yamauchi. --- On Fri, 2012/4/27, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661...@ybb.ne.jp wrote: Hi All, We gave test that assumed remote cluster environment. And we tested packet lost. The retransmission timer of Heartbeat causes memory leak. I donate a
Re: [Linux-ha-dev] [Patch] The patch which revises memory leak.
This is very interesting. My apologies for missing this memory leak :-(. The code logs memory usage periodically exactly to help notice such a thing. In my new open source project [http://assimmon.org], I am death on memory leaks. But I can assure you that back when that code was written, it was not at all clear who deleted what memory when - when it came to the glib. I'm not sure if valgrind was out back then, but I certainly didn't know about it. I confess that even on this new project I had a heck of a time making all the glib objects go away. I finally got them cleaned up - but it took weeks of running under valgrind before I worked out when to do what to make it throw the objects away - but not crash due to a bad reference. By the way, I suspect Lars' suggestion would work fine. I would certainly explain what the better patch is in the comments when you apply this one. On 05/02/2012 04:57 PM, renayama19661...@ybb.ne.jp wrote: Hi Lars, Thank you for comments. And when it passes more than a full day * node1 32126 ?SLs 79:52 0 182 71189 24328 0.1 heartbeat: master control process * node2 31928 ?SLs 77:01 0 182 70869 24008 0.1 heartbeat: master control process Oh, I see. This is a design choice (maybe not even intentional) of the Gmain_* wrappers used throughout the heartbeat code. The real glib g_timeout_add_full(), and most other similar functions, internally do id = g_source_attach(source, ...); g_source_unref(source); return id; Thus in g_main_dispatch, the need_destroy = ! dispatch (...) if (need_destroy) g_source_destroy_internal() in fact ends up destroying it, if dispatch() returns FALSE, as documented: The function is called repeatedly until it returns FALSE, at which point the timeout is automatically destroyed and the function will not be called again. Not so with the heartbeat/glue/libplumbing Gmain_timeout_add_full. It does not g_source_unref(), so we keep the extra reference around until someone explicitly calls Gmain_timeout_remove(). Talk about principle of least surprise :( Changing this behaviour to match glib's, i.e. unref'ing after g_source_attach, would seem like the correct thing to do, but is likely to break other pieces of code in subtle ways, so it may not be the right thing to do at this point. Thank you for detailed explanation. If you found the method that is appropriate than the correction that I suggested, I approve of it. I'm going to take your patch more or less as is. If it does not show up soon, prod me again. All right. Many Thanks! Hideo Yamauchi. Thank you for tracking this down. Best Regards, Hideo Yamauchi. --- On Tue, 2012/5/1, renayama19661...@ybb.ne.jprenayama19661...@ybb.ne.jp wrote: Hi Lars, We confirmed that this problem occurred with v1 mode of Heartbeat. * The problem happens with the v2 mode in the same way. We confirmed a problem in the next procedure. Step 1) Put a special device extinguishing a communication packet of Heartbeat in the network. Step 2) Between nodes, the retransmission of the message is carried out repeatedly. Step 3) Then the memory of the master process increases little by little. As a result of the ps command of the master process -- * node1 (start) 32126 ?SLs0:00 0 182 53989 7128 0.0 heartbeat: master control process (One hour later) 32126 ?SLs0:03 0 182 54729 7868 0.0 heartbeat: master control process (Two hour later) 32126 ?SLs0:08 0 182 55317 8456 0.0 heartbeat: master control process (Four hours later) 32126 ?SLs0:24 0 182 56673 9812 0.0 heartbeat: master control process * node2 (start) 31928 ?SLs0:00 0 182 53989 7128 0.0 heartbeat: master control process (One hour later) 31928 ?SLs0:02 0 182 54481 7620 0.0 heartbeat: master control process (Two hour later) 31928 ?SLs0:08 0 182 55353 8492 0.0 heartbeat: master control process (Four hours later) 31928 ?SLs0:23 0 182 56689 9828 0.0 heartbeat: master control process The state of the memory leak seems to vary according to a node with the quantity of the retransmission. The increase of this memory disappears by applying my patch. And the similar correspondence seems to be necessary in send_reqnodes_msg(), but this is like little leak. Best Regards, Hideo Yamauchi. --- On Sat, 2012/4/28, renayama19661...@ybb.ne.jprenayama19661...@ybb.ne.jp wrote: Hi Lars, Thank you for comments. Have you actually been able to measure that memory leak you observed, and you can confirm this patch will fix it? Because I don't think this patch has any effect. Yes. I really measured leak. I can show a result next week. #Japan is a holiday until Tuesday.