Re: [Linux-ha-dev] [Patch] The patch which revises memory leak.

2012-05-02 Thread Lars Ellenberg
On Wed, May 02, 2012 at 10:43:36AM +0900, renayama19661...@ybb.ne.jp wrote:
 Hi Lars,
 
 And when more than a full day has passed
 
 * node1
 32126 ?        SLs   79:52      0   182 71189 24328  0.1 heartbeat: master control process
 
 * node2
 31928 ?        SLs   77:01      0   182 70869 24008  0.1 heartbeat: master control process

Oh, I see.

This is a design choice (maybe not even intentional) of the Gmain_*
wrappers used throughout the heartbeat code.

The real glib g_timeout_add_full(), and most other similar functions,
internally do
 id = g_source_attach(source, ...);
 g_source_unref(source);
 return id;

Thus in g_main_dispatch, the
 need_destroy = ! dispatch (...)
 if (need_destroy)
     g_source_destroy_internal()

in fact ends up destroying it,
if dispatch() returns FALSE,
as documented: 
The function is called repeatedly until it returns FALSE, at
which point the timeout is automatically destroyed and the
function will not be called again.

Not so with the heartbeat/glue/libplumbing Gmain_timeout_add_full.
It does not g_source_unref(), so we keep the extra reference around
until someone explicitly calls Gmain_timeout_remove().

Talk about principle of least surprise :(

Changing this behaviour to match glib's, i.e. unref'ing after
g_source_attach, would seem like the correct thing to do,
but is likely to break other pieces of code in subtle ways,
so it may not be the right thing to do at this point.

I'm going to take your patch more or less as is.
If it does not show up soon, prod me again.

Thank you for tracking this down.

 Best Regards,
 Hideo Yamauchi.
 
 
 --- On Tue, 2012/5/1, renayama19661...@ybb.ne.jp renayama19661...@ybb.ne.jp 
 wrote:
 
  Hi Lars,
  
 We confirmed that this problem occurs in v1 mode of Heartbeat.
  * The problem happens in v2 mode in the same way.
 
 We confirmed the problem with the following procedure.
 
 Step 1) Put a device in the network that drops Heartbeat's 
 communication packets.
 
 Step 2) The nodes then repeatedly retransmit messages to each other.
 
 Step 3) The memory of the master process then increases little by little.
  
  
  Results of the ps command for the master process --
  * node1
  (start)
  32126 ?        SLs    0:00      0   182 53989  7128  0.0 heartbeat: master 
  control process
  (One hour later)
  32126 ?        SLs    0:03      0   182 54729  7868  0.0 heartbeat: master 
  control process
  (Two hours later)
  32126 ?        SLs    0:08      0   182 55317  8456  0.0 heartbeat: master 
  control process
  (Four hours later)
  32126 ?        SLs    0:24      0   182 56673  9812  0.0 heartbeat: master 
  control process 
  
  * node2
  (start)
  31928 ?        SLs    0:00      0   182 53989  7128  0.0 heartbeat: master 
  control process
  (One hour later)
  31928 ?        SLs    0:02      0   182 54481  7620  0.0 heartbeat: master 
  control process
  (Two hours later)
  31928 ?        SLs    0:08      0   182 55353  8492  0.0 heartbeat: master 
  control process
  (Four hours later)
  31928 ?        SLs    0:23      0   182 56689  9828  0.0 heartbeat: master 
  control process
  
  
  The extent of the memory leak seems to vary per node with the 
  quantity of retransmission.
  
  This memory growth disappears when my patch is applied.
  
  A similar fix seems to be necessary in send_reqnodes_msg(), 
  but that leak appears to be small.
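  The periodic sampling behind the table above can be scripted; a minimal
  sketch (the PID 32126 and the one-hour interval are just the values
  from this test):

```shell
#!/bin/sh
# sample_mem PID: print the VSZ and RSS (in KiB) of one process.
sample_mem() {
    ps -o vsz=,rss= -p "$1"
}

# Hourly log of the heartbeat master control process, e.g.:
#   while sleep 3600; do sample_mem 32126 >> heartbeat-mem.log; done

# Demonstrate on the current shell itself:
sample_mem $$
```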
  
  Best Regards,
  Hideo Yamauchi.
  
  
  --- On Sat, 2012/4/28, renayama19661...@ybb.ne.jp 
  renayama19661...@ybb.ne.jp wrote:
  
   Hi Lars,
   
   Thank you for comments.
   
Have you actually been able to measure that memory leak you observed,
and you can confirm this patch will fix it?

Because I don't think this patch has any effect.
   
   Yes.
   I really did measure the leak.
   I can show the results next week.
   #Japan is on holiday until Tuesday.
   

send_rexmit_request() is only used as parameter to
Gmain_timeout_add_full, and it returns FALSE always,
which should cause the respective sourceid to be auto-removed.
   
   It seems the GSource needs to be released somehow.
   A similar release seems to be done in lrmd.
   
   Best Regards,
   Hideo Yamauchi.
   
   
   --- On Fri, 2012/4/27, Lars Ellenberg lars.ellenb...@linbit.com wrote:
   
On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661...@ybb.ne.jp 
wrote:
 Hi All,
 
 We ran a test that assumed a remote cluster environment,
 and we tested packet loss.
 
 The retransmission timer of Heartbeat causes a memory leak.
 
 I donate a patch.
 Please confirm the contents of the patch,
 and please apply the patch to the Heartbeat repository.

Have you actually been able to measure that memory leak you observed,
and you can confirm this patch will fix it?

Because I don't think this patch has any effect.

Re: [Linux-ha-dev] [Patch] The patch which revises memory leak.

2012-05-02 Thread renayama19661014
Hi Lars,

Thank you for comments.

  
  And when it passes more than a full day
  
  * node1
  32126 ?        SLs   79:52      0   182 71189 24328  0.1 heartbeat: master 
  control process                        
  
  * node2
  31928 ?        SLs   77:01      0   182 70869 24008  0.1 heartbeat: master 
  control process
 
 Oh, I see.
 
 This is a design choice (maybe not even intentional) of the Gmain_*
 wrappers used throughout the heartbeat code.
 
 The real glib g_timeout_add_full(), and most other similar functions,
 internally do
  id = g_source_attach(source, ...);
  g_source_unref(source);
  return id;
 
 Thus in g_main_dispatch, the
  need_destroy = ! dispatch (...)
  if (need_destroy)
      g_source_destroy_internal()
 
 in fact ends up destroying it,
 if dispatch() returns FALSE,
 as documented: 
     The function is called repeatedly until it returns FALSE, at
     which point the timeout is automatically destroyed and the
     function will not be called again.
 
 Not so with the heartbeat/glue/libplumbing Gmain_timeout_add_full.
 It does not g_source_unref(), so we keep the extra reference around
 until someone explicitly calls Gmain_timeout_remove().
 
 Talk about principle of least surprise :(
 
 Changing this behaviour to match glib's, i.e. unref'ing after
 g_source_attach, would seem like the correct thing to do,
 but is likely to break other pieces of code in subtle ways,
 so it may not be the right thing to do at this point.

Thank you for the detailed explanation.
If you find a method more appropriate than the correction I 
suggested, I approve of it.

 I'm going to take your patch more or less as is.
 If it does not show up soon, prod me again.
 

All right.

Many Thanks!
Hideo Yamauchi.  


 Thank you for tracking this down.
 
  Best Regards,
  Hideo Yamauchi.
  
  
  --- On Tue, 2012/5/1, renayama19661...@ybb.ne.jp 
  renayama19661...@ybb.ne.jp wrote:
  
   Hi Lars,
   
   We confirmed that this problem occurred with v1 mode of Heartbeat.
    * The problem happens with the v2 mode in the same way.
   
   We confirmed a problem in the next procedure.
   
   Step 1) Put a special device extinguishing a communication packet of 
   Heartbeat in the network.
   
   Step 2) Between nodes, the retransmission of the message is carried out 
   repeatedly.
   
   Step 3) Then the memory of the master process increases little by little.
   
   
    As a result of the ps command of the master process --
   * node1
   (start)
   32126 ?        SLs    0:00      0   182 53989  7128  0.0 heartbeat: 
   master control process
   (One hour later)
   32126 ?        SLs    0:03      0   182 54729  7868  0.0 heartbeat: 
   master control process
   (Two hours later)
   32126 ?        SLs    0:08      0   182 55317  8456  0.0 heartbeat: 
   master control process
   (Four hours later)
   32126 ?        SLs    0:24      0   182 56673  9812  0.0 heartbeat: 
   master control process 
   
   * node2
   (start)
   31928 ?        SLs    0:00      0   182 53989  7128  0.0 heartbeat: 
   master control process
   (One hour later)
   31928 ?        SLs    0:02      0   182 54481  7620  0.0 heartbeat: 
   master control process
   (Two hours later)
   31928 ?        SLs    0:08      0   182 55353  8492  0.0 heartbeat: 
   master control process
   (Four hours later)
   31928 ?        SLs    0:23      0   182 56689  9828  0.0 heartbeat: 
   master control process
   
   
   The state of the memory leak seems to vary according to a node with the 
   quantity of the retransmission.
   
   The increase of this memory disappears by applying my patch.
   
   And the similar correspondence seems to be necessary in 
   send_reqnodes_msg(), but this is like little leak.
   
   Best Regards,
   Hideo Yamauchi.
   
   
   --- On Sat, 2012/4/28, renayama19661...@ybb.ne.jp 
   renayama19661...@ybb.ne.jp wrote:
   
Hi Lars,

Thank you for comments.

 Have you actually been able to measure that memory leak you observed,
 and you can confirm this patch will fix it?
 
 Because I don't think this patch has any effect.

Yes.
I really measured leak.
I can show a result next week.
#Japan is a holiday until Tuesday.

 
 send_rexmit_request() is only used as parameter to
 Gmain_timeout_add_full, and it returns FALSE always,
 which should cause the respective sourceid to be auto-removed.

It seems to be necessary to release gsource somehow or other.
The similar liberation seems to be carried out in lrmd.

Best Regards,
Hideo Yamauchi.


--- On Fri, 2012/4/27, Lars Ellenberg lars.ellenb...@linbit.com wrote:

 On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661...@ybb.ne.jp 
 wrote:
  Hi All,
  
  We gave test that assumed remote cluster environment.
  And we tested packet lost.
  
  The retransmission timer of Heartbeat causes memory leak.
  
  I donate a patch.

Re: [Linux-ha-dev] [Patch] The patch which revises memory leak.

2012-05-02 Thread Alan Robertson
This is very interesting.  My apologies for missing this memory leak 
:-(.  The code logs memory usage periodically exactly to help notice 
such a thing.

In my new open source project [http://assimmon.org], I am death on 
memory leaks.  But I can assure you that back when that code was 
written, it was not at all clear who deleted what memory when - when it 
came to the glib.  I'm not sure if valgrind was out back then, but I 
certainly didn't know about it.

I confess that even on this new project I had a heck of a time making 
all the glib objects go away.  I finally got them cleaned up - but it 
took weeks of running under valgrind before I worked out when to do what 
to make it throw the objects away - but not crash due to a bad reference.

 By the way, I suspect Lars' suggestion would work fine.  When you 
 apply this patch, I would certainly explain in the comments what the 
 better fix would be.


On 05/02/2012 04:57 PM, renayama19661...@ybb.ne.jp wrote:
 Hi Lars,

 Thank you for comments.

 And when it passes more than a full day

 * node1
 32126 ?        SLs   79:52      0   182 71189 24328  0.1 heartbeat: master control process
 
 * node2
 31928 ?        SLs   77:01      0   182 70869 24008  0.1 heartbeat: master control process
 Oh, I see.

 This is a design choice (maybe not even intentional) of the Gmain_*
 wrappers used throughout the heartbeat code.

 The real glib g_timeout_add_full(), and most other similar functions,
 internally do
   id = g_source_attach(source, ...);
   g_source_unref(source);
   return id;

 Thus in g_main_dispatch, the
   need_destroy = ! dispatch (...)
   if (need_destroy)
       g_source_destroy_internal()

 in fact ends up destroying it,
 if dispatch() returns FALSE,
 as documented:
  The function is called repeatedly until it returns FALSE, at
  which point the timeout is automatically destroyed and the
  function will not be called again.

 Not so with the heartbeat/glue/libplumbing Gmain_timeout_add_full.
 It does not g_source_unref(), so we keep the extra reference around
 until someone explicitly calls Gmain_timeout_remove().

 Talk about principle of least surprise :(

 Changing this behaviour to match glib's, i.e. unref'ing after
 g_source_attach, would seem like the correct thing to do,
 but is likely to break other pieces of code in subtle ways,
 so it may not be the right thing to do at this point.
 Thank you for the detailed explanation.
 If you find a method more appropriate than the correction I 
 suggested, I approve of it.

 I'm going to take your patch more or less as is.
 If it does not show up soon, prod me again.

 All right.

 Many Thanks!
 Hideo Yamauchi.


 Thank you for tracking this down.

 Best Regards,
 Hideo Yamauchi.


 --- On Tue, 2012/5/1, 
 renayama19661...@ybb.ne.jprenayama19661...@ybb.ne.jp  wrote:

 Hi Lars,

 We confirmed that this problem occurred with v1 mode of Heartbeat.
* The problem happens with the v2 mode in the same way.

 We confirmed a problem in the next procedure.

 Step 1) Put a special device extinguishing a communication packet of 
 Heartbeat in the network.

 Step 2) Between nodes, the retransmission of the message is carried out 
 repeatedly.

 Step 3) Then the memory of the master process increases little by little.


  As a result of the ps command of the master process --
 * node1
 (start)
 32126 ?        SLs    0:00      0   182 53989  7128  0.0 heartbeat: master control process
 (One hour later)
 32126 ?        SLs    0:03      0   182 54729  7868  0.0 heartbeat: master control process
 (Two hours later)
 32126 ?        SLs    0:08      0   182 55317  8456  0.0 heartbeat: master control process
 (Four hours later)
 32126 ?        SLs    0:24      0   182 56673  9812  0.0 heartbeat: master control process
 
 * node2
 (start)
 31928 ?        SLs    0:00      0   182 53989  7128  0.0 heartbeat: master control process
 (One hour later)
 31928 ?        SLs    0:02      0   182 54481  7620  0.0 heartbeat: master control process
 (Two hours later)
 31928 ?        SLs    0:08      0   182 55353  8492  0.0 heartbeat: master control process
 (Four hours later)
 31928 ?        SLs    0:23      0   182 56689  9828  0.0 heartbeat: master control process


 The state of the memory leak seems to vary according to a node with the 
 quantity of the retransmission.

 The increase of this memory disappears by applying my patch.

 And the similar correspondence seems to be necessary in 
 send_reqnodes_msg(), but this is like little leak.

 Best Regards,
 Hideo Yamauchi.


 --- On Sat, 2012/4/28, 
 renayama19661...@ybb.ne.jprenayama19661...@ybb.ne.jp  wrote:

 Hi Lars,

 Thank you for comments.

 Have you actually been able to measure that memory leak you observed,
 and you can confirm this patch will fix it?

 Because I don't think this patch has any effect.
 Yes.
 I really measured leak.
 I can show a result next week.
 #Japan is a holiday until Tuesday.

Re: [Linux-HA] heartbeat strange behavior

2012-05-02 Thread Lars Ellenberg
On Mon, Apr 30, 2012 at 01:52:05PM -0300, Douglas Pasqua wrote:
 Hi friends,
 
 I created a Linux-HA solution using 2 nodes: node-a and node-b.
 
 My /etc/ha.d/ha.cf:
 
 use_logd yes
 keepalive 1
 deadtime 90
 warntime 5
 initdead 120
 bcast eth6
 node node-a
 node node-b
 crm off
 auto_failback off
 
 My /etc/ha.d/haresources
 node-a x.x.x.x/24 x.x.x.x/24 x.x.x.x/24 service1 service2 service3
 
 I booted the two nodes together. node-a became master and node-b became
 slave. Afterwards, I rebooted node-a. Then node-b became master. When node-a
 returned from the reboot, it became slave, because *auto_failback is off*,
 I think. All as expected until here.
 
 With node-a as slave, I decided to halt node-a (using the halt command).
 Then heartbeat on node-b went into standby and my cluster was down. The
 virtual IPs went down too. I expected node-b to stay up. Why did this happen?
 
 Some log from node-b:
 
 Apr 30 00:02:57 node-b heartbeat: [3082]: info: Received shutdown notice
 from 'node-a'.
 Apr 30 00:02:57 node-b heartbeat: [3082]: info: Resources being acquired
 from node-a.
 Apr 30 00:02:57 node-b heartbeat: [4414]: debug: notify_world: setting
 SIGCHLD Handler to SIG_DFL
 Apr 30 00:02:57 node-b harc[4414]: [4428]: info: Running
 /etc/ha.d/rc.d/status status
 Apr 30 00:02:57 node-b heartbeat: [4416]: info: No local resources
 [/usr/share/heartbeat/ResourceManager listkeys node-b] to acquire.
 Apr 30 00:02:57 node-b heartbeat: [3082]: debug: StartNextRemoteRscReq():
 child count 1
 
 Apr 30 00:02:58 node-b ResourceManager[4462]: [4657]: debug:
 /etc/init.d/asterisk  start done. RC=1
 Apr 30 00:02:58 node-b ResourceManager[4462]: [4658]: ERROR: Return code 1
 from /etc/init.d/asterisk
 Apr 30 00:02:58 node-b ResourceManager[4462]: [4659]: CRIT: Giving up
 resources due to failure of asterisk

Because of the above error when starting asterisk.  Maybe your asterisk
init script is simply not idempotent.  Maybe it is broken, or maybe
there really was some problem trying to start asterisk.
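For example, a start action can be made idempotent roughly like this (a
sketch with made-up paths and a `sleep` standing in for the daemon; the
real asterisk init script would need its own equivalent):

```shell
#!/bin/sh
# Sketch: an idempotent "start" -- starting an already-running
# service succeeds (exit 0) instead of failing with RC=1.
PIDFILE=/tmp/example-svc.pid

start() {
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        echo "already running"
        return 0              # a second start is a successful no-op
    fi
    sleep 60 >/dev/null 2>&1 &    # stand-in for the real daemon
    echo $! > "$PIDFILE"
    echo "started"
}

first=$(start)
second=$(start)               # must also succeed on a running service
echo "$first / $second"       # prints: started / already running

# demo cleanup
kill "$(cat "$PIDFILE")" 2>/dev/null
rm -f "$PIDFILE"
```

With a start like this, heartbeat acquiring resources that are already
partly up would not give them all back over a spurious RC=1.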


 Apr 30 00:02:58 node-b ResourceManager[4462]: [4660]: info: Releasing
 resource group: node-a x.x.x.x/24 x.x.x.x/24 x.x.x.x/24 asterisk
 sincronismo notificacao
 Apr 30 00:02:58 node-b ResourceManager[4462]: [4670]: info: Running
 /etc/init.d/notificacao  stop
 Apr 30 00:02:58 node-b ResourceManager[4462]: [4671]: debug: Starting
 /etc/init.d/notificacao  stop
 
 Apr 30 00:02:58 node-b ResourceManager[4462]: [4694]: debug:
 /etc/init.d/notificacao  stop done. RC=0
 Apr 30 00:02:58 node-b ResourceManager[4462]: [4704]: info: Running
 /etc/init.d/sincronismo  stop
 Apr 30 00:02:58 node-b ResourceManager[4462]: [4705]: debug: Starting
 /etc/init.d/sincronismo  stop
 Apr 30 00:02:58 node-b ResourceManager[4462]: [4711]: debug:
 /etc/init.d/sincronismo  stop done. RC=0
 Apr 30 00:02:58 node-b ResourceManager[4462]: [4720]: info: Running
 /etc/init.d/asterisk  stop
 Apr 30 00:02:58 node-b ResourceManager[4462]: [4721]: debug: Starting
 /etc/init.d/asterisk  stop
 Apr 30 00:02:58 node-b ResourceManager[4462]: [4725]: debug:
 /etc/init.d/asterisk  stop done. RC=0
 Apr 30 00:02:58 node-b ResourceManager[4462]: [4741]: info: Running
 /etc/ha.d/resource.d/IPaddr x.x.x.x/24 stop
 Apr 30 00:02:58 node-b ResourceManager[4462]: [4742]: debug: Starting
 /etc/ha.d/resource.d/IPaddr x.x.x.x/24 stop
 
 Apr 30 00:03:29 node-b heartbeat: [3082]: info: node-b wants to go standby
 [foreign]
 Apr 30 00:03:39 node-b heartbeat: [3082]: WARN: No reply to standby
 request.  Standby request cancelled.
 Apr 30 00:04:29 node-b heartbeat: [3082]: WARN: node node-a: is dead
 Apr 30 00:04:29 node-b heartbeat: [3082]: info: Dead node node-a gave up
 resources.
 Apr 30 00:04:29 node-b heartbeat: [3082]: info: Link node-a:eth6 dead.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] weird problem w/ R1

2012-05-02 Thread Dimitri Maziuk
Hi everyone,

I must be overlooking something obvious... I have a simple haresources
setup with

node drbddisk::sessdata Filesystem::/dev/drbd0::/raid::ext3::rw \
ip.addr httpd xinetd pure_ftpd pure_uploadscript bacula-client mon

bacula-client is in /etc/ha.d/resource.d; it's a copy of the stock
/etc/init.d/bacula-fd with the config, lock, and pid file changed so it
listens on a non-standard port: this is for backing up the drbd filesystem
(the standard bacula client is also running).

bacula-client doesn't start. I added a couple of 'logger' lines and if I
manually run /etc/ha.d/resource.d/bacula-client start ; echo $? I get
0 and the log:
node logger: starting bacula-fd -c /etc/bacula/deposit-fd.conf
node logger: bacula-fd -c /etc/bacula/deposit-fd.conf running

Yet on failover I get this:

node ResourceManager[3734]: info: Running /etc/init.d/httpd  start
node ResourceManager[3734]: info: Running /etc/init.d/xinetd  start
node ResourceManager[3734]: info: Running /etc/ha.d/resource.d/pure_ftpd
 start
node xinetd[4204]: xinetd Version 2.3.14 started with libwrap loadavg
labeled-networking options compiled in.
node xinetd[4204]: Started working: 1 available service
node ResourceManager[3734]: info: Running
/etc/ha.d/resource.d/pure_uploadscript  start
node ResourceManager[3734]: info: Running /etc/init.d/mon  start

It doesn't seem to run that particular script: it starts
pure_uploadscript from resource.d and mon from init.d, but not the one
in between. What's weird is I now have it happening on 2 clusters:
centos 5 w/ heartbeat 2.1.4, and centos 6 w/ heartbeat 3.0.4. The only
common thing is bacula version: 5.

Any ideas?

TIA
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


