On Fri, Jun 13, 2008 at 09:57:23AM -0600, Brad Nicholes wrote:
> What is the status of this patch?  Does it need to be proposed for backport?  
> If so, 3.1.x and 3.0.x?

the changes suggested at the end of this email are not yet complete. the
attached patch removes the hardcoded "four times" multiplier but is still
missing the configuration check to validate that the heartbeat is smaller than
tmax (and preferably at least 4 times smaller when using multicast).

testing (especially in multicast setups and with different settings per host in
the same cluster) is encouraged.
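
as a rough illustration of the configuration check that is still missing, here
is a minimal sketch in Python. It is not part of the attached patch: the
function name `check_host_tmax`, its parameters and the message texts are all
hypothetical, and the real check would presumably live in gmond's C
configuration handling.

```python
def check_host_tmax(host_tmax, heartbeat, multicast=False, factor=4):
    """Sketch of the missing sanity check (hypothetical helper, not in the patch).

    Returns a list of (level, message) tuples; an empty list means the
    relationship between host_tmax and the heartbeat interval looks sane.
    """
    problems = []
    if heartbeat >= host_tmax:
        # A host would be flagged as down before its next heartbeat is due.
        problems.append(
            ("error",
             "host_tmax (%d) must be larger than the heartbeat (%d)"
             % (host_tmax, heartbeat)))
    elif multicast and heartbeat * factor > host_tmax:
        # Multicast packets can be lost, so leave room for a few misses.
        problems.append(
            ("warning",
             "host_tmax (%d) should be at least %d times the heartbeat (%d)"
             % (host_tmax, factor, heartbeat)))
    return problems


if __name__ == "__main__":
    for level, message in check_host_tmax(40, 20, multicast=True):
        print("%s: %s" % (level, message))
```

the error case is treated as fatal and the multicast case only as a warning,
which matches "smaller and preferably 4 times smaller" above, but that split
is an assumption too.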

Carlo
> 
> >>> On 6/3/2008 at 5:39 AM, in message <[EMAIL PROTECTED]>, Carlo
> Marcelo Arenas Belon <[EMAIL PROTECTED]> wrote:
> > On Tue, Jun 03, 2008 at 11:36:55AM +0200, Sebastien Piechurski wrote:
> >> Here is the same patch made against revision 1373 of the svn repository.
> > 
> > just one minor caveat, for consistency (not that it will make any difference
> > though, as the value is overloaded anyway), the default from libgmond should
> > match the one from gmond.c and from the embedded configuration (copied
> > excerpt below from the patch, as it was not possible to reply to your patch
> > directly because it was attached and not inlined)
> > 
> > Index: lib/libgmond.c
> > ===================================================================
> > --- lib/libgmond.c  (revision 1375)
> > +++ lib/libgmond.c  (working copy)
> > @@ -57,6 +57,7 @@
> >    CFG_BOOL("mute", 0, CFGF_NONE),
> >    CFG_BOOL("deaf", 0, CFGF_NONE),
> >    CFG_INT("host_dmax", 0, CFGF_NONE),
> > +  CFG_INT("host_tmax", 20, CFGF_NONE),
> >    CFG_INT("cleanup_threshold", 300, CFGF_NONE),
> >    CFG_BOOL("gexec", 0, CFGF_NONE),
> >    CFG_INT("send_metadata_interval", 0, CFGF_NONE),
> > 
> >> I also added some explanations to the documentation, but am not sure  
> >> the explanation is really clear nor exact.
> > 
> > it is correct, if a little strange, but will let any native speaker correct
> > the grammar if needed.
> > 
> > Committed revision 1376
> > 
> > I also suspect that it will be probably better to also change the logic for
> > gmond so that it will use the host_tmax value directly instead of a multiple
> > of it with a hardcoded "4 attempts" to avoid confusion, and most likely also
> > add some verification and documentation in the relationship that should be
> > maintained between this value and the heartbeat, to avoid broken setups.
> > 
> > Carlo
Index: gmetad-python/Gmetad/gmetad_data.py
===================================================================
--- gmetad-python/Gmetad/gmetad_data.py (revision 1420)
+++ gmetad-python/Gmetad/gmetad_data.py (working copy)
@@ -117,7 +117,7 @@
                     hostNode.lastReportedTime = reportedTime
                     
                 try:
-                    if clusterUp and (int(hostNode.getAttr('tn')) < int(hostNode.getAttr('tmax'))*4):
+                    if clusterUp and (int(hostNode.getAttr('tn')) < int(hostNode.getAttr('tmax'))):
                         clusterNode.summaryData['hosts_up'] += 1
                     else:
                         clusterNode.summaryData['hosts_down'] += 1
Index: gmetad/process_xml.c
===================================================================
--- gmetad/process_xml.c        (revision 1420)
+++ gmetad/process_xml.c        (working copy)
@@ -420,7 +420,7 @@
     */
    xmldata->host_alive = (xmldata->old || !tmax) ?
       abs(xmldata->source.localtime - reported) < 60 :
-      tn < tmax * 4;
+      tn < tmax;
 
    if (xmldata->host_alive)
       xmldata->source.hosts_up++;
Index: lib/default_conf.h.in
===================================================================
--- lib/default_conf.h.in       (revision 1420)
+++ lib/default_conf.h.in       (working copy)
@@ -15,7 +15,7 @@
   mute = no             \n\
   deaf = no             \n\
   host_dmax = 0 /*secs */ \n\
-  host_tmax = 20 /*secs */ \n\
+  host_tmax = 80 /*secs */ \n\
   cleanup_threshold = 300 /*secs */ \n\
   gexec = no             \n\
   send_metadata_interval = 0     \n\
Index: lib/libgmond.c
===================================================================
--- lib/libgmond.c      (revision 1420)
+++ lib/libgmond.c      (working copy)
@@ -58,7 +58,7 @@
   CFG_BOOL("mute", 0, CFGF_NONE),
   CFG_BOOL("deaf", 0, CFGF_NONE),
   CFG_INT("host_dmax", 0, CFGF_NONE),
-  CFG_INT("host_tmax", 20, CFGF_NONE),
+  CFG_INT("host_tmax", 80, CFGF_NONE),
   CFG_INT("cleanup_threshold", 300, CFGF_NONE),
   CFG_BOOL("gexec", 0, CFGF_NONE),
   CFG_INT("send_metadata_interval", 0, CFGF_NONE),
Index: gmond/gmond.c
===================================================================
--- gmond/gmond.c       (revision 1420)
+++ gmond/gmond.c       (working copy)
@@ -64,7 +64,7 @@
 /* The number of seconds to hold "dead" hosts in the hosts hash */
 int host_dmax = 0;
 /* The number of seconds to wait for a message before considering it down */
-int host_tmax = 20;
+int host_tmax = 80;
 /* The amount of time between cleanups */
 int cleanup_threshold = 300;
 /* Time interval before send another metadata packet */
Index: gmond/conf.pod
===================================================================
--- gmond/conf.pod      (revision 1420)
+++ gmond/conf.pod      (working copy)
@@ -94,7 +94,7 @@
     setuid = true
     user = nobody
     host_dmax = 3600
-    host_tmax = 40
+    host_tmax = 120
   }
 
 The B<daemonize> attribute is a boolean.  When true, B<gmond> will 
@@ -123,11 +123,11 @@
 
 The B<host_tmax> value is an integer with units in seconds. This value
 represents the amount of time after which B<gmond> should have received
-updates from a host. As messages may get lost in the network, B<gmond>
-will consider the host as being down if it has not received any messages
-from it after 4 times this value. For example, if B<host_tmax> is set 
-to 20, the host will appear as down after 80 seconds with no messages
-from it. By the way, tmax means "timeout max".
+updates from a host; after that B<gmond> will consider the host as being
+down.  This value should be larger than the heartbeat and should account
+for the possibility of packet loss, especially when using multicast.  It is
+recommended to be at least 4 times the heartbeat value and was historically
+hardcoded to 80.  By the way, tmax means "timeout max".
 
 The B<cleanup_threshold> is the minimum about of time before gmond
 will cleanup and hosts or metrics where B<tn> > B<dmax> a.k.a. expired
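
To see why the defaults move from 20 to 80 in the patch above, the following
sketch compares the old and new liveness tests. It is only an illustration in
Python: the function names are hypothetical, and the fallback branch in
process_xml.c (used for old sources or when tmax is 0) is deliberately left
out.

```python
def host_alive_old(tn, tmax):
    # Before the patch: a host is up while tn is below four times tmax.
    return tn < tmax * 4


def host_alive_new(tn, tmax):
    # After the patch: tmax is used directly, with no hidden multiplier.
    return tn < tmax


if __name__ == "__main__":
    # Old default (20) and new default (80) give the same verdict for any tn.
    for tn in (0, 19, 20, 79, 80, 120):
        print(tn, host_alive_old(tn, 20), host_alive_new(tn, 80))
```

A setup that keeps an explicit host_tmax = 20 in its configuration would,
under the new test, mark hosts as down after 20 seconds instead of 80, so
existing configurations may need their value multiplied by 4.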