We're looking for some help here, contract or otherwise, or if someone can
point me to some comprehensive operational documentation...that would be
great.

Everything was working fine with our ganglia install for the last 4 months
until 2 weeks ago when we added and then removed 2 nodes.

I've verified that those servers are no longer running gmond.

I've verified that all nodes are forward and reverse DNS visible to each
other, and all the nodes in each cluster report the correct cluster name
from their local gmond. There are no duplicate cluster names, nor are there
nodes in more than one cluster.

NTP is in use on all these servers and their clocks are within 0.05 seconds
of each other.


The symptoms I am seeing include:

Our gmonds report one set of information when I telnet to the local port,
but the headnodes and the gmetad in our grid report completely different
information.

One or more hosts in a cluster lose track of one or more other members (the
TN attribute climbs beyond 20 and the graphs go completely out of whack).
Occasionally, stopping every gmond and gmetad, waiting ~5 minutes, and
turning them all back on makes things look OK for ~10 minutes. Then some
hosts are again reported as having last reported more than 20 seconds ago.
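In case it helps anyone reproduce what I'm seeing, here is a minimal sketch of the check I run: it parses the XML that gmond serves on its TCP port and flags hosts whose TN exceeds the 20-second heartbeat interval. The sample document is made up for illustration; against a real node I capture the XML with `nc <host> 8649`.

```python
# Sketch: flag hosts whose TN (seconds since last report) exceeds the
# heartbeat interval.  Parses the XML gmond serves on its TCP port;
# SAMPLE_XML below is made up for illustration, not real output.
import xml.etree.ElementTree as ET

SAMPLE_XML = """<GANGLIA_XML VERSION="3.0.4" SOURCE="gmond">
<CLUSTER NAME="web" OWNER="Twitter, Inc." LATLONG="unspecified" URL="">
  <HOST NAME="web001" IP="10.0.0.1" REPORTED="1200000000" TN="4" TMAX="20" DMAX="86400"/>
  <HOST NAME="web002" IP="10.0.0.2" REPORTED="1200000000" TN="73" TMAX="20" DMAX="86400"/>
</CLUSTER>
</GANGLIA_XML>"""

def stale_hosts(xml_text, threshold=20):
    """Return (host name, TN) pairs whose TN exceeds threshold."""
    root = ET.fromstring(xml_text)
    return [(h.get("NAME"), int(h.get("TN")))
            for h in root.iter("HOST")
            if int(h.get("TN", "0")) > threshold]

if __name__ == "__main__":
    for name, tn in stale_hosts(SAMPLE_XML):
        print(f"{name}: TN={tn}")  # prints web002: TN=73
```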

I changed the location field in gmond because I suspected gmond was not
reporting new data properly. Now when I look at the location field on each
gmond it is as it should be, but on the headnodes each of the other nodes'
location fields shows "unspecified" *EXCEPT* occasionally, when some will
be right and others wrong.
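To make the mismatch concrete, this is roughly how I diff the two views. A sketch only: both XML strings are stand-ins for the real output I capture with `nc <host> 8649` on the node itself and on a headnode.

```python
# Sketch: compare the location a node's own gmond reports against what a
# headnode's gmond claims for the same host.  Both XML strings are
# illustrative stand-ins for output captured with `nc <host> 8649`.
import xml.etree.ElementTree as ET

def locations(xml_text):
    """Map host name -> value of the 'location' metric in gmond XML."""
    out = {}
    for host in ET.fromstring(xml_text).iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == "location":
                out[host.get("NAME")] = metric.get("VAL")
    return out

LOCAL = """<GANGLIA_XML VERSION="3.0.4" SOURCE="gmond"><CLUSTER NAME="web">
<HOST NAME="web001" TN="2"><METRIC NAME="location" VAL="NTTA Lundy Fremont CA"/></HOST>
</CLUSTER></GANGLIA_XML>"""

HEADNODE = """<GANGLIA_XML VERSION="3.0.4" SOURCE="gmond"><CLUSTER NAME="web">
<HOST NAME="web001" TN="44"><METRIC NAME="location" VAL="unspecified"/></HOST>
</CLUSTER></GANGLIA_XML>"""

if __name__ == "__main__":
    local, head = locations(LOCAL), locations(HEADNODE)
    for name, loc in local.items():
        if head.get(name) != loc:
            print(f"{name}: local={loc!r} headnode={head.get(name)!r}")
```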

The gmetad server reports the correct location information for a few nodes,
but mostly not. It also shows the stale last-reported-time information for
some nodes. A telnet to a local node's gmond confirms that gmetad does not
have its latest report.


Is there somewhere these processes might be stashing old data or retrieving
bad data?
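The only candidates I'm aware of (an educated guess, corrections welcome): gmond itself keeps departed hosts in memory for host_dmax seconds after their last packet, which with our setting of 86400 (below) means up to a full day of ghost data from the removed nodes; and gmetad writes per-host RRD files on disk (commonly under /var/lib/ganglia/rrds) that are not deleted when a host disappears. If that theory holds, tightening retention would look something like:

```
globals {
  /* drop hosts an hour after their last packet instead of a full day */
  host_dmax = 3600 /*secs */
}
```

though that only ages stale data out faster; it wouldn't explain live nodes' metrics being wrong.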



We've got a single gmetad in the default configuration, with the only
deviations being:

scalable on
authority "http://ganglia.twitter.com"
gridname "Twitter"
setuid off

Our gmond configuration looks like:

/* This configuration is as close to 2.5.x default behavior as possible
   The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
  daemonize = yes
  setuid = yes
  user = ganglia
  debug_level = 0
  max_udp_msg_len = 1472
  mute = no
  deaf = no
  host_dmax = 86400 /*secs */
  cleanup_threshold = 300 /*secs */
  gexec = no

}

/* If a cluster attribute is specified, then all gmond hosts are wrapped
 * inside of a <CLUSTER> tag.  If you do not specify a cluster tag, then
 * all <HOSTS> will NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
  name = "<%= ganglia_node_type %>"
  owner = "Twitter, Inc."
  latlong = "unspecified"
  url = "http://ganglia.twitter.com"
}

/* The host section describes attributes of the host, like the location */
host {
  location = "NTTA Lundy Fremont CA"
}

/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  mcast_join = 239.2.11.71
  port       = <%= ganglia_send_port %>
  ttl        = 1
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  mcast_join = 239.2.11.71
  port       = <%= ganglia_recv_port %>
}

tcp_accept_channel {
  interface = eth1
  port = 8649
}
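One thing I wondered about while pasting this: the tcp_accept_channel above listens on eth1 only, so a headnode or gmetad polling an address on any other interface would fail to connect at all. I don't think that is our problem (the polls do return data, just stale data), but for reference the unrestricted form would be:

```
tcp_accept_channel {
  /* listen on all interfaces, not just eth1 */
  port = 8649
}
```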

/* The old internal 2.5.x metric array has been replaced by the following
   collection_group directives.  What follows is the default behavior for
   collecting and sending metrics that is as close to 2.5.x behavior as
   possible. */

/* This collection group will cause a heartbeat (or beacon) to be sent every
   20 seconds.  In the heartbeat is the GMOND_STARTED data which expresses
   the age of the running gmond. */
collection_group {
  collect_once = yes
  time_threshold = 20
  metric {
    name = "heartbeat"
  }
}

/* This collection group will send general info about this host every
   1200 secs.  This information doesn't change between reboots and is
   only collected once. */
collection_group {
  collect_once = yes
  time_threshold = 1200
  metric {
    name = "cpu_num"
  }
  metric {
    name = "cpu_speed"
  }
  metric {
    name = "mem_total"
  }
  /* Should this be here? Swap can be added/removed between reboots. */
  metric {
    name = "swap_total"
  }
  metric {
    name = "boottime"
  }
  metric {
    name = "machine_type"
  }
  metric {
    name = "os_name"
  }
  metric {
    name = "os_release"
  }
  metric {
    name = "location"
  }
}

/* This collection group will send the status of gexecd for this host
   every 300 secs.  Unlike 2.5.x, the default behavior is to report
   gexecd OFF. */
collection_group {
  collect_once = yes
  time_threshold = 300
  metric {
    name = "gexec"
  }
}

/* This collection group will collect the CPU status info every 20 secs.
   The time threshold is set to 90 seconds.  In honesty, this
   time_threshold could be set significantly higher to reduce
   unnecessary network chatter. */
collection_group {
  collect_every = 20
  time_threshold = 90
  /* CPU status */
  metric {
    name = "cpu_user"
    value_threshold = "1.0"
  }
  metric {
    name = "cpu_system"
    value_threshold = "1.0"
  }
  metric {
    name = "cpu_idle"
    value_threshold = "5.0"
  }
  metric {
    name = "cpu_nice"
    value_threshold = "1.0"
  }
  metric {
    name = "cpu_aidle"
    value_threshold = "5.0"
  }
  metric {
    name = "cpu_wio"
    value_threshold = "1.0"
  }
  /* The next two metrics are optional if you want more detail...
     ... since they are accounted for in cpu_system.
  metric {
    name = "cpu_intr"
    value_threshold = "1.0"
  }
  metric {
    name = "cpu_sintr"
    value_threshold = "1.0"
  }
  */
}

collection_group {
  collect_every = 20
  time_threshold = 90
  /* Load Averages */
  metric {
    name = "load_one"
    value_threshold = "1.0"
  }
  metric {
    name = "load_five"
    value_threshold = "1.0"
  }
  metric {
    name = "load_fifteen"
    value_threshold = "1.0"
  }
}

/* This group collects the number of running and total processes */
collection_group {
  collect_every = 80
  time_threshold = 950
  metric {
    name = "proc_run"
    value_threshold = "1.0"
  }
  metric {
    name = "proc_total"
    value_threshold = "1.0"
  }
}

/* This collection group grabs the volatile memory metrics every 40 secs
   and sends them at least every 180 secs.  This time_threshold can be
   increased significantly to reduce unneeded network traffic. */
collection_group {
  collect_every = 40
  time_threshold = 180
  metric {
    name = "mem_free"
    value_threshold = "1024.0"
  }
  metric {
    name = "mem_shared"
    value_threshold = "1024.0"
  }
  metric {
    name = "mem_buffers"
    value_threshold = "1024.0"
  }
  metric {
    name = "mem_cached"
    value_threshold = "1024.0"
  }
  metric {
    name = "swap_free"
    value_threshold = "1024.0"
  }
}

collection_group {
  collect_every = 40
  time_threshold = 300
  metric {
    name = "bytes_out"
    value_threshold = 4096
  }
  metric {
    name = "bytes_in"
    value_threshold = 4096
  }
  metric {
    name = "pkts_in"
    value_threshold = 256
  }
  metric {
    name = "pkts_out"
    value_threshold = 256
  }
}

/* Different than 2.5.x default since the old config made no sense */
collection_group {
  collect_every = 1800
  time_threshold = 3600
  metric {
    name = "disk_total"
    value_threshold = 1.0
  }
}

collection_group {
  collect_every = 40
  time_threshold = 180
  metric {
    name = "disk_free"
    value_threshold = 1.0
  }
  metric {
    name = "part_max_used"
    value_threshold = 1.0
  }
}
_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general
