[Ganglia-general] Ganglia nodes heartbeat error after ganglia web vm move

Konrad, Karl-Heinz Mon, 13 Aug 2018 13:53:53 -0700

Hi All,
I recently upgraded my ESX host to 6.7.  To do so, I migrated the Ganglia web
VM to another ESX host and then migrated it back to the original host.  I made
no changes on my nodes, and no changes on the Ganglia web VM other than the
migration.  Since the move, gmond on my nodes has been logging heartbeat
errors.  The curious thing is that one of my grids, the remote one, displays
successfully, while the local grid does not display properly.


Here are my services on the Ganglia web server:
[root@ca6web ~]# rpm -qa|grep gmond
ganglia-gmond-3.7.2-2.el7.x86_64
[root@ca6web ~]# systemctl status gmond
● gmond.service - Ganglia Monitoring Daemon
   Loaded: loaded (/usr/lib/systemd/system/gmond.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2018-08-03 14:33:05 PDT; 1 weeks 2 days ago
  Process: 6447 ExecStart=/usr/sbin/gmond (code=exited, status=0/SUCCESS)
 Main PID: 6448 (gmond)
   CGroup: /system.slice/gmond.service
           └─6448 /usr/sbin/gmond

Aug 03 14:33:05 ca6web.wai.com systemd[1]: Starting Ganglia Monitoring Daemon...
Aug 03 14:33:05 ca6web.wai.com systemd[1]: Started Ganglia Monitoring Daemon.
Hint: Some lines were ellipsized, use -l to show in full.
[root@ca6web ~]# systemctl status gmetad
● gmetad.service - Ganglia Meta Daemon
   Loaded: loaded (/usr/lib/systemd/system/gmetad.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2018-08-03 14:15:16 PDT; 1 weeks 2 days ago
 Main PID: 5408 (gmetad)
   CGroup: /system.slice/gmetad.service
           └─5408 /usr/sbin/gmetad -d 1

Aug 03 14:15:16 ca6web.wai.com systemd[1]: Starting Ganglia Meta Daemon...
Aug 03 14:15:16 ca6web.wai.com gmetad[5408]: Sources are ...
Aug 03 14:15:16 ca6web.wai.com gmetad[5408]: Source: [NM1, step 60] has 1 so...s
Aug 03 14:15:16 ca6web.wai.com gmetad[5408]: xxx.xxx.xxx.xxx
Aug 03 14:15:16 ca6web.wai.com gmetad[5408]: Source: [CA6, step 15] has 1 so...s
Aug 03 14:15:16 ca6web.wai.com gmetad[5408]: 127.0.0.1
Aug 03 14:15:16 ca6web.wai.com gmetad[5408]: Data thread 139791241848576 is ...e
Aug 03 14:15:16 ca6web.wai.com gmetad[5408]: xxx.xxx.xxx.xxx
Aug 03 14:15:16 ca6web.wai.com gmetad[5408]: Data thread 139791233455872 is ...e
Aug 03 14:15:16 ca6web.wai.com gmetad[5408]: 127.0.0.1
Hint: Some lines were ellipsized, use -l to show in full.

Here is the status result from one of my nodes:
[root@ca6node6 ~]# systemctl status gmond -l
● gmond.service - Ganglia Monitoring Daemon
   Loaded: loaded (/usr/lib/systemd/system/gmond.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2018-08-03 13:30:09 PDT; 1 weeks 2 days ago
  Process: 23914 ExecStart=/usr/sbin/gmond (code=exited, status=0/SUCCESS)
 Main PID: 23915 (gmond)
   CGroup: /system.slice/gmond.service
           └─23915 /usr/sbin/gmond

Aug 13 13:03:33 ca6node6.wai.com /usr/sbin/gmond[23915]: Error 1 sending the modular data for heartbeat
Aug 13 13:03:43 ca6node6.wai.com /usr/sbin/gmond[23915]: Error 1 sending the modular data for cpu_num
Aug 13 13:03:53 ca6node6.wai.com /usr/sbin/gmond[23915]: Error 1 sending the modular data for heartbeat
Aug 13 13:04:13 ca6node6.wai.com /usr/sbin/gmond[23915]: Error 1 sending the modular data for heartbeat
Aug 13 13:04:23 ca6node6.wai.com /usr/sbin/gmond[23915]: Error 1 sending the modular data for proc_run
Aug 13 13:04:33 ca6node6.wai.com /usr/sbin/gmond[23915]: Error 1 sending the modular data for heartbeat
Aug 13 13:04:43 ca6node6.wai.com /usr/sbin/gmond[23915]: Error 1 sending the modular data for cpu_num
Aug 13 13:04:53 ca6node6.wai.com /usr/sbin/gmond[23915]: Error 1 sending the modular data for heartbeat
Aug 13 13:05:13 ca6node6.wai.com /usr/sbin/gmond[23915]: Error 1 sending the modular data for heartbeat
Aug 13 13:05:33 ca6node6.wai.com /usr/sbin/gmond[23915]: Error 1 sending the modular data for heartbeat

Here is the gmond.conf file from my nodes.  I use unicast for the gmond daemon.

/* This configuration is as close to 2.5.x default behavior as possible
   The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
  daemonize = yes
  setuid = yes
  user = nobody
  debug_level = 0
  max_udp_msg_len = 1472
  mute = no
  deaf = yes
  allow_extra_data = yes
  host_dmax = 3600 /*secs. Expires (removes from web interface) hosts in 1 hour */
  host_tmax = 20 /*secs */
  cleanup_threshold = 300 /*secs */
  gexec = no
  # By default gmond will use reverse DNS resolution when displaying your hostname
  # Uncommenting following value will override that value.
  # override_hostname = "mywebserver.domain.com"
  # If you are not using multicast this value should be set to something other than 0.
  # Otherwise if you restart aggregator gmond you will get empty graphs. 60 seconds is reasonable
  send_metadata_interval = 60 /*secs */

}

/*
* The cluster attributes specified will be used as part of the <CLUSTER>
* tag that will wrap all hosts collected by this instance.
*/
cluster {
  name = "CA6"
  owner = "My company"
  latlong = "Lat and Long"
  url = "http://ganglia.wai.com"
}

/* The host section describes attributes of the host, like the location */
host {
  location = "server room"
}

/* Feel free to specify as many udp_send_channels as you like.  Gmond used to 
only support having a single channel */
udp_send_channel {
  bind_hostname = yes # Highly recommended, soon to be default.
                       # This option tells gmond to use a source address
                       # that resolves to the machine's hostname.  Without
                       # this, the metrics may appear to come from any
                       # interface and the DNS names associated with
                       # those IPs will be used to create the RRDs.
  host = Web server IP
  port = 8649
  ttl = 1
}
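
Since bind_hostname = yes tells gmond to use a source address that resolves
to the machine's hostname, one thing worth ruling out on a node is whether
that hostname still resolves to an address the node can actually bind.  The
following is a minimal sketch, assuming Python 3 is available on the node.
The function name resolve_and_bind is hypothetical, and the check only
approximates the behaviour described in the config comment; it is not taken
from gmond's source.

```python
import socket


def resolve_and_bind(hostname=None):
    """Resolve a hostname (default: this machine's) and try to bind a UDP socket to it.

    Returns (hostname, address, status). This mimics the idea behind
    bind_hostname, not gmond's actual implementation (assumption).
    """
    name = hostname or socket.gethostname()
    try:
        addr = socket.gethostbyname(name)  # IPv4 lookup only
    except socket.gaierror as exc:
        return name, None, f"resolve failed: {exc}"

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((addr, 0))  # port 0: let the kernel pick any free port
        return name, addr, "ok"
    except OSError as exc:
        return name, addr, f"bind failed: {exc}"
    finally:
        sock.close()
```

Calling print(resolve_and_bind()) on a node shows which address the
hostname maps to.  A "bind failed" status, or an address that is not on the
interface facing the web server, would be worth following up.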

Here is the gmond.conf from the Web server:
/* This configuration is as close to 2.5.x default behavior as possible
   The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
  daemonize = yes
  setuid = yes
  user = nobody
  debug_level = 0
  max_udp_msg_len = 1472
  mute = yes
  deaf = no
  allow_extra_data = yes
  host_dmax = 3600 /*secs. Expires (removes from web interface) hosts in 1 hour */
  host_tmax = 20 /*secs */
  cleanup_threshold = 300 /*secs */
  gexec = no
  # By default gmond will use reverse DNS resolution when displaying your hostname
  # Uncommenting following value will override that value.
  # override_hostname = "mywebserver.domain.com"
  # If you are not using multicast this value should be set to something other than 0.
  # Otherwise if you restart aggregator gmond you will get empty graphs. 60 seconds is reasonable
  send_metadata_interval = 60 /*secs */

}

/*
* The cluster attributes specified will be used as part of the <CLUSTER>
* tag that will wrap all hosts collected by this instance.
*/
cluster {
  name = "CA6"
  owner = "My name"
  latlong = "Lat and long"
  url = "http://ganglia.wai.com"
}

/* The host section describes attributes of the host, like the location */
host {
  location = "Server Room"
}

/* Feel free to specify as many udp_send_channels as you like.  Gmond used to
only support having a single channel */
udp_send_channel {
  bind_hostname = yes # Highly recommended, soon to be default.
                       # This option tells gmond to use a source address
                       # that resolves to the machine's hostname.  Without
                       # this, the metrics may appear to come from any
                       # interface and the DNS names associated with
                       # those IPs will be used to create the RRDs.
  host = xxx.xxx.xxx.xxx
  port = 8649
  ttl = 1
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  port = 8649
  # Size of the UDP buffer. If you are handling lots of metrics you really
  # should bump it up to e.g. 10MB or even higher.
  # buffer = 10485760
}

Here is the gmetad.conf file from the Web server:
data_source "CA6" 127.0.0.1:8649
data_source "NM1" 60 xxx.xxx.xxx.xxx:8650
gridname "Thornton Tomasetti"

All others are defaults.
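
For the local grid, gmetad reads from 127.0.0.1:8649, so what the web
server's gmond has actually received can be inspected directly.  The
following is a minimal sketch, assuming Python 3 on the web server and
assuming gmond serves its XML over TCP on port 8649 (the config excerpt
above does not show a tcp_accept_channel, so that part is an assumption).
fetch_xml and list_hosts are hypothetical helper names.  TN is the number of
seconds since a host was last heard from.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_xml(host="127.0.0.1", port=8649, timeout=5.0):
    """Read the XML dump that gmond writes to a TCP client before closing."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closed the connection: dump is complete
                break
            chunks.append(data)
    return b"".join(chunks)


def list_hosts(xml_bytes):
    """Return (cluster, host, seconds_since_last_report) for each HOST element."""
    root = ET.fromstring(xml_bytes)
    rows = []
    for cluster in root.iter("CLUSTER"):
        for host in cluster.iter("HOST"):
            rows.append(
                (cluster.get("NAME"), host.get("NAME"), int(host.get("TN", "0")))
            )
    return rows
```

Running "for row in list_hosts(fetch_xml()): print(row)" on the web server
lists what gmond knows.  Nodes that are missing, or whose TN is far above
host_tmax (20 seconds here), are not being heard by the collector.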

I have uninstalled and reinstalled the rpm to no avail.

I can successfully connect to the web server:
[root@ca6node1 ~]# nc -uv ca6web 8649
Ncat: Version 6.40 ( http://nmap.org/ncat )
Ncat: Connected to xxx.xxx.xxx.xxx:8649.
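
One caveat about this check: UDP is connectionless, so "Connected" here only
means the local socket was set up, not that a datagram reached the web
server.  A slightly stronger probe is sketched below, assuming Python 3 on
the node.  probe_udp is a hypothetical helper, and it assumes gmond simply
discards a datagram it cannot decode.  It sends one datagram and then waits
briefly, which gives the kernel a chance to report an error such as an ICMP
"port unreachable".

```python
import socket


def probe_udp(host, port, payload=b"probe", wait=1.0):
    """Send one UDP datagram and report any error the OS surfaces for it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(wait)
    try:
        sock.connect((host, port))  # fixes the peer; sends nothing yet
        sock.send(payload)
        try:
            sock.recv(1)  # a pending ICMP error, if any, is raised here
        except socket.timeout:
            return "sent, no error reported"
        return "sent, got a reply"
    except OSError as exc:
        return f"error: {exc}"
    finally:
        sock.close()
```

For example, print(probe_udp("ca6web", 8649)) from a node.  An "error: ..."
result points at the node's network path or firewall.  "sent, no error
reported" is still not proof of delivery, because a firewall that silently
drops packets looks the same.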

I am at my wits' end.  This configuration has been running successfully for at 
least a year.  The really strange thing is that the data is being collected 
from my remote node successfully and being displayed properly.

Any help is appreciated.


Karl-Heinz Konrad
Consultant
Information Technology
Thornton Tomasetti
19200 Stevens Creek Blvd., Suite 100
Cupertino, CA 95014
T +1.650.230.0210    F +1.650.230.0209
D +1.650.230.0262   M +1.831.246.1687
kkon...@thorntontomasetti.com<mailto:karl-heinz.kon...@wai.com>
www.ThorntonTomasetti.com<http://www.thorntontomasetti.com/>

The information in this email and any attachments may contain confidential 
information that is intended solely for the attention and use of the named 
addressee(s). This message or any part thereof must not be disclosed, copied, 
distributed or retained by any person without authorization from the addressee. 
If you are not the intended addressee, please notify the sender immediately, 
and delete this message.




_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
