[Ganglia-general] gmond occasionally doesn't connect up in unicast

2014-11-12 Thread Sam Barham
We've got about 100 machines running on AWS EC2, with Ganglia for
monitoring.  Because we are on Amazon, we can't use multicast, so each
cluster has a bastion machine, and gmond on every other machine in the
cluster sends its data to the bastion, which gmetad then queries.  All
standard and sensible, and it works just fine.

Except that occasionally, when I redeploy the machines in a cluster (but
not the bastion - that stays running through this operation), just one of
the machines will not send data through to the bastion, or at least that
is how it looks.  All I can say for sure is that gmond is running OK on the
problem machine and there are no errors in the logs on the problem machine,
the bastion or the gmetad machine, but the machine doesn't appear in
gmetad.  If I go into the problem machine and restart gmond, it reconnects
just fine and appears in gmetad.

Which machine has the error is random - it's not a particular type of
machine or anything.  Because the error only shows up rarely, and only at
deployment time, I can't really turn on debug_level to investigate.
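
Since the failure is rare and silent, one way to catch it right after a
deploy is to ask the bastion's gmond which hosts it currently knows about
and compare that with the hosts that were just deployed. The sketch below
assumes the bastion's gmond has a tcp_accept_channel on the default port
8649 and that it dumps its state as XML with one HOST element per machine.
The function names and the expected-host list are hypothetical, and the
NAME attribute may be a hostname or an IP address depending on how the
bastion resolves names, so the expected list has to use the same form.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=10.0):
    """Read the XML dump that gmond writes to a TCP client, until it closes."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def reporting_hosts(ganglia_xml):
    """Return the set of NAME attributes of all HOST elements in the dump."""
    root = ET.fromstring(ganglia_xml)
    return {h.get("NAME") for h in root.iter("HOST") if h.get("NAME")}


def missing_hosts(ganglia_xml, expected):
    """Return the expected hosts that the bastion has not heard from, sorted."""
    return sorted(set(expected) - reporting_hosts(ganglia_xml))
```

A deploy script could call missing_hosts(fetch_gmond_xml("bastion"), hosts)
a minute or two after the redeploy and restart gmond on (or at least flag)
whatever comes back; "bastion" and hosts are placeholders here.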

Also, some of the configuration values in gmond.conf are filled in when the
userdata is run.  I've edited /etc/init.d/ganglia-monitor so that it starts
up immediately after the userdata has run, just in case that matters.

Any ideas?

Sam
--
Comprehensive Server Monitoring with Site24x7.
Monitor 10 servers for $9/Month.
Get alerted through email, SMS, voice calls or mobile push notifications.
Take corrective actions from your mobile device.
http://pubads.g.doubleclick.net/gampad/clk?id=154624111iu=/4140/ostg.clktrk
___
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general


Re: [Ganglia-general] gmond occasionally doesn't connect up in unicast

2014-11-12 Thread Sam Barham
Until recently I wasn't controlling the start order of ec2-run-user-data
and ganglia-monitor, so they were starting at the same 'time'.  Yesterday I
fixed that, so that now ec2-run-user-data starts at S02 and ganglia-monitor
at S03.  I thought the issue might be exactly what you describe -
ganglia-monitor starting before ec2-run-user-data has finished altering the
gmond.conf, but the error still happened today.

Also, I suspect (but don't know for sure) that the gmond.conf will actually
be invalid before ec2-run-user-data has run - I've altered it to have flags
that get replaced with valid values.
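
To rule the start-order theory in or out, the init script could refuse to
start gmond while any flag is still unreplaced. This is only a sketch: the
@@NAME@@ placeholder syntax and the function name are assumptions made for
illustration, since the thread does not say what the flags look like.

```python
import re

# Hypothetical placeholder syntax; adjust to whatever the userdata replaces.
PLACEHOLDER = re.compile(r"@@[A-Z0-9_]+@@")


def unreplaced_placeholders(conf_text, pattern=PLACEHOLDER):
    """Return (line_number, placeholder) pairs still present in the config."""
    hits = []
    for lineno, line in enumerate(conf_text.splitlines(), start=1):
        for match in pattern.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

If the list is non-empty when ganglia-monitor is about to start, logging it
and exiting non-zero would leave evidence on the problem machine, which is
exactly what is missing at the moment.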

On Thu, Nov 13, 2014 at 12:20 PM, Joe Gracyk jgra...@marketlive.com wrote:

 Hi, Sam -

 We've got a similar deployment (EC2 instances unicasting to a per-AZ
 gmetad) that we're managing with Puppet, and I can't say we've seen
 anything like that.

 How are you automating your redeployments and gmond configurations? Could
 your gmond instances be starting up before their unicast configurations
 have been applied? If you had some sort of race condition where gmond could
 be installed and started, and only *then* get the conf file written, I'd
 expect gmond to merrily chug along, fruitlessly trying to multicast into
 the void.

 Good luck!



Re: [Ganglia-general] gmond occasionally doesn't connect up in unicast

2014-11-12 Thread Joe Gracyk
Hi, Sam -

We've got a similar deployment (EC2 instances unicasting to a per-AZ
gmetad) that we're managing with Puppet, and I can't say we've seen
anything like that.

How are you automating your redeployments and gmond configurations? Could
your gmond instances be starting up before their unicast configurations
have been applied? If you had some sort of race condition where gmond could
be installed and started, and only *then* get the conf file written, I'd
expect gmond to merrily chug along, fruitlessly trying to multicast into
the void.
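
One way to test for that race on a misbehaving node is to look at the
gmond.conf actually on disk and see whether its udp_send_channel blocks
point at a host (unicast) or still carry mcast_join (multicast). The sketch
below is a rough regex-based reader, not a full parser for the gmond.conf
format: it assumes flat udp_send_channel blocks, strips only /* */ and #
comments, and its function names are made up for illustration.

```python
import re

_BLOCK = re.compile(r"udp_send_channel\s*\{([^}]*)\}")
_SETTING = re.compile(r'(\w+)\s*=\s*("[^"]*"|\S+)')


def send_channels(conf_text):
    """Return one dict of settings per udp_send_channel block."""
    text = re.sub(r"/\*.*?\*/", "", conf_text, flags=re.DOTALL)  # C-style comments
    text = re.sub(r"#.*", "", text)                              # shell-style comments
    channels = []
    for block in _BLOCK.finditer(text):
        settings = {k: v.strip('"') for k, v in _SETTING.findall(block.group(1))}
        channels.append(settings)
    return channels


def is_unicast_only(conf_text):
    """True if there is at least one send channel and none of them is multicast."""
    channels = send_channels(conf_text)
    return bool(channels) and all(
        "host" in c and "mcast_join" not in c for c in channels
    )
```

This only shows what is in the file at the time of the check. A gmond that
read the file before it was rewritten would still be running with the old
settings, which would fit with a restart fixing things.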

Good luck!

-- 

Joe Gracyk | *DevOps Developer*
707-780-1848 | jgra...@marketlive.com
