Re: [Ganglia-developers] send_metadata_interval

2011-01-11 Thread Brad Nicholes
 On 1/10/2011 at 4:52 PM, in message
aanlktinbmlmnbcti3q-sjuocmp=+igaggo0trj3gf...@mail.gmail.com, Bernard Li
bern...@vanhpc.org wrote:
 Hi Brad:
 
 Thanks for your reply.
 
 On Mon, Jan 10, 2011 at 8:06 AM, Brad Nicholes bnicho...@novell.com wrote:
 
 The purpose of setting the send_metadata_interval to 0 by default was to 
 avoid unnecessary traffic for our default configuration of multicast.  
 Setting the directive to anything other than 0 will cause each gmond to start 
 sending all of its metric metadata on that interval.  If you are going to set 
 it by default, IMO 30 seconds is too low.  The problem is that people only 
 notice this in the first few minutes after restarting a gmond.  They expect 
 metrics to start showing up immediately.  After the gmond node finally does 
 send its metadata, rebroadcasting the metadata at any interval is just 
 consuming unnecessary bandwidth on the network.  Especially in a multicast 
 environment where it isn't needed at all.  Also consider that the more gmond 
 nodes you have, the more traffic you are going to put on the network, where 99% 
 of the time the extra traffic is totally unnecessary.
 
 I have a perhaps naive question.  It sounds like
 send_metadata_interval is only relevant to unicast configuration, so
 why is multicast affected as well?  How difficult a code change would it
 be to make the send_metadata_interval directive affect only unicast?
 

We could add code to gmond to always disable interval-based metadata resending, 
but that is essentially what the default value of 0 was already doing.


 Also multicast is the default configuration due to historic reasons
 but not because it is more common.  It is however easier to set up if
 your environment supports it.  Is it time for us to evaluate whether
 we should switch to unicast as the default?  And if so how?  What is
 the actual spread between unicast and multicast users?  If it turns
 out that the majority of our (new) users are using unicast, should we
 spend more time/effort making it easier for them to use Ganglia?
 

Actually I think this is a good idea.  In my experience, unicast seems to be 
more the norm rather than the exception now.  If we were to make unicast the 
default, then that would make the suggestion above more relevant.  We would 
probably want to put something in the code to automatically disable the 
metadata resend for multicast.
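
A sketch of what that automatic override might look like (purely illustrative
names and values, not gmond's actual code):

    #include <stdbool.h>
    #include <stdio.h>

    int main(void)
    {
        bool all_send_channels_multicast = true;  /* from inspecting the parsed
                                                     udp_send_channel blocks */
        int  send_metadata_interval      = 300;   /* value read from gmond.conf */

        /* the automatic override suggested above: multicast peers can request
           metadata on demand, so interval-based resending is never needed */
        if (all_send_channels_multicast)
            send_metadata_interval = 0;

        printf("effective send_metadata_interval = %d\n", send_metadata_interval);
        return 0;
    }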


 300 or 600 seconds is probably good enough for a default.  But no matter 
 what the default is, users still have to understand what that directive is 
 for and how to optimize it.  The value of send_metadata_interval will 
 probably be different for every installation when you take into consideration 
 the number of nodes, the number of metrics and any other network related 
 variables.
 
 A couple more ideas came out of a brief brainstorming session on IRC
 between Vladimir, Jesse and myself:
 
 1) Collector gmond should request metadata from all gmonds when it has
 been freshly (re)started

This already happens in multicast mode.  Whenever a gmond node receives a 
metric packet for which it has no metadata, it automatically sends out a 
request on the channel for metadata.  The end result is that all gmond nodes 
are constantly resyncing themselves until all nodes in a cluster have a 
complete metadata picture.  However, the same cannot be done for unicast 
because, by definition, there is no two-way communication.  In order to make 
the same functionality work for unicast, we would have to introduce a new 
listen port on every gmond that would accept commands and respond to them.  
Doing that opens up a security risk that would have to be dealt with 
correctly.
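
For what it's worth, here is a tiny self-contained C sketch of the multicast
resync behaviour described above (all names and data structures are
hypothetical illustrations, not gmond's actual code):

    /* when a value arrives for a metric we hold no metadata for,
       ask the channel for that metadata */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define MAX_KNOWN 128

    static char known_metrics[MAX_KNOWN][64];  /* metrics we have metadata for */
    static int  n_known;

    static bool have_metadata(const char *metric)
    {
        for (int i = 0; i < n_known; i++)
            if (strcmp(known_metrics[i], metric) == 0)
                return true;
        return false;
    }

    /* called for every metric value packet received on the channel */
    static void on_metric_value(const char *host, const char *metric)
    {
        if (!have_metadata(metric)) {
            /* only possible on a two-way (multicast) channel: a unicast-only
               sender has no listening socket on which to hear this request */
            printf("requesting metadata for %s from %s on the channel\n",
                   metric, host);
        }
    }

    int main(void)
    {
        strcpy(known_metrics[n_known++], "load_one");
        on_metric_value("node01", "load_one");  /* metadata known: no request */
        on_metric_value("node01", "cpu_user");  /* unknown: request goes out */
        return 0;
    }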

 2) Add a configuration check for gmond so upon starting, if
 configuration is unicast-based, and send_metadata_interval is 0, warn
 the user to set it to a sane number

This would be a good idea no matter what else we do.  

 3) Find a middle ground of default send_metadata_interval which does
 not hurt new users in HPC space wanting to use unicast
 
 2) and 3) are workarounds which could be implemented relatively
 quickly, 1) maybe not so much.

agreed

Brad



Re: [Ganglia-developers] send_metadata_interval

2011-01-11 Thread Vladimir Vuksan

On Mon, 10 Jan 2011 15:52:50 -0800, Bernard Li bern...@vanhpc.org wrote:

 I have a perhaps naive question.  It sounds like
 send_metadata_interval is only relevant to unicast configuration, so
 why is multicast affected as well?  How difficult a code change would it
 be to make the send_metadata_interval directive affect only unicast?
 
 Also multicast is the default configuration due to historic reasons
 but not because it is more common.  It is however easier to set up if
 your environment supports it.  Is it time for us to evaluate whether
 we should switch to unicast as the default?  And if so how?  What is
 the actual spread between unicast and multicast users?  If it turns
 out that the majority of our (new) users are using unicast, should we
 spend more time/effort making it easier for them to use Ganglia?
 
 300 or 600 seconds is probably good enough for a default.  But no matter
 what the default is, users still have to understand what that directive
 is for and how to optimize it.  The value of send_metadata_interval will
 probably be different for every installation when you take into
 consideration the number of nodes, the number of metrics and any other
 network related variables.
 
 A couple more ideas came out of a brief brainstorming session on IRC
 between Vladimir, Jesse and myself:
 
 1) Collector gmond should request metadata from all gmonds when it has
 been freshly (re)started
 2) Add a configuration check for gmond so upon starting, if
 configuration is unicast-based, and send_metadata_interval is 0, warn
 the user to set it to a sane number
 3) Find a middle ground of default send_metadata_interval which does
 not hurt new users in HPC space wanting to use unicast
 
 2) and 3) are workarounds which could be implemented relatively
 quickly, 1) maybe not so much.



I think send_metadata_interval would also be a problem if you set all your
agents to be deaf except the collector node(s). I have done just that for
security reasons.
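
For context, that layout looks roughly like this on the non-collector nodes
(a hedged sketch using the standard gmond.conf directives; values are
illustrative):

    /* leaf node: sends metrics but never listens */
    globals {
      deaf = yes
      mute = no
      /* a deaf sender can never hear a metadata request, so it has to
         resend metadata on an interval even on a multicast channel */
      send_metadata_interval = 30
    }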

Vladimir



Re: [Ganglia-developers] send_metadata_interval

2011-01-11 Thread Bernard Li
Hi Brad:

On Tue, Jan 11, 2011 at 7:33 AM, Brad Nicholes bnicho...@novell.com wrote:

 Actually I think this is a good idea.  In my experience, unicast seems to be 
 more the norm rather than the exception now.  If we were to make unicast the 
 default, then that would make the suggestion above more relevant.  We would 
 probably want to put something in the code to automatically disable the 
 metadata resend for multicast.

I'd like to clarify a few points.

Right now, with the default multicast setting, if the send_metadata_interval
directive is omitted, is it set to 0 and thus interval-based metadata
re-sending is suppressed?  If so, I would suggest the following:

1) Do NOT set send_metadata_interval in gmond.conf (we could add a
comment if so desired)
2) Add a check in the libconfuse parsing of gmond.conf -- if host and port
are specified (meaning unicast), send_metadata_interval must be > 0; if it
is not, a warning message is displayed and gmond is not started (see the
sketch after this list)
3) Perhaps move the send_metadata_interval directive from the global
section to each udp_send_channel section?
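
A minimal sketch of what suggestion 2 could look like (struct fields, names
and the message text are hypothetical; the real check would run against the
libconfuse cfg_t after gmond.conf is parsed):

    #include <stdio.h>
    #include <stdlib.h>

    /* hypothetical view of one parsed udp_send_channel block */
    struct send_channel { const char *host; int port; int is_multicast; };

    static void check_unicast_metadata(const struct send_channel *ch,
                                       int send_metadata_interval)
    {
        /* host + port with no multicast group means a unicast channel */
        if (ch->host && ch->port > 0 && !ch->is_multicast &&
            send_metadata_interval == 0) {
            fprintf(stderr,
                    "gmond: unicast udp_send_channel %s:%d is configured but "
                    "send_metadata_interval is 0; set it to a sane non-zero "
                    "value or restarted collectors will never relearn "
                    "metadata\n", ch->host, ch->port);
            exit(EXIT_FAILURE);
        }
    }

    int main(void)
    {
        struct send_channel ch = { "collector.example.com", 8649, 0 };
        check_unicast_metadata(&ch, 0);  /* warns and refuses to start */
        return 0;
    }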

My $0.02.

Thanks,

Bernard



Re: [Ganglia-developers] send_metadata_interval

2011-01-10 Thread Brad Nicholes
 On 1/7/2011 at 9:10 PM, in message
aanlktikfk_hy2v_zvkb_pra6vxmeqnv3nw3iokhxx...@mail.gmail.com, Jesse Becker
haw...@gmail.com wrote:
 On Fri, Jan 7, 2011 at 15:25, Bernard Li bern...@vanhpc.org wrote:
 Hi all:

 Since the release of Ganglia 3.1, we have introduced the new
 configuration option send_metadata_interval in gmond.conf.  This is
 set to 0 by default, and the user must set this to a sane number if
 using unicast; otherwise, if gmonds are restarted, hosts may appear to
 be offline (this is documented in the release notes).  A bug has
 already been filed:

 http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=242 

 We have recently had a lot of users hitting this issue, and Vladimir
 recommended that we just set a sane number as the default and be done
 with it, since we end up spending a lot of time on IRC and the mailing
 list solving the same problem over and over again.

 Since there have been some commits to the 3.1 branch since tagging
 3.1.7, I propose we just copy the 3.1.7 tag, update the
 send_metadata_interval in the configuration file and release that as 3.1.8.

 This is not the normal procedure for making a release, so I'd like to
 get some feedback from other developers.

 BTW I am thinking of setting send_metadata_interval to 30 seconds.
 Also, does anybody know if this setting affects multicast setups in
 any way?
 
 I think that it's fine to set this to a non-zero value, but I wonder
 if resending every 30 seconds is too frequent.  I did a quick check of the
 actual packets that are sent--specifically the metadata packets.  I haven't
 been able to really delve into the code to figure out exactly what's going
 on (this part of the code isn't terribly transparent to me), but I *think*
 that they are really large--on the order of several KB when fully
 assembled, as compared to less than 100-120 bytes for a typical metric
 packet.  I think that size will increase with the
 number of metrics stored, since each one must be described in full XML
 each time.
 
 The reason for the large size is that an entire XML description of the
 metrics appears to be sent each time.  Metadata packets also appear to
 go over TCP, not UDP.
 
 My testing was pretty simple:
 1) set up a gmond (from SVN, well after 3.1 came out) in unicast mode.
 2) set 'send_metadata_interval' to 1
 3) disable all modules, except for 'mod_core'
 4) remove all collection groups.
 5) start gmond, and run tcpdump.
 
 On a large cluster, with lots of metrics per host, I can see problems
 if the metadata packets are sent too frequently.  I have hosts that
 send well over 300 metrics (lots of CPU cores makes for lots of
 metrics...).  Each of these needs to be described in the metadata
 packets.
 
 So I think that setting a non-zero default is fine.  But I think that
 something like 300 or 600 seconds would be preferable.
 

The purpose of setting the send_metadata_interval to 0 by default was to avoid 
unnecessary traffic for our default configuration of multicast.  Setting the 
directive to anything other than 0 will cause each gmond to start sending all 
of its metric metadata on that interval.  If you are going to set it by 
default, IMO 30 seconds is too low.  The problem is that people only notice 
this in the first few minutes after restarting a gmond.  They expect metrics to 
start showing up immediately.  After the gmond node finally does send its 
metadata, rebroadcasting the metadata at any interval is just consuming 
unnecessary bandwidth on the network.  Especially in a multicast environment 
where it isn't needed at all.  Also consider that the more gmond nodes you have, 
the more traffic you are going to put on the network, where 99% of the time the 
extra traffic is totally unnecessary.
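
To make the contrast concrete, here is a hedged sketch of the two setups under
discussion (host name and interval value are only examples; the multicast
group shown is the one gmond.conf has traditionally shipped with):

    /* multicast (shipped default): peers can ask for metadata on demand,
       so no periodic resend is needed */
    globals {
      send_metadata_interval = 0
    }
    udp_send_channel {
      mcast_join = 239.2.11.71
      port = 8649
    }

    /* unicast: the collector cannot ask, so metadata has to be resent
       periodically */
    globals {
      send_metadata_interval = 300
    }
    udp_send_channel {
      host = collector.example.com
      port = 8649
    }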

300 or 600 seconds is probably good enough for a default.  But no matter what 
the default is, users still have to understand what that directive is for and 
how to optimize it.  The value of send_metadata_interval will probably be 
different for every installation when you take into consideration the number of 
nodes, the number of metrics and any other network related variables.

Brad




Re: [Ganglia-developers] send_metadata_interval

2011-01-10 Thread Bernard Li
Hi Brad:

Thanks for your reply.

On Mon, Jan 10, 2011 at 8:06 AM, Brad Nicholes bnicho...@novell.com wrote:

 The purpose of setting the send_metadata_interval to 0 by default was to 
 avoid unnecessary traffic for our default configuration of multicast.  
 Setting the directive to anything other than 0 will cause each gmond to start 
 sending all of its metric metadata on that interval.  If you are going to set 
 it by default, IMO 30 seconds is too low.  The problem is that people only 
 notice this in the first few minutes after restarting a gmond.  They expect 
 metrics to start showing up immediately.  After the gmond node finally does 
 send its metadata, rebroadcasting the metadata at any interval is just 
 consuming unnecessary bandwidth on the network.  Especially in a multicast 
 environment where it isn't needed at all.  Also consider that the more gmond 
 nodes you have, the more traffic you are going to put on the network, where 99% 
 of the time the extra traffic is totally unnecessary.

I have a perhaps naive question.  It sounds like
send_metadata_interval is only relevant to unicast configuration, so
why is multicast affected as well?  How difficult a code change would it be
to make the send_metadata_interval directive affect only unicast?

Also multicast is the default configuration due to historic reasons
but not because it is more common.  It is however easier to set up if
your environment supports it.  Is it time for us to evaluate whether
we should switch to unicast as the default?  And if so how?  What is
the actual spread between unicast and multicast users?  If it turns
out that the majority of our (new) users are using unicast, should we
spend more time/effort making it easier for them to use Ganglia?

 300 or 600 seconds is probably good enough for a default.  But no matter what 
 the default is, users still have to understand what that directive is for and 
 how to optimize it.  The value of send_metadata_interval will probably be 
 different for every installation when you take into consideration the number 
 of nodes, the number of metrics and any other network related variables.

A couple more ideas came out of a brief brainstorming session on IRC
between Vladimir, Jesse and myself:

1) Collector gmond should request metadata from all gmonds when it has
been freshly (re)started
2) Add a configuration check for gmond so upon starting, if
configuration is unicast-based, and send_metadata_interval is 0, warn
the user to set it to a sane number
3) Find a middle ground of default send_metadata_interval which does
not hurt new users in HPC space wanting to use unicast

2) and 3) are workarounds which could be implemented relatively
quickly, 1) maybe not so much.

Thanks,

Bernard



Re: [Ganglia-developers] send_metadata_interval

2011-01-08 Thread Vladimir Vuksan

On Fri, 7 Jan 2011 23:10:06 -0500, Jesse Becker haw...@gmail.com wrote:
 
 I think that it's fine to set this to a non-zero value, but I wonder
 if resending every 30 seconds is too frequent.  I did a quick check of the
 actual packets that are sent--specifically the metadata packets.  I haven't
 been able to really delve into the code to figure out exactly what's going
 on (this part of the code isn't terribly transparent to me), but I *think*
 that they are really large--on the order of several KB when fully
 assembled, as compared to less than 100-120 bytes for a typical metric
 packet.  I think that size will increase with the
 number of metrics stored, since each one must be described in full XML
 each time.


I think sending a couple of kilobytes every 30 seconds is not that bad. Even
if you have 1000 hosts and a 5 kB payload, we are talking about only 10 MB
every minute. With the speed of today's networks I'd consider that to be
noise.
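
Spelled out: 1000 hosts x 5 kB every 30 seconds is 5 MB per 30 seconds, i.e.
10 MB per minute, or roughly 1.3 Mbit/s averaged out (the 5 kB payload is
Jesse's rough estimate above, not a measured figure).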

 On a large cluster, with lots of metrics per host, I can see problems
 if the metadata packets are sent too frequently.  I have hosts that
 send well over 300 metrics (lots of CPU cores makes for lots of
 metrics...).  Each of these needs to be described in the metadata
 packets.
 
 So I think that setting a non-zero default is fine.  But I think that
 something like 300 or 600 seconds would be preferable.

I think we should shoot for a default that works best for most people. 300
or 600 seconds is too long since during those 300-600 seconds I'm flying
blind. This may not matter as much in HPC settings but it matters a lot to
web startups. Secondly most networks are not very big so the overhead will
be minimal.

In closing I'd say let's go with 30 seconds. We can add a comment above
the value that says something like:
  - If you are on a large network you may want to raise this value, as
    every host sends a metadata payload of a few kilobytes every interval.
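
In gmond.conf terms the proposal would look roughly like this (comment wording
and the value itself are only a suggestion):

    globals {
      /* If you are on a large network you may want to raise this value, as
         every host sends a metadata payload of a few kilobytes each interval. */
      send_metadata_interval = 30
    }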


Vladimir



Re: [Ganglia-developers] send_metadata_interval

2011-01-07 Thread Jesse Becker
On Fri, Jan 7, 2011 at 15:25, Bernard Li bern...@vanhpc.org wrote:
 Hi all:

 Since the release of Ganglia 3.1, we have introduced the new
 configuration option send_metadata_interval in gmond.conf.  This is
 set to 0 by default, and the user must set this to a sane number if
 using unicast; otherwise, if gmonds are restarted, hosts may appear to
 be offline (this is documented in the release notes).  A bug has
 already been filed:

 http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=242

 We have recently had a lot of users hitting this issue, and Vladimir
 recommended that we just set a sane number as the default and be done
 with it, since we end up spending a lot of time on IRC and the mailing
 list solving the same problem over and over again.

 Since there have been some commits to the 3.1 branch since tagging
 3.1.7, I propose we just copy the 3.1.7 tag, update the
 send_metadata_interval in the configuration file and release that as 3.1.8.

 This is not the normal procedure for making a release, so I'd like to
 get some feedback from other developers.

 BTW I am thinking of setting send_metadata_interval to 30 seconds.
 Also, does anybody know if this setting affects multicast setups in
 any way?

I think that it's fine to set this to a non-zero value, but I wonder
if resending every 30 seconds is too frequent.  I did a quick check of the
actual packets that are sent--specifically the metadata packets.  I haven't
been able to really delve into the code to figure out exactly what's going
on (this part of the code isn't terribly transparent to me), but I *think*
that they are really large--on the order of several KB when fully assembled,
as compared to less than 100-120 bytes for a typical metric packet.  I think
that size will increase with the
number of metrics stored, since each one must be described in full XML
each time.

The reason for the large size is that an entire XML description of the
metrics appears to be sent each time.  Metadata packets also appear to
go over TCP, not UDP.

My testing was pretty simple (a sketch of the resulting config follows the steps):
1) set up a gmond (from SVN, well after 3.1 came out) in unicast mode.
2) set 'send_metadata_interval' to 1
3) disable all modules, except for 'mod_core'
4) remove all collection groups.
5) start gmond, and run tcpdump.
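
Something like the following is what that stripped-down test config amounts to
(a sketch only; the stock file has many more sections and the exact core
module name may differ):

    globals {
      send_metadata_interval = 1   /* deliberately aggressive, test only */
    }
    udp_send_channel {
      host = 127.0.0.1
      port = 8649
    }
    udp_recv_channel {
      port = 8649
    }
    tcp_accept_channel {
      port = 8649
    }
    modules {
      module {
        name = "core_metrics"      /* the built-in mod_core metrics */
      }
    }
    /* all collection_group blocks removed */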

On a large cluster, with lots of metrics per host, I can see problems
if the metadata packets are sent too frequently.  I have hosts that
send well over 300 metrics (lots of CPU cores makes for lots of
metrics...).  Each of these needs to be described in the metadata
packets.

So I think that setting a non-zero default is fine.  But I think that
something like 300 or 600 seconds would be preferable.


-- 
Jesse Becker
