Re: [Ganglia-developers] send_metadata_interval
On 1/10/2011 at 4:52 PM, in message aanlktinbmlmnbcti3q-sjuocmp=+igaggo0trj3gf...@mail.gmail.com, Bernard Li <bern...@vanhpc.org> wrote:

> Hi Brad:
>
> Thanks for your reply.
>
> On Mon, Jan 10, 2011 at 8:06 AM, Brad Nicholes <bnicho...@novell.com> wrote:
>
>> The purpose of setting send_metadata_interval to 0 by default was to avoid unnecessary traffic in our default multicast configuration. Setting the directive to anything other than 0 will cause each gmond to start sending all of its metric metadata on that interval. If you are going to set it by default, IMO 30 seconds is too low. The problem is that people only notice this in the first few minutes after restarting a gmond; they expect metrics to start showing up immediately. After the gmond node finally does send its metadata, rebroadcasting it at any interval just consumes unnecessary bandwidth on the network, especially in a multicast environment where it isn't needed at all. Also consider that the more gmond nodes you have, the more traffic you put on the network, and 99% of the time that extra traffic is totally unnecessary.
>
> I have a perhaps naive question: it sounds like send_metadata_interval is only relevant to unicast configurations, so why is multicast affected as well? How difficult a code change would it be to make the send_metadata_interval directive affect unicast only?

We could add code to gmond to always disable interval-based metadata resending, but then that is what the default value of 0 was already doing.

> Also, multicast is the default configuration for historical reasons, not because it is more common. It is, however, easier to set up if your environment supports it. Is it time for us to evaluate whether we should switch to unicast as the default? And if so, how? What is the actual split between unicast and multicast users? If it turns out that the majority of our (new) users are using unicast, should we spend more time/effort making it easier for them to use Ganglia?

Actually, I think this is a good idea. In my experience, unicast now seems to be the norm rather than the exception. If we were to make unicast the default, that would make the suggestion above more relevant. We would probably want to put something in the code to automatically disable the metadata resend for multicast.

>> 300 or 600 seconds is probably good enough for a default. But no matter what the default is, users still have to understand what the directive is for and how to optimize it. The value of send_metadata_interval will probably be different for every installation once you take into consideration the number of nodes, the number of metrics and any other network-related variables.
>
> A couple more ideas came out of a brief brainstorming session on IRC between Vladimir, Jesse and myself:
>
> 1) Collector gmond should request metadata from all gmonds when it has been freshly (re)started

This already happens in multicast mode. Whenever a gmond node receives a metric packet for which it has no metadata, it automatically sends out a request on the channel for that metadata. The end result is that all gmond nodes are constantly resyncing themselves until every node in a cluster has a complete metadata picture. However, the same cannot be done for unicast because, by definition, there is no two-way communication. To make the same functionality work for unicast, we would have to introduce a new listen port on every gmond that would accept commands and respond to them. Doing that opens up a security risk that would have to be dealt with correctly.

> 2) Add a configuration check to gmond so that upon starting, if the configuration is unicast-based and send_metadata_interval is 0, it warns the user to set it to a sane number

This would be a good idea no matter what else we do.

> 3) Find a middle ground for the default send_metadata_interval which does not hurt new users in the HPC space wanting to use unicast
>
> 2) and 3) are workarounds which could be implemented relatively quickly; 1) maybe not so much.

Agreed.

Brad

___
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
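To make the multicast resync behaviour Brad describes concrete, here is a minimal sketch in Python. This is purely illustrative: the real gmond is written in C, and the `Node`/`Channel` names, dictionary-shaped metadata, and synchronous broadcasts are all invented for this example. The core idea is just: a node that receives a metric it has no metadata for broadcasts a request, and any node that knows that metric answers on the channel.

```python
# Hypothetical sketch of the multicast metadata-resync behaviour:
# an unknown metric triggers a metadata request on the channel, and
# any node holding that metadata broadcasts it back. Not gmond code.

class Node:
    def __init__(self, name):
        self.name = name
        self.metadata = {}   # metric name -> metadata dict
        self.channel = None  # set when the node joins a channel

    def on_metric_packet(self, metric_name, value):
        # Core of the resync logic: unknown metric -> ask the channel.
        if metric_name not in self.metadata:
            self.channel.broadcast_metadata_request(metric_name)

    def on_metadata_request(self, metric_name):
        # Any node that owns the metric answers with its metadata.
        if metric_name in self.metadata:
            self.channel.broadcast_metadata(metric_name,
                                            self.metadata[metric_name])

    def on_metadata(self, metric_name, meta):
        self.metadata[metric_name] = meta


class Channel:
    """A multicast channel: every broadcast reaches every node."""
    def __init__(self, nodes):
        self.nodes = nodes
        for n in nodes:
            n.channel = self

    def broadcast_metric(self, metric_name, value):
        for n in self.nodes:
            n.on_metric_packet(metric_name, value)

    def broadcast_metadata_request(self, metric_name):
        for n in self.nodes:
            n.on_metadata_request(metric_name)

    def broadcast_metadata(self, metric_name, meta):
        for n in self.nodes:
            n.on_metadata(metric_name, meta)


a, b = Node("a"), Node("b")
chan = Channel([a, b])
a.metadata["load_one"] = {"type": "float", "units": ""}
# b sees a metric it has no metadata for; request/response resyncs it.
chan.broadcast_metric("load_one", 0.25)
print("load_one" in b.metadata)   # True after one request/response round
```

As Brad notes, this only works because multicast gives every node a channel it can both send on and hear replies on; a unicast sender has no equivalent return path without opening a new listening port.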
Re: [Ganglia-developers] send_metadata_interval
On Mon, 10 Jan 2011 15:52:50 -0800, Bernard Li <bern...@vanhpc.org> wrote:

> I have a perhaps naive question: it sounds like send_metadata_interval is only relevant to unicast configurations, so why is multicast affected as well? How difficult a code change would it be to make the send_metadata_interval directive affect unicast only?
>
> Also, multicast is the default configuration for historical reasons, not because it is more common. It is, however, easier to set up if your environment supports it. Is it time for us to evaluate whether we should switch to unicast as the default? And if so, how? What is the actual split between unicast and multicast users? If it turns out that the majority of our (new) users are using unicast, should we spend more time/effort making it easier for them to use Ganglia?
>
>> 300 or 600 seconds is probably good enough for a default. But no matter what the default is, users still have to understand what the directive is for and how to optimize it. The value of send_metadata_interval will probably be different for every installation once you take into consideration the number of nodes, the number of metrics and any other network-related variables.
>
> A couple more ideas came out of a brief brainstorming session on IRC between Vladimir, Jesse and myself:
>
> 1) Collector gmond should request metadata from all gmonds when it has been freshly (re)started
> 2) Add a configuration check to gmond so that upon starting, if the configuration is unicast-based and send_metadata_interval is 0, it warns the user to set it to a sane number
> 3) Find a middle ground for the default send_metadata_interval which does not hurt new users in the HPC space wanting to use unicast
>
> 2) and 3) are workarounds which could be implemented relatively quickly; 1) maybe not so much.

I think send_metadata_interval would also be a problem if you set all your agents to be deaf except the collector node(s). I have done just that for security reasons.

Vladimir
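A deaf-agents topology like the one Vladimir describes is normally expressed through the `deaf` and `mute` globals in gmond.conf. The fragment below is a sketch only; the collector host name and the choice to mute the collector are placeholders/assumptions, not a recommendation from this thread.

```
/* Leaf-node gmond: send metrics, but do not listen or aggregate. */
globals {
  deaf = yes   /* ignore packets from other nodes */
  mute = no    /* still send our own metrics */
}

udp_send_channel {
  host = collector.example.com   /* placeholder collector address */
  port = 8649
}

/* Collector gmond: listen and aggregate for the cluster. */
globals {
  deaf = no
  mute = yes   /* optionally skip sending its own metrics */
}

udp_recv_channel {
  port = 8649
}

tcp_accept_channel {
  port = 8649  /* gmetad polls the aggregated state here */
}
```

Because the leaf nodes are deaf, they can never hear a metadata request, which is why a nonzero send_metadata_interval matters in this topology even though the transport may be multicast-capable.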
Re: [Ganglia-developers] send_metadata_interval
Hi Brad:

On Tue, Jan 11, 2011 at 7:33 AM, Brad Nicholes <bnicho...@novell.com> wrote:

> Actually, I think this is a good idea. In my experience, unicast now seems to be the norm rather than the exception. If we were to make unicast the default, that would make the suggestion above more relevant. We would probably want to put something in the code to automatically disable the metadata resend for multicast.

I'd like to clarify a few points. Right now, with the default multicast setting, if the send_metadata_interval directive is omitted, is it set to 0 and thus interval-based metadata resending is suppressed? If so, I would suggest the following:

1) Do NOT set send_metadata_interval in gmond.conf (we could add a comment if so desired)
2) Add a check to the libconfuse parsing of gmond.conf -- if host and port are specified (meaning unicast), send_metadata_interval must be > 0; if not, a warning message is displayed and gmond is not started
3) Perhaps move the send_metadata_interval directive from the global section to each udp_send_channel section?

My $0.02.

Thanks,

Bernard
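Suggestion 3) would look something like the fragment below. Note that this per-channel placement of send_metadata_interval is only Bernard's proposal: the syntax shown does not exist in released gmond and is purely illustrative, with placeholder addresses.

```
/* Hypothetical gmond.conf under proposal 3): send_metadata_interval
   moves into each udp_send_channel, so a unicast channel can resend
   metadata periodically while a multicast channel does not. */

udp_send_channel {
  host = collector.example.com   /* unicast: needs periodic metadata */
  port = 8649
  send_metadata_interval = 60    /* proposed per-channel placement */
}

udp_send_channel {
  mcast_join = 239.2.11.71       /* multicast: metadata sent on request */
  port = 8649
  /* no send_metadata_interval needed on this channel */
}
```

The appeal of the per-channel form is that a gmond sending to both a multicast group and a unicast collector could resend metadata only where it is actually needed.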
Re: [Ganglia-developers] send_metadata_interval
On 1/7/2011 at 9:10 PM, in message aanlktikfk_hy2v_zvkb_pra6vxmeqnv3nw3iokhxx...@mail.gmail.com, Jesse Becker <haw...@gmail.com> wrote:

> On Fri, Jan 7, 2011 at 15:25, Bernard Li <bern...@vanhpc.org> wrote:
>
>> Hi all:
>>
>> Since the release of Ganglia 3.1, we have introduced the new configuration option send_metadata_interval in gmond.conf. It is set to 0 by default, and the user must set it to a sane number when using unicast; otherwise, if gmonds are restarted, hosts may appear to be offline (this is documented in the release notes). A bug has already been filed: http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=242
>>
>> Recently a lot of users have been hitting this issue, and Vladimir recommended that we just set a sane number as the default and be done with it, since we end up spending a lot of time on IRC and the mailing list solving the same problem over and over again. Since there have been some commits to the 3.1 branch since tagging 3.1.7, I propose we just copy the 3.1.7 tag, update send_metadata_interval in the configuration file, and release that as 3.1.8. This is not the normal procedure for making a release, so I'd like to get some feedback from other developers. BTW, I am thinking of setting send_metadata_interval to 30 seconds. Also, does anybody know if this setting affects multicast setups in any way?
>
> I think it's fine to set this to a non-zero value, but I wonder if 30 seconds is too high. I did a quick check on the actual packets that are sent -- specifically the metadata packets. I haven't been able to really delve into the code to figure out exactly what's going on (this part of the code isn't terribly transparent to me), but I *think* they are really large -- on the order of several KB when fully assembled, compared to less than 100-120 bytes for a typical metric packet. I think that size will increase with the number of metrics stored, since each one must be described in full XML each time.
>
> The reason for the large size is that an entire XML description of the metrics appears to be sent each time. Metadata packets also appear to go over TCP, not UDP.
>
> My testing was pretty simple:
>
> 1) set up a gmond (from SVN, well after 3.1 came out) in unicast mode
> 2) set 'send_metadata_interval' to 1
> 3) disable all modules except 'mod_core'
> 4) remove all collection groups
> 5) start gmond, and run tcpdump
>
> On a large cluster with lots of metrics per host, I can see problems if the metadata packets are sent too frequently. I have hosts that send well over 300 metrics (lots of CPU cores makes for lots of metrics...). Each of these needs to be described in the metadata packets. So I think that setting a non-zero default is fine, but something like 300 or 600 seconds would be preferable.

The purpose of setting send_metadata_interval to 0 by default was to avoid unnecessary traffic in our default multicast configuration. Setting the directive to anything other than 0 will cause each gmond to start sending all of its metric metadata on that interval. If you are going to set it by default, IMO 30 seconds is too low. The problem is that people only notice this in the first few minutes after restarting a gmond; they expect metrics to start showing up immediately. After the gmond node finally does send its metadata, rebroadcasting it at any interval just consumes unnecessary bandwidth on the network, especially in a multicast environment where it isn't needed at all. Also consider that the more gmond nodes you have, the more traffic you put on the network, and 99% of the time that extra traffic is totally unnecessary.

300 or 600 seconds is probably good enough for a default. But no matter what the default is, users still have to understand what the directive is for and how to optimize it. The value of send_metadata_interval will probably be different for every installation once you take into consideration the number of nodes, the number of metrics and any other network-related variables.

Brad
Re: [Ganglia-developers] send_metadata_interval
Hi Brad:

Thanks for your reply.

On Mon, Jan 10, 2011 at 8:06 AM, Brad Nicholes <bnicho...@novell.com> wrote:

> The purpose of setting send_metadata_interval to 0 by default was to avoid unnecessary traffic in our default multicast configuration. Setting the directive to anything other than 0 will cause each gmond to start sending all of its metric metadata on that interval. If you are going to set it by default, IMO 30 seconds is too low. The problem is that people only notice this in the first few minutes after restarting a gmond; they expect metrics to start showing up immediately. After the gmond node finally does send its metadata, rebroadcasting it at any interval just consumes unnecessary bandwidth on the network, especially in a multicast environment where it isn't needed at all. Also consider that the more gmond nodes you have, the more traffic you put on the network, and 99% of the time that extra traffic is totally unnecessary.

I have a perhaps naive question: it sounds like send_metadata_interval is only relevant to unicast configurations, so why is multicast affected as well? How difficult a code change would it be to make the send_metadata_interval directive affect unicast only?

Also, multicast is the default configuration for historical reasons, not because it is more common. It is, however, easier to set up if your environment supports it. Is it time for us to evaluate whether we should switch to unicast as the default? And if so, how? What is the actual split between unicast and multicast users? If it turns out that the majority of our (new) users are using unicast, should we spend more time/effort making it easier for them to use Ganglia?

> 300 or 600 seconds is probably good enough for a default. But no matter what the default is, users still have to understand what the directive is for and how to optimize it. The value of send_metadata_interval will probably be different for every installation once you take into consideration the number of nodes, the number of metrics and any other network-related variables.

A couple more ideas came out of a brief brainstorming session on IRC between Vladimir, Jesse and myself:

1) Collector gmond should request metadata from all gmonds when it has been freshly (re)started
2) Add a configuration check to gmond so that upon starting, if the configuration is unicast-based and send_metadata_interval is 0, it warns the user to set it to a sane number
3) Find a middle ground for the default send_metadata_interval which does not hurt new users in the HPC space wanting to use unicast

2) and 3) are workarounds which could be implemented relatively quickly; 1) maybe not so much.

Thanks,

Bernard
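Idea 2) -- warning at startup when a unicast configuration leaves send_metadata_interval at 0 -- can be sketched as follows. This is a hypothetical Python illustration only: gmond itself is C code using libconfuse, and the function name and config-dict shape here are invented for the example.

```python
# Hypothetical sketch of idea 2): refuse to start (or at least warn)
# when a unicast send channel is configured but send_metadata_interval
# is 0. The config representation is invented for illustration.
import sys

def check_send_metadata_interval(conf):
    # A channel with explicit host+port is treated as unicast.
    unicast = any("host" in ch and "port" in ch
                  for ch in conf.get("udp_send_channels", []))
    if unicast and conf.get("send_metadata_interval", 0) == 0:
        print("WARNING: unicast send channel configured but "
              "send_metadata_interval is 0; after a collector restart, "
              "hosts will appear offline until metadata is resent.",
              file=sys.stderr)
        return False
    return True

bad_conf = {"send_metadata_interval": 0,
            "udp_send_channels": [{"host": "10.0.0.1", "port": 8649}]}
print(check_send_metadata_interval(bad_conf))   # False: needs fixing
```

In gmond the equivalent check would run right after libconfuse parses gmond.conf, before any channels are opened.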
Re: [Ganglia-developers] send_metadata_interval
On Fri, 7 Jan 2011 23:10:06 -0500, Jesse Becker <haw...@gmail.com> wrote:

> I think it's fine to set this to a non-zero value, but I wonder if 30 seconds is too high. I did a quick check on the actual packets that are sent -- specifically the metadata packets. I haven't been able to really delve into the code to figure out exactly what's going on (this part of the code isn't terribly transparent to me), but I *think* they are really large -- on the order of several KB when fully assembled, compared to less than 100-120 bytes for a typical metric packet. I think that size will increase with the number of metrics stored, since each one must be described in full XML each time.

I think sending a couple of kilobytes every 30 seconds is not that bad. Even if you have 1000 hosts and a 5 kB payload, we are talking about only 10 MB every minute. With the speed of today's networks I'd consider that noise.

> On a large cluster with lots of metrics per host, I can see problems if the metadata packets are sent too frequently. I have hosts that send well over 300 metrics (lots of CPU cores makes for lots of metrics...). Each of these needs to be described in the metadata packets. So I think that setting a non-zero default is fine, but something like 300 or 600 seconds would be preferable.

I think we should shoot for a default that works best for most people. 300 or 600 seconds is too long, since during those 300-600 seconds I'm flying blind. That may not matter as much in HPC settings, but it matters a lot to web startups. Secondly, most networks are not very big, so the overhead will be minimal. In closing, I'd say let's go with 30 seconds. We can add a comment above the value saying something like: if you are on a large network, consider raising the value, since every host sends a metadata payload of a few kilobytes every interval.

Vladimir
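Vladimir's estimate is easy to verify as a back-of-the-envelope calculation, taking his assumed figures (1000 hosts, a 5 kB metadata payload, a 30-second interval) at face value:

```python
# Back-of-the-envelope check of Vladimir's numbers: 1000 hosts, each
# resending a ~5 kB metadata payload every 30 seconds.
hosts = 1000
payload_kb = 5        # assumed metadata payload per host, in kB
interval_s = 30       # send_metadata_interval

sends_per_minute = 60 / interval_s              # 2 sends per host per minute
kb_per_minute = hosts * payload_kb * sends_per_minute

print(kb_per_minute)          # 10000.0 kB/min
print(kb_per_minute / 1000)   # 10.0 -> about 10 MB per minute, as he says
```

The same arithmetic also shows why the value is installation-specific: double the hosts or the per-host metric count and the metadata traffic doubles with it.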
Re: [Ganglia-developers] send_metadata_interval
On Fri, Jan 7, 2011 at 15:25, Bernard Li <bern...@vanhpc.org> wrote:

> Hi all:
>
> Since the release of Ganglia 3.1, we have introduced the new configuration option send_metadata_interval in gmond.conf. It is set to 0 by default, and the user must set it to a sane number when using unicast; otherwise, if gmonds are restarted, hosts may appear to be offline (this is documented in the release notes). A bug has already been filed: http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=242
>
> Recently a lot of users have been hitting this issue, and Vladimir recommended that we just set a sane number as the default and be done with it, since we end up spending a lot of time on IRC and the mailing list solving the same problem over and over again. Since there have been some commits to the 3.1 branch since tagging 3.1.7, I propose we just copy the 3.1.7 tag, update send_metadata_interval in the configuration file, and release that as 3.1.8. This is not the normal procedure for making a release, so I'd like to get some feedback from other developers. BTW, I am thinking of setting send_metadata_interval to 30 seconds. Also, does anybody know if this setting affects multicast setups in any way?

I think it's fine to set this to a non-zero value, but I wonder if 30 seconds is too high. I did a quick check on the actual packets that are sent -- specifically the metadata packets. I haven't been able to really delve into the code to figure out exactly what's going on (this part of the code isn't terribly transparent to me), but I *think* they are really large -- on the order of several KB when fully assembled, compared to less than 100-120 bytes for a typical metric packet. I think that size will increase with the number of metrics stored, since each one must be described in full XML each time.

The reason for the large size is that an entire XML description of the metrics appears to be sent each time. Metadata packets also appear to go over TCP, not UDP.

My testing was pretty simple:

1) set up a gmond (from SVN, well after 3.1 came out) in unicast mode
2) set 'send_metadata_interval' to 1
3) disable all modules except 'mod_core'
4) remove all collection groups
5) start gmond, and run tcpdump

On a large cluster with lots of metrics per host, I can see problems if the metadata packets are sent too frequently. I have hosts that send well over 300 metrics (lots of CPU cores makes for lots of metrics...). Each of these needs to be described in the metadata packets. So I think that setting a non-zero default is fine, but something like 300 or 600 seconds would be preferable.

--
Jesse Becker
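Jesse's five-step test setup corresponds roughly to a stripped-down gmond.conf like the sketch below. This is an approximation, not his actual file: the unicast target and port are placeholders, and the tcpdump invocation is just one reasonable way to watch the traffic.

```
/* Sketch of a minimal gmond.conf approximating Jesse's test:
   unicast send channel, metadata resent every second, mod_core only,
   no collection groups. Host and port are placeholders. */

globals {
  send_metadata_interval = 1   /* step 2: resend metadata every second */
}

udp_send_channel {
  host = 127.0.0.1   /* step 1: unicast target (placeholder) */
  port = 8649
}

/* Steps 3-4: remove all module and collection_group blocks except
   mod_core. Step 5: start gmond and watch the wire, e.g.:
     tcpdump -n -X port 8649
*/
```

With metadata going out every second and almost no metric traffic, the capture isolates the metadata packets, which is what makes their relative size easy to observe.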