[Ganglia-general] Application Monitoring with Ganglia
Hi,

I am afraid I know the answer, but just to be sure...

I am monitoring a bunch of Linux servers, each running one or more JVMs, with Ganglia. To get some insight into the resource usage of the JVMs we use "jmxtrans" to retrieve the metrics and spoof them to a Ganglia aggregator. This works fine with one JVM, but gives trouble with two or more. The reason: the metrics have the same names.

So I had the idea of grouping the metrics of each JVM into separate metric groups JVM1, JVM2, JVM3 ... The problem is that this still does not seem to work. What I want is:

HostX
-JVM1
--Metric1
--Metric2
--Metric3
-JVM2
--Metric1
--Metric2
--Metric3
-JVM3
--Metric1
--Metric2
--Metric3

That is nine metrics in three groups. But I only see three metrics, and they "jump" from group to group. It works fine if I make the metric names unique. So it seems there is a uniqueness requirement at the metric level. It would be really nice if that requirement could be restricted to the group level. Any chance?

Thanks
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de

Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
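The behaviour described above (three visible metrics instead of nine) is what one would expect if metrics are keyed by host and metric name only. The following is a minimal Python sketch of that effect and of the workaround of making names unique. The naming scheme "JVM1.Metric1" and both helper functions are hypothetical illustrations; they are not part of Ganglia or jmxtrans.

```python
def keyed_by_name_only(jvms, metrics):
    """Simulate a store keyed by metric name: later JVMs overwrite earlier ones."""
    store = {}
    for jvm in jvms:
        for metric in metrics:
            store[metric] = jvm  # same key for every JVM, so the group "jumps"
    return store


def keyed_by_unique_name(jvms, metrics):
    """Workaround: prefix each metric name with a (hypothetical) JVM identifier."""
    store = {}
    for jvm in jvms:
        for metric in metrics:
            store[f"{jvm}.{metric}"] = jvm  # e.g. "JVM2.Metric3" belongs to group JVM2
    return store


jvms = ["JVM1", "JVM2", "JVM3"]
metrics = ["Metric1", "Metric2", "Metric3"]

colliding = keyed_by_name_only(jvms, metrics)
unique = keyed_by_unique_name(jvms, metrics)
print(len(colliding), len(unique))
```

With unique names all nine metrics survive, and the group can still be used purely for display.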
Re: [Ganglia-general] High SystemCPU usage, low UserCPU usage
Dear Khan,

as Vladimir said, "System CPU" is time spent in the kernel on I/O, interrupts and memory management.

Just out of curiosity: which Linux distribution are you (is your customer) running, which kernel version, and what is the uptime?

I ask because I recently faced a similar issue on servers running SLES11/SP2 (kernel 3.0.58-0.6.2-default). Those were used for Tomcat (Java) processes, not HPC. They started to really max out all CPUs at 100%, with 75% solid "red". But that happened only after some days of uptime.

It turned out that in our situation, turning the half-baked (at least in that kernel) "Transparent Huge Pages" feature off (or to voluntary mode) solved the problem:

    # echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
    # echo madvise > /sys/kernel/mm/transparent_hugepage/defrag
    # cat /sys/kernel/mm/transparent_hugepage/{enabled,defrag}
    always [madvise] never
    always [madvise] never

Doing that is pretty much without risk and can be done/reverted at any time. It may cost a bit of performance on systems with lots of memory, but I personally think Transparent Huge Pages are overrated for general usage.

As I said, I am not sure it applies to your situation, but it comes from a real-world, high-throughput environment.

Cheers
Martin

On Tue, Oct 13, 2015 at 7:49 PM, Kamran Khan kam...@pssclabs.com wrote:
> Hi All,
>
> This isn't a problem with Ganglia, but I was hoping I might get a little
> advice on what I am seeing. I have a customer who is running ls-dyna
> applications, and he is noticing something odd. He is noticing his jobs
> being bogged down and not running at their full capacity. He looked at the
> Ganglia web interface and saw that "System CPU" was at 100%, while "User
> CPU" was at like 20%. What processes does the "System CPU" refer to? What
> tools can I use to track what might be pushing the "System CPU" to 100%?
> There are times when the "User CPU" goes up to 100%, which is what he
> wants, but then at times it spikes down to 20% ish and the "System CPU"
> stays up around 100%.
>
> Any advice is greatly appreciated. If you need me to send output, I
> certainly can. Just let me know what to run.
>
> Please let me know.
>
> Thanks.
> --
> Kamran Khan
> PSSC Labs
> HPC Software / Technical Engineer
Re: [Ganglia-general] Monitoring CTX switches and memory fragmentation
Hi Vladimir,

is the CTX stuff already in a released version? I may need to tell the end customer to upgrade.

Cheers
Martin

On Tue, May 5, 2015 at 4:12 PM, Vladimir Vuksan vli...@veus.hr wrote:
> I have written one for memory fragmentation. You can find it here:
> https://github.com/ganglia/gmond_python_modules/tree/master/system/mem_fragmentation
>
> Context stuff is now in the monitor-core master:
> https://github.com/ganglia/monitor-core/blob/master/gmond/python_modules/cpu/cpu_stats.py
>
> Vladimir
[Ganglia-general] Monitoring CTX switches and memory fragmentation
Hi friends,

short question: does Ganglia provide monitoring agents for context switches and memory fragmentation (e.g. listing the contents of /proc/buddyinfo)? I want to avoid duplicate work if they already exist officially.

Cheers
Martin
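For readers unfamiliar with /proc/buddyinfo: each line lists, per NUMA node and memory zone, the number of free blocks of order 0, 1, 2, ... (a block of order n spans 2^n pages). The Python sketch below shows one way such a listing could be turned into numbers for a monitoring agent. It is not the official Ganglia module; the function names and the sample line are made up for illustration.

```python
def parse_buddyinfo(text):
    """Return {(node, zone): [free block count per order]} from buddyinfo text."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: Node 0, zone Normal c0 c1 c2 ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = parts[1].rstrip(",")
        zone = parts[3]
        zones[(node, zone)] = [int(count) for count in parts[4:]]
    return zones


def free_pages(counts):
    """Total free pages: a block of order n contains 2**n pages."""
    return sum(count << order for order, count in enumerate(counts))


# Made-up sample line, not taken from a real system.
sample = "Node 0, zone   Normal      4      3      2      1      0      0      0      0      0      0      0"
zones = parse_buddyinfo(sample)
print(free_pages(zones[("0", "Normal")]))
```

A zone with many free pages but only low-order blocks is fragmented: large contiguous allocations can fail even though memory is nominally free.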
[Ganglia-general] Combining metrics from several RRD files
Hi friends,

I hope somebody has already had this problem and solved it. I have a cluster where we monitor the status (size, used, free) of several filesystems using Ganglia. It all looks great in the browser, but now the customer wants those data sets combined into one. In order not to lose the data we have, I want to combine them into one RRD. All the source RRDs have identical structure (RRAs) and timestamps.

Any solution? Ideas?

Cheers
Martin
Re: [Ganglia-general] Combining metrics from several RRD files
Hi Arnau,

not completely :-) I actually want to extract the data from the RRD files and combine them into one, adding up the values. Good thing I found out about rrdtool xport. It does what I want for the extraction. Now I just need to do the summing up.

Cheers
Martin

On Fri, Jan 31, 2014 at 11:21 AM, Arnau Bria listsar...@gmail.com wrote:
> Hi,
>
> If I've understood you properly:
>
> 1.-) use the Aggregate Graphs from ganglia's web.
> 2.-) create a custom graph and add it to one host. Quick google search:
>      http://sourceforge.net/mailarchive/forum.php?thread_name=503E2A47.6020705%40gmail.comforum_name=ganglia-general
> 3.-) as they are RRDs you can mix them using your own script (bash, perl, python)
>
> HTH,
> Arnau
Re: [Ganglia-general] Combining metrics from several RRD files
Hi Vladimir,

thanks. That is also an option. What I came up with is the following RRD magic. It combines three metrics from four filesystems, removes the NaNs and computes the fraction used accurately:

    rrdtool xport --start now-366d --end now-1d \
        DEF:t000=vault_000_total.rrd:sum:AVERAGE \
        DEF:t001=vault_001_total.rrd:sum:AVERAGE \
        DEF:t002=vault_002_total.rrd:sum:AVERAGE \
        DEF:t003=vault_003_total.rrd:sum:AVERAGE \
        CDEF:total=t000,t001,ADDNAN,t002,ADDNAN,t003,ADDNAN,1.09951E+12,/ \
        DEF:u000=vault_000_used.rrd:sum:AVERAGE \
        DEF:u001=vault_001_used.rrd:sum:AVERAGE \
        DEF:u002=vault_002_used.rrd:sum:AVERAGE \
        DEF:u003=vault_003_used.rrd:sum:AVERAGE \
        CDEF:used=u000,u001,ADDNAN,u002,ADDNAN,u003,ADDNAN,1.09951E+12,/ \
        DEF:a000=vault_000_avail.rrd:sum:AVERAGE \
        DEF:a001=vault_001_avail.rrd:sum:AVERAGE \
        DEF:a002=vault_002_avail.rrd:sum:AVERAGE \
        DEF:a003=vault_003_avail.rrd:sum:AVERAGE \
        CDEF:avail=a000,a001,ADDNAN,a002,ADDNAN,a003,ADDNAN,1.09951E+12,/ \
        CDEF:pctc=total,avail,-,total,/ \
        XPORT:total:"Total (TB)" \
        XPORT:used:"Used (TB)" \
        XPORT:avail:"Avail (TB)" \
        XPORT:pctc:"PCT used (%)"

RRDTOOL is cool :-)

Cheers
Martin

On Fri, Jan 31, 2014 at 3:26 PM, Vladimir Vuksan vli...@veus.hr wrote:
> Another alternative is to use the CSV or JSON export from the web UI, e.g.
> http://blog.vuksan.com/2012/04/06/
> It will also export all values from aggregate graphs, so you can do the summing.
>
> Vladimir
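The arithmetic behind the CDEF lines can be sketched in Python. ADDNAN is rrdtool's NaN-safe addition: a NaN operand is treated as zero, and the result is NaN only when both operands are NaN. The constant 1.09951E+12 is 2^40, so the sums are converted from bytes to tebibytes. Note that the pctc expression as posted yields a fraction between 0 and 1; the sketch multiplies by 100 to obtain a true percentage. The function names and sample values are illustrative assumptions.

```python
import math

TIB = 2 ** 40  # 1099511627776, written as 1.09951E+12 in the CDEFs


def addnan(a, b):
    """NaN-safe addition modelled on rrdtool's ADDNAN operator."""
    if math.isnan(a) and math.isnan(b):
        return math.nan
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return a + b


def combined_tib(values_in_bytes):
    """Sum one sample from several filesystems and convert bytes to TiB."""
    total = values_in_bytes[0]
    for value in values_in_bytes[1:]:
        total = addnan(total, value)
    return total / TIB


def percent_used(total, avail):
    """(total - avail) / total, scaled to a percentage."""
    return (total - avail) / total * 100.0


# Made-up sample: four filesystems, one of them with a missing (NaN) value.
total = combined_tib([2 * TIB, math.nan, TIB, TIB])
avail = combined_tib([TIB / 2, math.nan, TIB / 4, TIB / 4])
print(total, avail, percent_used(total, avail))
```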
Re: [Ganglia-general] Java/JMX plugin for Ganglia 3.1.x
Hi Daniel,

JMXetric is one of the options I am considering. The other is JMXtrans. Both now use gmetric4j.

- JMXetric has the advantage that I can instrument Tomcat directly and send to the local gmond, without any spoofing. The disadvantage is that it changes the application and needs a lot of testing before production use.
- JMXtrans has the advantage that it is external to the application. The beauty is that one *could* have a central JMX aggregator which would spoof the data to the aggregating gmonds. Unfortunately there seems to be a problem with the combination of spoofing, gmetric4j and the 3.1 wire format. It seems this is just not supported. Alternatively one could of course run local JMXtrans instances on every Tomcat host. Not that nice ...

This brings me back to my question on the developers list: what is the story of gmetric4j vs. spoofing?

Cheers
Martin

From: Daniel Pocock dan...@pocock.com.au
To: ganglia-general@lists.sourceforge.net
Sent: Sunday, September 16, 2012 8:51 PM
Subject: Re: [Ganglia-general] Java/JMX plugin for Ganglia 3.1.x

> Have you looked at JMXetric?
>
> The latest code is in the main community github now:
> https://github.com/ganglia/jmxetric
>
> It originated here: http://code.google.com/p/jmxetric/
> but I have recently split the JMX stuff, so that non-JMX users can just
> use it as gmetric4j. So for JMX, you use gmetric4j + jmxetric together.
Re: [Ganglia-general] Java/JMX plugin for Ganglia 3.1.x
Hi Peter,

thanks. Unfortunately, due to the situation at the customer site, I am bound to 3.1.x. But I will remember this.

Cheers
Martin

From: Peter Phaal peter.ph...@gmail.com
To: Martin Knoblauch kn...@knobisoft.de
Cc: ganglia general ganglia-general@lists.sourceforge.net
Sent: Saturday, September 15, 2012 12:57 AM
Subject: Re: [Ganglia-general] Java/JMX plugin for Ganglia 3.1.x

> Martin,
>
> If you can upgrade to the latest Ganglia release you could use sFlow to
> monitor your Tomcat servers: the jmx-sflow-agent exports standard JVM
> metrics, or the tomcat-sflow-valve can export the JVM metrics as well as
> HTTP counters and transactions.
>
> http://host-sflow.sourceforge.net/relatedlinks.php
>
> Cheers,
> Peter
[Ganglia-general] Java/JMX plugin for Ganglia 3.1.x
Hi,

as part of a larger Tomcat deployment I need to monitor several Tomcat instances and want to add the measured data to a Ganglia setup. I already found JMXtrans, which seems a cool solution, but it uses host spoofing and I am not sure it is what I really want. It needs some real investigating.

What I would love to have is a gmond plugin that can simply add the measured metrics to the system metrics. Has anybody already done such a plugin, or is anybody working on one? I could provide testing, feedback and maybe help.

Cheers
Martin
Re: [Ganglia-general] Ganglia gmond memory leak?
Hi Aidan,

for what it is worth, I cannot reproduce the growing memory consumption on a small 3.2.0 grid using only standard metrics in unicast mode. It has been running for a few hours now. I will check again tomorrow.

Cheers
Martin

From: Aidan Wong aidanw...@attinteractive.com
To: Ave-Lallemant, Nathan P nathan.p.ave-lallem...@efleets.com; ganglia-general ganglia-general@lists.sourceforge.net
Sent: Thursday, February 23, 2012 8:34 AM
Subject: Re: [Ganglia-general] Ganglia gmond memory leak?

> I've restarted the gmond process and memory usage drops until gmond hogs
> memory over time. Any Ganglia contributors who may want to chime in on
> this memory leak issue? I'm on Ganglia 3.2.0. Are there any improvements
> in version 3.3.1 addressing this issue?
>
> Thanks
>
> From: Ave-Lallemant, Nathan P nathan.p.ave-lallem...@efleets.com
> Date: Wed, 22 Feb 2012 16:31:58 -0600
> To: Aidan Wong aidanw...@attinteractive.com, ganglia-general ganglia-general@lists.sourceforge.net
> Subject: RE: Ganglia gmond memory leak?
>
>> I have seen the same behavior in my environment but do not have a solution.
>>
>> Nathan
>>
>> From: Aidan Wong [mailto:aidanw...@attinteractive.com]
>> Sent: Wednesday, February 22, 2012 4:10 PM
>> To: ganglia-general
>> Subject: [Ganglia-general] Ganglia gmond memory leak?
>>
>>> Hi,
>>>
>>> it looks like my install of gmond version 3.2.0 is leaking memory. The
>>> amount of resident memory that the process uses gets pretty high and
>>> keeps increasing.
>>>
>>> USER PID   %CPU %MEM VSZ     RSS     TTY STAT START TIME  COMMAND
>>> root 18647 0.0  9.9  2965464 1836268 ?   Ss   Jan14 11:24 /home/t/hadoop-ganglia-client/sbin/gmond -c /home/t/hadoop-ganglia-client/gmond.conf -p /home/t/hadoop-ganglia-client/logs/gmond.pid
>>>
>>> Is this a bug? Can anyone suggest a solution?
>>>
>>> Thank you
Re: [Ganglia-general] Ganglia gmond memory leak?
Hi Aidan,

if it is possible for you, I would suggest running gmond in the foreground under the control of valgrind or a similar tool. Please send us the report generated by the tool.

Cheers
Martin
Re: [Ganglia-general] Ganglia gmond memory leak?
Hi Jesse,

but in that case the memory footprint of gmond would approach a maximum after some time, correct? Aidan did not say whether it grows forever or levels off asymptotically. Aidan?

Cheers
Martin

From: Jesse Becker haw...@gmail.com
To: Aidan Wong aidanw...@attinteractive.com
Cc: ganglia-general ganglia-general@lists.sourceforge.net
Sent: Thursday, February 23, 2012 2:36 PM
Subject: Re: [Ganglia-general] Ganglia gmond memory leak?

> How many metrics are you monitoring? gmond must allocate memory for each
> metric, from each host. If you are using multicast, each gmond instance
> will get metrics from all other instances.
>
> If you run gmond in isolation (no traffic to/from other gmond instances),
> does memory usage still go up?
>
> --
> Jesse Becker
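Whether the footprint "grows forever or levels off" can be checked by sampling the RSS column of ps at regular intervals. The Python sketch below extracts RSS from a ps aux style line and applies a simple levelling-off test. The heuristic, its tolerance and the sample series are assumptions made for illustration, not a method proposed in the thread; the gmond path in the sample line is shortened to a hypothetical one.

```python
def rss_kib(ps_line):
    """Extract the RSS column (KiB) from one line of 'ps aux' style output."""
    # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    fields = ps_line.split(None, 10)
    return int(fields[5])


def deltas(samples):
    """Differences between consecutive RSS samples."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def levelling_off(samples, tolerance_kib=1024, window=3):
    """True if the last `window` changes all stay within the tolerance."""
    changes = deltas(samples)
    if len(changes) < window:
        return False
    return all(abs(change) <= tolerance_kib for change in changes[-window:])


ps_line = "root 18647 0.0 9.9 2965464 1836268 ? Ss Jan14 11:24 /path/to/gmond -c gmond.conf"
print(rss_kib(ps_line))

# Hypothetical sample series (KiB), one value per sampling interval.
bounded = [100000, 150000, 180000, 180200, 180300, 180250]
leaking = [100000, 200000, 300000, 400000]
print(levelling_off(bounded), levelling_off(leaking))
```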
[Ganglia-general] Looking for 3.1.7 binaries/rpms for RHEL-5.x on IA64
Hi folks,

does someone have those available? A species on the extinction list, I know, but a customer has a bunch of those.

Thanks in advance
Martin
[Ganglia-general] Gmond not reporting some metrics (3.1.7 unicast running on RHEL-6.1)
Hi,

while setting up a new cluster, I came across the following problem:

a) Headnode: RHEL-6.1 (x86_64, ESX VM, yum up-to-date) with gmetad/gmond 3.1.7 RPMs from EPEL
b) Gmond node: RHEL-6.1 (x86_64, real hardware, not up-to-date for customer reasons), 3.1.7 RPM from EPEL, different network

This is a unicast setup, with both gmonds reporting to themselves and to each other. Multicast is not possible because the switch/router refuses to do multicast.

The gmond-only node fails to report bytes in, bytes out, load (except load-1), memory and CPU metrics. Under debug I see that it is monitoring those metrics, but not sending them, although there should be changes beyond the thresholds. The node with gmond/gmetad works great.

Any ideas? I saw some similar reports with RHEL-5.5, but no conclusion. If needed, I can produce config files tomorrow.

Cheers
Martin
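For context on "changes beyond the thresholds": as I understand the gmond configuration, a collected metric is sent when its value has changed by more than its value_threshold, or when the time_threshold of its collection group has expired, whichever comes first. The Python sketch below models that decision. It is a simplified illustration under that assumption, not gmond's actual code; the function name, the ">=" comparisons and the sample numbers are hypothetical.

```python
def should_send(last_sent, current, value_threshold, seconds_since_send, time_threshold):
    """Decide whether a freshly collected value would be sent (simplified model)."""
    if last_sent is None:
        return True  # nothing has been sent yet
    if abs(current - last_sent) >= value_threshold:
        return True  # value moved beyond the value threshold
    return seconds_since_send >= time_threshold  # periodic resend


# Hypothetical numbers: value_threshold 4096, time_threshold 300 seconds.
print(should_send(1000.0, 6000.0, 4096.0, 10, 300))   # large change
print(should_send(1000.0, 1100.0, 4096.0, 10, 300))   # small change, too early
print(should_send(1000.0, 1100.0, 4096.0, 300, 300))  # time threshold reached
```

If a node collects such metrics but never sends them even though the values change a lot, the thresholds themselves are unlikely to be the cause, which is why the config files and debug output are the next things to look at.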
Re: [Ganglia-general] revisiting bogus spikes
Hi David, this is kind of helpful. What seems to happen is that the bytes-in counter (rbi) for your network card either wraps around completely or goes backwards by about 20-210 MB between two calls to update_ifdata. This would definitely lead to PB spikes. If I recall correctly, this is a bit different from the case that made me write that REMOVE_BOGUS_SPIKES thing. There the bogus numbers were much more erratic. I modelled the thresholds in the #ifdef REMOVE_BOGUS_SPIKES section:
if ((l_bin > 1.0e13) || (l_bout > 1.0e13) || (l_pin > 1.0e8) || (l_pout > 1.0e8)) {
They might not be adequate for your scenario. You may need to add a few more debug statements to find the right values. Without actually having such a system at hand I cannot do much more. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de

From: David Lee david.yi@gmail.com To: ganglia-general@lists.sourceforge.net Sent: Monday, July 18, 2011 8:41 AM Subject: [Ganglia-general] revisiting bogus spikes
I wanted to add to the original thread regarding bogus spikes in network graphs, which were suspected to be caused by Broadcom NICs that ship with many of the HP Proliant series servers today. We're running HP BL460G6, with VMware ESXi 4.1u1 hypervisors, and RHEL5.3 x64 guests. Using gmond-3.2 built off of the ganglia-3.2.0 source rpm, we're seeing the network spikes as well (PB range). Running in debug=10, I've found entries like this:
update_ifdata(BO) - Overflow in rbi: 910239662712 - 910029125551
** bytes_out: 234956.359375
metric 'bytes_out' has value_threshold 4096.00
metric 'bytes_in' being collected now
** bytes_in: 461075631262662656.00
metric 'bytes_in' has value_threshold 4096.00
metric 'pkts_in' being collected now
** pkts_in: 251.174362
metric 'pkts_in' has value_threshold 256.00
metric 'pkts_out' being collected now
** pkts_out: 166.366455
metric 'pkts_out' has value_threshold 256.00
update_ifdata(BO) - Overflow in rbi: 916309233232 - 916289211909
** bytes_out: 375413.312500
metric 'bytes_out' has value_threshold 4096.00
metric 'bytes_in' being collected now
** bytes_in: 461094494759026688.00
metric 'bytes_in' has value_threshold 4096.00
metric 'pkts_in' being collected now
** pkts_in: 498.569885
metric 'pkts_in' has value_threshold 256.00
metric 'pkts_out' being collected now
** pkts_out: 303.376251
metric 'pkts_out' has value_threshold 256.00
Kernel 2.6.18-128.el5 #1
I was not able to find any other obvious error messages related to interface metrics. We are seeing this across all of our Proliant series servers. Thanks DL
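The two "Overflow in rbi" entries in the debug output above can be checked by hand. The Python sketch below is not Ganglia code; it only redoes the arithmetic. It assumes that the first number in each log entry is the previous counter value, the second is the current one, and that gmond was built as a 64-bit binary (so ULONG_MAX is 2^64 - 1). The helper names are made up for illustration. Since bytes_in is summed over all interfaces, the implied interval is only a rough estimate.

```python
# Values copied from the debug output quoted above.
ULONG_MAX_64 = 2**64 - 1  # assumption: gmond built as a 64-bit binary

samples = [
    # (previous rbi, current rbi, logged bytes_in rate)
    (910239662712, 910029125551, 461075631262662656.0),
    (916309233232, 916289211909, 461094494759026688.0),
]

def backward_step(prev, cur):
    """How far the counter moved backwards, in bytes."""
    return prev - cur

def wrapped_delta(prev, cur, ulong_max=ULONG_MAX_64):
    """Delta as a wrap-around branch would compute it: ULONG_MAX - prev + cur."""
    return ulong_max - prev + cur

for prev, cur, rate in samples:
    step = backward_step(prev, cur)
    delta = wrapped_delta(prev, cur)
    print(f"step back: {step / 1e6:.1f} MB, "
          f"wrapped delta: {delta:.3e} bytes, "
          f"implied interval: {delta / rate:.1f} s")
```

Under these assumptions the counter stepped back by roughly 210 MB and 20 MB, which matches the "20-210 MB" mentioned in the reply, and the logged bytes_in rates are consistent with a wrapped delta of about 1.8e19 bytes spread over roughly 40 seconds.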
Re: [Ganglia-general] revisiting bogus spikes
Hi Patrick, it would be *really* important to see the debug messages that are part of the network metric code on Linux. That way we would see what the counters are when the spikes happen. This could provide more insight. As for making my/the REMOVE_BOGUS_SPIKES the default, I have my doubts. At least in its current form it is modelled very strictly on the failure mode I experienced back in 200x. It also has some smoothing/levelling effect on the data that might not be welcome, especially with interfaces faster than 1G. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de

- Original Message - From: Patrick Gilbert pdgilb...@gmail.com To: Ganglia-general@lists.sourceforge.net Cc: Sent: Wednesday, June 22, 2011 2:36 AM Subject: Re: [Ganglia-general] revisiting bogus spikes
So, to add to some of the data I've read here: I'm also experiencing this issue on VMX3 clusters with para-virtualization enabled. Seems odd that an OS that has no real knowledge of the physical network hardware would also exhibit the spiking issue. Has anyone else experienced this? To be fair, the underlying hardware does contain the Broadcom NICs. Also on this same topic, will the REMOVE_BOGUS_SPIKES flag be a default flag in future releases? Can anyone confirm this works (so I don't have to recompile :)? Thanks, Patrick Gilbert
Re: [Ganglia-general] revisiting bogus spikes
Hi Cameron, [adding the developers list] OK:
1) we write the unmodified data in line 233 to capture the raw counters. That is what we are using in line 227 for the comparison
2) ns is created and returned by hash_lookup
3) The ULONG_MAX logic in line 231 is there because we need to ensure that the result is always positive. Needed because the variables are unsigned.
4) update_ifdata is called once by metric_init and then every time one of the byte/pkts_in/out collectors fires
Now this does not solve your problem ... Question: do you see any of the debug messages that should be created by update_ifdata in case of something unusual? That should help to get an idea of what the interface counters on your machine(s) look like. Look in /var/log/messages, or just start gmond noninteractive. Hmm. Another question: do you compile gmond in 64-bit or 32-bit mode? The ULONG_MAX logic may/will fail in 32-bit mode, if the kernel is 64-bit. It could even be that the interface counters on 32-bit kernels are written as 64-bit values. Hope this helps Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de

From: Cameron L. Spitzer cspit...@nvidia.com To: ganglia-general@lists.sourceforge.net ganglia-general@lists.sourceforge.net Sent: Thu, April 28, 2011 3:21:04 AM Subject: [Ganglia-general] revisiting bogus spikes
Once again I've been asked to make Ganglia usable on Linux hosts with the Broadcom NIC with the 32-bit byte counters. E.g., HP Proliant 580 G5, a rather popular machine where Ganglia doesn't work out of the box. So I'm trying to understand ganglia-3.1.7/libmetrics/linux/metrics.c again. In update_ifdata(), we parse /proc/net/dev for the current bytes and packets in and out. There's a structure ns (declared where?) of type net_dev_stats, representing the previous sample? I'm not sure exactly what ns represents. There's a sanity check at line 227
if ( rbi >= ns->rbi )
for whether the counter went up or down. If it went down, we assume the counter rolled around, and guess the value is negative, and invert it, line 231.
l_bytes_in += ULONG_MAX - ns->rbi + rbi;
(I don't understand how that is supposed to work.) Then, regardless of whether the sample passed or failed the sanity check, it's saved in the ns structure. Line 233,
ns->rpi = rpi;
After the parsing is all done, and the crazy value is in ns, an optional reasonableness test (REMOVE_BOGUS_SPIKES) returns early if any of the numbers are extremely large. Otherwise it updates the static running counts and then returns. On our HP 580G5s, defining REMOVE_BOGUS_SPIKES had no effect. The network traffic graphs become useless within a minute of starting gmond. The part I don't understand is when the line 227 check fails, we put the known-bad data in ns anyway. I'd appreciate it if someone familiar with update_ifdata() could explain its logic. When is this routine called? (I can see modules/network/mod_net.c calls it via bytes_in_func(), but I haven't figured out when net_metric_handler() is called. Maybe that would explain how bogus data in ns doesn't matter.) Is there any way to keep way out-of-scale data out of these graphs? Thanks for any help. -Cameron in Los Gatos
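Point 3 and the 32-bit/64-bit question above can be illustrated with a small simulation. This is a sketch in Python, not the code from metrics.c: counter_delta is a hypothetical helper that mimics the wrap-around branch with a configurable ULONG_MAX, and it assumes a counter that really is 32 bits wide and wraps exactly once between two samples.

```python
ULONG_MAX_32 = 2**32 - 1
ULONG_MAX_64 = 2**64 - 1

def counter_delta(prev, cur, ulong_max):
    """Bytes seen between two samples, mimicking the logic discussed above.

    If the counter did not go down, the delta is a plain difference.
    Otherwise a single wrap-around is assumed and ulong_max is used
    to keep the result positive.
    """
    if cur >= prev:
        return cur - prev
    return ulong_max - prev + cur

# A 32-bit hardware counter sits 1000 bytes below its limit,
# then 3000 more bytes arrive and it wraps.
prev = 2**32 - 1000
cur = (prev + 3000) % 2**32

print(counter_delta(prev, cur, ULONG_MAX_32))  # close to the true 3000
print(counter_delta(prev, cur, ULONG_MAX_64))  # astronomically large
```

With the 32-bit constant the result is 2999, one byte short of the true 3000 because the formula uses ULONG_MAX rather than ULONG_MAX + 1. With the 64-bit constant the same two samples yield a value around 1.8e19, which is the kind of number that would show up as a PB spike if the counter width and the constant do not match.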
[Ganglia-general] Fw: Network bytes spikes
forgot the list ... -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de

- Forwarded Message From: Martin Knoblauch kn...@knobisoft.de To: Bostjan Skufca bost...@a2o.si Sent: Wed, March 30, 2011 11:42:12 AM Subject: Re: [Ganglia-general] Network bytes spikes
Hi Bostjan, yes, the REMOVE_BOGUS_SPIKES workaround is *supposed* to work. It did for me, when I wrote it :-) Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de

From: Bostjan Skufca bost...@a2o.si To: Vladimir Vuksan vli...@veus.hr Cc: ganglia-general ganglia-general@lists.sourceforge.net Sent: Tue, March 29, 2011 9:25:28 PM Subject: Re: [Ganglia-general] Network bytes spikes
That really seems to be the case. Speaking off the top of my head now, but it seems that I only see this on HP DL3x0 with Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12) interfaces. I've found some threads... Anyway, does this really work? There is something in the code which eliminates values of 1e13 and bigger, or so it seems...
make CPPFLAGS=-DREMOVE_BOGUS_SPIKES
b.

On 29 March 2011 20:30, Vladimir Vuksan vli...@veus.hr wrote: I see it all the time :-(. According to Bernard this is due to a problem with some of the Broadcom cards. Perhaps Bernard can offer more insight.

On Tue, 29 Mar 2011 20:23:31 +0200, Bostjan Skufca bost...@a2o.si wrote: Hi, occasionally I notice huge spikes in network graphs in ganglia (petabytes per second or so). Not sure whether those are caused by gmond restarts or network interface byte counter overflows or something else. Is someone else also seeing similar behaviour? Running latest ganglia (3.1.7). b.
Re: [Ganglia-general] Network bytes spikes
Hi Cameron, there are two problems:
a) overflow. 32-bit counters will not last very long on 1 Gbit or faster. They should not report PB spikes though.
b) some BMC adapters on Linux-64 had/have a really bad HW bug, reporting bogus counters every now and then. That is supposed to be fixed by REMOVE_BOGUS_SPIKES, but only on Linux. But no guarantees. It worked for me on 3.0.7.
Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de

From: Cameron Spitzer cspit...@nvidia.com To: Bostjan Skufca bost...@a2o.si Cc: ganglia-general ganglia-general@lists.sourceforge.net Sent: Tue, March 29, 2011 11:01:24 PM Subject: Re: [Ganglia-general] Network bytes spikes
CPPFLAGS=-DREMOVE_BOGUS_SPIKES had no effect in my installation. We eventually found a patch in a non-ganglia forum somewhere, but I can't find it now. It basically added input sanity checking. The problem is that a 32-bit counter on a 1 Gbps NIC can overflow in less than gmond's sampling interval. When it overflows, ganglia treats the small negative number as a very large positive. This is a known ganglia bug. It's been around since 2003. You just have to live with it, or try to fix it yourself. -Cameron
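The overflow argument above can be put into numbers. The following Python sketch is only back-of-the-envelope arithmetic; it assumes a fully saturated link and ignores framing overhead, and seconds_to_wrap is a name made up for this illustration.

```python
def seconds_to_wrap(counter_bits, link_bits_per_second):
    """Time for a byte counter to wrap once at full line rate."""
    bytes_per_second = link_bits_per_second / 8
    return 2**counter_bits / bytes_per_second

for gbit in (1, 10):
    t = seconds_to_wrap(32, gbit * 1e9)
    print(f"{gbit:>2} Gbit/s: a 32-bit byte counter wraps after {t:.1f} s")
```

At 1 Gbit/s a 32-bit byte counter lasts about 34 seconds, so any collection interval longer than that can miss a complete wrap, and at 10 Gbit/s the margin shrinks to a few seconds.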
Re: [Ganglia-general] Multicast/Unicast Poll
- Original Message From: Seth Graham set...@fnal.gov To: Jesse Becker haw...@gmail.com Cc: Ganglia Mailing List ganglia-general@lists.sourceforge.net Sent: Wed, January 12, 2011 10:31:49 PM Subject: Re: [Ganglia-general] Multicast/Unicast Poll
> On Jan 12, 2011, at 3:12 PM, Jesse Becker wrote:
>> In light of the recent discussions over metadata and unicast vs. multicast, we (meaning Bernard) have created a poll on http://ganglia.info/ to try and gauge the use of each. Please let us know if you use multicast, unicast, or both in your environments. If you have any comments about using one or the other,
> We used multicast for a long time because it's certainly easy, and ganglia is something multicast is well suited for. But as the years rolled on, firewalls got involved, people became concerned about memory and network usage, and subnet privacy was eroding. We started getting other departments' machines mixed in with our machines, and this caused all kinds of confusion on both sides. Migrating to unicast eliminated the firewall issues, means only a select few machines have to keep metrics in memory, and no more cross talk with other groups. I never saw any solid evidence that ganglia was putting an unfair load on systems, but it was easier to reconfigure than fight it. So the reasons to switch were mostly political.
Basically my reasons for using unicast are very much the same. For new installations I will always use UC today. For old installations I am moving from MC to UC if the situation allows. Cheers Martin
Re: [Ganglia-general] last-N-hours view
Hi Ryan, it works as designed :-) Your new intervals do not have proper samples in the RRD database, so the graphs are blown up from the day interval. You need to tell gmetad to generate samples for the two- and three-hour intervals. Something like this in gmetad.conf should do, although I am no specialist.
RRAs RRA:AVERAGE:0.5:1:244 RRA:AVERAGE:0.5:24:244 \
RRA:AVERAGE:0.5:2:244 RRA:AVERAGE:0.5:3:244 \
RRA:AVERAGE:0.5:168:244 RRA:AVERAGE:0.5:672:244 \
RRA:AVERAGE:0.5:5760:374
Warning: you need to remove (or move) your old data first. Or you need some rrd-magic to add the new intervals to the old database. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de

From: 朱韬 ryanzhu...@163.com To: ganglia-general@lists.sourceforge.net Sent: Wed, December 29, 2010 8:36:38 AM Subject: [Ganglia-general] last-N-hours view
Hi guys: I encountered the problem that my job lasted for a few hours while ganglia does not support a last-N-hours view. So I tried to add two view modes to conf.php as follows:
$time_ranges = array( 'hour'=>3600, 'twohours'=>7200, 'threehours'=>10800, 'day'=>86400, 'week'=>604800, 'month'=>2419200, 'year'=>31449600 );
But it does not work as it should. The resolution of the added views is much lower than that of the original ones. Is there any other code to be modified? Thank you ryan zhu
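To see why the extra RRAs matter, it helps to work out how much time each archive covers. The Python sketch below parses the RRA strings from the suggestion above and multiplies them out. It assumes an RRD step of 15 seconds, which is believed to be gmetad's default but should be checked against the actual RRD files; rra_span is a hypothetical helper written for this illustration.

```python
STEP_SECONDS = 15  # assumption: gmetad's RRD step; adjust if yours differs

RRAS = [
    "RRA:AVERAGE:0.5:1:244",
    "RRA:AVERAGE:0.5:24:244",
    "RRA:AVERAGE:0.5:2:244",
    "RRA:AVERAGE:0.5:3:244",
    "RRA:AVERAGE:0.5:168:244",
    "RRA:AVERAGE:0.5:672:244",
    "RRA:AVERAGE:0.5:5760:374",
]

def rra_span(rra, step=STEP_SECONDS):
    """Return (seconds per row, total seconds covered) for one RRA definition."""
    _, _cf, _xff, steps, rows = rra.split(":")
    seconds_per_row = int(steps) * step
    return seconds_per_row, seconds_per_row * int(rows)

for rra in RRAS:
    per_row, total = rra_span(rra)
    print(f"{rra:28s} {per_row:>6d} s/row, covers {total / 3600:8.1f} h")
```

Under the 15-second assumption, the two added archives hold one row per 30 and 45 seconds and cover 7320 and 10980 seconds, just enough for the 7200- and 10800-second views. Without them, a two-hour graph has to be drawn from the day archive at one row per 360 seconds, which would explain the low resolution.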
Re: [Ganglia-general] restarting the gmond collector node causes no data to be reported
From: Cameron L. Spitzer cspit...@nvidia.com To: Bernard Li bern...@vanhpc.org Cc: Louis Coilliot louis.coill...@think.fr; ganglia-general@lists.sourceforge.net ganglia-general@lists.sourceforge.net Sent: Wed, November 17, 2010 10:36:00 PM Subject: Re: [Ganglia-general] restarting the gmond collector node causes no data to be reported
> Just out of curiosity, I followed the link in Bernard's message. I didn't find anything related to Russell's question. I followed the link to Current Release Notes, and searched the page for send_metadata_interval, which is cheating, because I would only have Russell's question if I didn't know about send_metadata_interval. Then I followed the link to Ganglia FAQs. Someone who already understood Ganglia pretty well might make the connection between Russell's question "... no metrics are reported anymore" and the FAQ "Sometimes graphs don't show up for hosts". I doubt a newcomer would see it. That's unclear.
Definitely, one of the not-so-strong points of Ganglia is documentation. Frankly, I have been using it for quite some years now, but this behavior/option was new to me. Cheers Martin
Re: [Ganglia-general] restarting the gmond collector node causes no data to be reported
Hi Bernard,
- Original Message From: Bernard Li bern...@vanhpc.org To: Louis Coilliot louis.coill...@think.fr Cc: ganglia-general@lists.sourceforge.net Sent: Wed, November 17, 2010 9:16:22 PM Subject: Re: [Ganglia-general] restarting the gmond collector node causes no data to be reported
> Hello: This is actually documented in both the release notes and the FAQs in our Wiki: http://sourceforge.net/apps/trac/ganglia/wiki Please let us know if anything is unclear. Thanks, Bernard
besides the fact that this is really unclear and difficult to find, we may want to consider a different default for unicast mode. It is always better not to let people run into foreseeable problems. Cheers Martin
> On Wed, Nov 17, 2010 at 1:14 PM, Louis Coilliot louis.coill...@think.fr wrote:
>> Hello, this behaviour is reported from time to time with unicast :) Use: send_metadata_interval = 600 (600, for example) in the gmond.conf for your nodes. The metrics should come back after a while. Louis
Re: [Ganglia-general] restarting the gmond collector node causes no data to be reported
- Original Message From: Kostas Georgiou k.georg...@atreides.org.uk To: ganglia-general@lists.sourceforge.net Sent: Thu, November 18, 2010 11:57:29 AM Subject: Re: [Ganglia-general] restarting the gmond collector node causes no data to be reported
> On Thu, Nov 18, 2010 at 02:44:13AM -0800, Martin Knoblauch wrote:
>> besides that this is really unclear and difficult to find, we may want to consider a different default for unicast mode. It is always better to not let people run into foreseeable problems.
> You can get the same problems with multicast as well,
Does this really happen in MC mode? I would call that a bug then.
> what is the reasoning for the send_metadata_interval=0 default?
Can't answer that one. cheers Martin
[Ganglia-general] Fw: How can gmetad be configured for 2 clusters?
sorry, forgot the list ... -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de

- Forwarded Message From: Martin Knoblauch kn...@knobisoft.de To: Whit Blauvelt w...@transpect.com Sent: Fri, November 12, 2010 5:35:44 PM Subject: Re: [Ganglia-general] How can gmetad be configured for 2 clusters?
Hi Whit, let me guess, all of your machines are running multicast, and all are on the same port? As a result, every gmond will have the complete information for all 8 nodes. That is what you see. Try telnet 192.168.19 8649 and you will see the info of all eight nodes. In order to separate the two clusters, they need to run on different ports. In addition: when you list more than one node on the data_source, this does not define the cluster. It just adds failover capability. gmetad will only talk to one of the hosts at a time. If that fails, it will try the next on the list. Hope this helps a bit Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de

- Original Message From: Whit Blauvelt w...@transpect.com To: ganglia-general@lists.sourceforge.net Sent: Fri, November 12, 2010 4:53:49 PM Subject: [Ganglia-general] How can gmetad be configured for 2 clusters?
Hi, Although I've looked through the docs, I must not be looking in the right place. We've added a second cluster, and want to track it as a separate entity from the first. What intuitively seems likely to work doesn't accomplish that. I've tried:
- Defining two clusters like this in gmetad.conf:
data_source Cluster1 localhost 192.168.19 192.168.1.32 192.168.1.16
data_source Cluster2 192.168.1.24 192.168.1.8 192.168.1.5 192.168.1.6
- And defining the cluster name in each gmond.conf:
cluster { name = Cluster1 owner = unspecified latlong = unspecified url = unspecified }
The result? The Web front end gives a choice of GridCluster1 or Cluster2, but either choice shows all 8 machines in both clusters. (The only difference is that under Cluster1 the Linux members all have their names shown in the listing, while under Cluster2 the Linux members are shown just by IPs, while the OSX machines show their names in both cases, but this isn't the show stopper here.) No doubt the right solution is as simple and obvious as the wrong one I've tried. But what is it? All examples I've found assume a single cluster. Thanks, Whit
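The data_source behaviour described in the reply above (extra hosts are failover candidates, not a definition of cluster membership) can be sketched as follows. This is an illustration in Python, not gmetad's code; poll_data_source, fetch and fake_fetch are hypothetical names, and the fetch callable stands in for "connect to that gmond and read its XML".

```python
def poll_data_source(hosts, fetch):
    """Return (host, data) from the first host that answers.

    hosts: addresses listed on one data_source line, in order.
    fetch: callable taking a host and returning its data, raising
           OSError if the host cannot be reached.
    """
    for host in hosts:
        try:
            return host, fetch(host)
        except OSError:
            continue  # try the next host on the list
    raise OSError("no host in this data_source answered")

# Toy stand-in for a network fetch: the first host is down.
def fake_fetch(host):
    if host == "192.168.1.24":
        raise OSError("connection refused")
    return f"<xml from {host}>"

host, data = poll_data_source(
    ["192.168.1.24", "192.168.1.8", "192.168.1.5"], fake_fetch
)
print(host, data)
```

Only one host per data_source is ever asked, so whatever that single gmond knows is what ends up under the cluster name. If that gmond has also heard from machines of the other cluster, they show up too.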
Re: [Ganglia-general] How can gmetad be configured for 2 clusters?
- Original Message From: David Birdsong david.birds...@gmail.com To: Whit Blauvelt w...@transpect.com Cc: Martin Knoblauch kn...@knobisoft.de; ganglia-general@lists.sourceforge.net Sent: Fri, November 12, 2010 9:56:26 PM Subject: Re: [Ganglia-general] How can gmetad be configured for 2 clusters?
> On Fri, Nov 12, 2010 at 9:19 AM, Whit Blauvelt w...@transpect.com wrote:
>> On Fri, Nov 12, 2010 at 08:35:44AM -0800, Martin Knoblauch wrote:
>>> In order to separate the two clusters, they need to run on different ports. In addition: when you list more than one node on the data_source, this does not define the cluster. It just adds failover capability. gmetad will only talk to one of the hosts at a time. If that fails, it will try the next on the list.
>> Thanks Martin. That was the whole trick. I was making the assumption that gmetad, being meta, would be the gatherer of data from the nodes. Understanding that the gmonds go ahead and consolidate that changes the picture entirely. As my five-year-old sometimes says, Silly me. Whit
> While I can't argue against something that clearly fixed this for you, this doesn't sound correct and it would be nice to hear this clarified. Sure every host would have info about every other host, but each host's xml tree should have all the nodes nested in their corresponding cluster tags. Gmetad could hit any host and pick up info about both clusters on any host, but it should know to distribute the updates from the xml stream to the correct clusters and not 'cross pollinate' the two.
As far as I know, every gmond just puts all the information it has inside its own cluster tags. It does not care about the cluster tags it receives from other gmonds. It has always been the task of gmetad to build up the correct XML for the complete grid. Therefore it is vital that the gmond configuration for multiple clusters is correct. One could argue that this behaviour of gmond needs improvement. One solution could be that it aggregates only data coming from its own cluster. On the other hand, the cluster tag is just optional. What should a gmond without such a tag do about data from tagged gmonds? I still favor correct configuration. In any case, I am adding the ganglia developers to CC. But the confusion shows that documentation might be lacking ... Cheers Martin
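The claim that a gmond wraps everything it knows in its own cluster tag can be checked against a live node by reading what it sends on its TCP port (the same data the telnet test shows) and grouping host names by cluster. The Python sketch below is an illustration only: the function names are made up, the port 8649 is the one used in this thread, and the sample XML is a hand-written stand-in that assumes CLUSTER and HOST elements with a NAME attribute rather than real gmond output.

```python
import socket
import xml.etree.ElementTree as ET

def read_gmond_xml(host, port=8649, timeout=5.0):
    """Read everything a gmond sends on its TCP port (what telnet would show)."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

def hosts_by_cluster(xml_data):
    """Map each CLUSTER NAME in the dump to the sorted HOST NAMEs inside it."""
    root = ET.fromstring(xml_data)
    return {
        cluster.get("NAME"): sorted(h.get("NAME") for h in cluster.iter("HOST"))
        for cluster in root.iter("CLUSTER")
    }

# Hand-written sample, not real gmond output.
sample = """<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
<CLUSTER NAME="Cluster1" OWNER="unspecified">
<HOST NAME="node-a" IP="192.168.1.32"/>
<HOST NAME="node-b" IP="192.168.1.24"/>
</CLUSTER>
</GANGLIA_XML>"""

print(hosts_by_cluster(sample))
```

Against a real node one would call hosts_by_cluster(read_gmond_xml("some-node")). If the description above is right, a gmond that shares a port with machines of another cluster would list those machines under its own single cluster name.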
[Ganglia-general] Fw: How can gmetad be configured for 2 clusters?
really adding the developers ... - Forwarded Message From: Martin Knoblauch kn...@knobisoft.de To: David Birdsong david.birds...@gmail.com; Whit Blauvelt w...@transpect.com Cc: ganglia-general@lists.sourceforge.net Sent: Sat, November 13, 2010 8:34:43 AM Subject: Re: [Ganglia-general] How can gmetad be configured for 2 clusters?
Re: [Ganglia-general] gmetad only reads from one node of each data_source
Hi Marc,

the output of telnet seems to indicate that your "gmond"s indeed only see their own data. Kind of strange. I have to admit that I have not used multicast (MC) configurations for quite some time. Unicast (UC) is so much cleaner in my opinion.

Questions:
a) How many network interfaces do the nodes have?
b) If more than one, to which interface is the MC address bound? If not the first, you may want to play with "mcast_if".

The output of "ifconfig -a" and "netstat -rn" would be useful.

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de

> From: Joan Marc Riera marc.ri...@barcelonamedia.org
> To: Martin Knoblauch kn...@knobisoft.de
> Cc: "ganglia-general@lists.sourceforge.net" ganglia-general@lists.sourceforge.net
> Sent: Sat, October 23, 2010 7:17:08 PM
> Subject: Re: [Ganglia-general] gmetad only reads from one node of each data_source
>
> Hi,
>
> I have restarted everything, for sure. These are the outputs from the telnet:
>
> node01: http://paste.ubuntu.com/518811/
> node02: http://paste.ubuntu.com/518812/
>
> I've done the following to get some output.
>
> On node01, launch:
> (/usr/sbin/gmond --debug=10 2>&1 ) > /hpcdrive/homemarc.riera/node01.gmond.debug
> This is the complete output: http://paste.ubuntu.com/518824/
>
> On node02, launch:
> (/usr/sbin/gmond --debug=10 2>&1 ) > /hpcdrive/homemarc.riera/node02.gmond.debug
> This is the complete output: http://paste.ubuntu.com/518825/
>
> Restart gmetad on the ganglia server.
> Ctrl-C on node01.
> Ctrl-C on node02.
>
> I've looked at both logs and still don't get what's wrong. Shame on me.
>
> Meanwhile Ron, another user on the list, suggested that I change something in my gmond.conf:
>
> udp_recv_channel {
>   family = inet4
>   port = 8649
> }
>
> I've tried, without success. Maybe something else should be changed.
>
> On 10/22/2010 02:27 PM, Martin Knoblauch wrote:
>> Hi Marc,
>>
>> on first sight, the configs for node01 and node02 look identical and correct. Have the "gmonds" on all nodes been restarted after the changes (just to be sure :-)? What do you get from "telnet node01 8649" and "telnet node02 8649"?
>>
>> Oh, which version of gmetad/gmond are you running?
>>
>> Cheers
>> Martin
Re: [Ganglia-general] gmetad only reads from one node of each data_source
Hi Marc,

on first sight, the configs for node01 and node02 look identical and correct. Have the "gmonds" on all nodes been restarted after the changes (just to be sure :-)? What do you get from "telnet node01 8649" and "telnet node02 8649"?

Oh, which version of gmetad/gmond are you running?

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de

> From: Joan Marc Riera marc.ri...@barcelonamedia.org
> To: Martin Knoblauch kn...@knobisoft.de
> Cc: "ganglia-general@lists.sourceforge.net" ganglia-general@lists.sourceforge.net
> Sent: Fri, October 22, 2010 12:51:55 PM
> Subject: Re: [Ganglia-general] gmetad only reads from one node of each data_source
>
> Sorry, I think my response has been discarded because of the attachments. I am sending it again with my conf files on pastebin. Sorry to bother.
>
> My gmond conf has only minor changes. I'm happy to share them. I link (pastebin) to 3 files: gmond.conf from node01, node02 and nodegpu01.
>
> node01: http://pastebin.com/wa9mmT3h
> node02: http://pastebin.com/ZtwsqnNp
> nodegpu01: http://pastebin.com/3ztHULwd
>
> As I remember, the only changes I made are name and owner, depending on the cluster group, and the udp send and recv channels, which are different for each cluster group.
>
> Thanks.
>
> On 10/22/2010 12:30 PM, Martin Knoblauch wrote:
>> Hi Joan,
>>
>> what you describe sounds fine with regard to "gmetad". "gmetad" will only talk to one node per data_source. If that node fails and you have more than one node listed, it will [try to] fail over to the next available node. So far, everything is working as expected.
>>
>> Your problem is that apparently each of node01..10 only "knows" its own metrics. Nodes listed on the data_source line need to know the metrics of all nodes in the respective cluster. So it is more a problem with the configuration of your "gmond" services. Care to share the configuration of one of the nodes?
>>
>> Cheers
>> Martin
>>
>>> From: Joan Marc Riera marc.ri...@barcelonamedia.org
>>> To: ganglia-general@lists.sourceforge.net
>>> Sent: Fri, October 22, 2010 11:50:05 AM
>>> Subject: [Ganglia-general] gmetad only reads from one node of each data_source
>>>
>>> Hello,
>>>
>>> I have gmetad running with the following conf:
>>>
>>> r...@fbmsgga01:/var/lib/ganglia# cat /etc/ganglia/gmetad.conf |grep -v ^# |grep -v ^$
>>> data_source "CPU cluster" node01 node02 node03 node04 node05 node06 node07 node08 node09 node10
>>> data_source "GPU cluster" nodegpu01
>>> gridname "FBM"
>>> r...@fbmsgga01:/var/lib/ganglia#
>>>
>>> All nodes and the gmetad server are on the same vlan.
>>>
>>> I only receive nodegpu01 and node01 info, but if I stop gmond on node01 I start receiving from node02. If I stop node02 I start receiving from node03, and so on.
>>>
>>> I do not understand what is happening. Everything was working fine until yesterday, when I restarted the gmetad host. Data from nodegpu01 is being received and plotted fine.
>>>
>>> What is going on here?
>>>
>>> Thanks.
>>> Marc
>>>
>>> --
>>> Joan Marc Riera Duocastella
>>> Barcelona Media - Centre d'Innovació
>>> Av. Diagonal, 177, planta 9
>>> 08018 - BARCELONA
>>> Telèfon +34 93 238 14 00
>>> Fax +34 93 309 31 88
>>> www.barcelonamedia.org
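The failover behaviour described in this thread (one node per data_source is polled; the next one is tried only when it fails) can be pictured with a small sketch. This is not gmetad's actual code; the function name and the `fetch` callable are hypothetical stand-ins for "connect to the node and read its XML".

```python
def poll_data_source(hosts, fetch):
    """Return (host, xml) from the first host that answers, else (None, None).

    `fetch` is any callable that takes a host name and returns its XML dump;
    it is expected to raise OSError when the host cannot be reached.
    """
    for host in hosts:
        try:
            return host, fetch(host)
        except OSError:
            continue  # this host is down, fail over to the next one
    return None, None
```

The sketch makes the consequence visible: whichever node answers first supplies the data for the whole cluster, so that node has to know the metrics of all its peers.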
Re: [Ganglia-general] Running multiple gmonds on the same server
Hi Anton,

are you using a multicast or unicast setup? Unicast should work just fine. At least it did in 3.0.x. For multicast you *may* also need to run on distinct multicast addresses in addition to the distinct ports, but I never tested that.

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de

> ----- Original Message -----
> From: David Birdsong david.birds...@gmail.com
> To: Anton Yurchenko ayurche...@gmail.com
> Cc: ganglia-general@lists.sourceforge.net
> Sent: Fri, October 15, 2010 1:25:11 AM
> Subject: Re: [Ganglia-general] Running multiple gmonds on the same server
>
> I'm not there anymore, but I think it was 3.1.2.
>
> On Thu, Oct 14, 2010 at 4:23 PM, Anton Yurchenko ayurche...@gmail.com wrote:
>> Well that is good to know :)
>> What version of ganglia are you running?
>>
>> Thanks!
>>
>> On 10/14/2010 4:21 PM, David Birdsong wrote:
>>> FYI, we did exactly this for ~4-5 clusters at my last installation. It worked fine.
>>>
>>> On Thu, Oct 14, 2010 at 4:16 PM, Anton Yurchenko ayurche...@gmail.com wrote:
>>>> Hi all,
>>>>
>>>> I am trying to consolidate all the gmond aggregation nodes for the 3 clusters that we have onto a pair of servers. I tried to have the gmond for each cluster run on its own set of ports, but it's not working very well. In the ganglia UI for the clusters I can see that the number of hosts is correct, but none of the other metrics are showing.
>>>>
>>>> Is this not the right approach for running gmond for multiple clusters?
>>>>
>>>> Thanks!
>>>> Anton
Re: [Ganglia-general] Running multiple gmonds on the same server
Somehow Anton got lost ...

----- Original Message -----
From: Martin Knoblauch kn...@knobisoft.de
To: David Birdsong david.birds...@gmail.com
Cc: David Birdsong david.birds...@gmail.com; ganglia general ganglia-general@lists.sourceforge.net
Sent: Fri, October 15, 2010 9:31:41 AM
Subject: Re: [Ganglia-general] Running multiple gmonds on the same server

Hi Anton,

are you using a multicast or unicast setup? Unicast should work just fine. At least it did in 3.0.x. For multicast you *may* also need to run on distinct multicast addresses in addition to the distinct ports, but I never tested that.

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
Re: [Ganglia-general] Does Ganglia measure itself?
Hi Weston,

gmond just looks at the low-level counters provided by the OS and has no awareness of its own resource usage. So it will collect CPU usage including its own cycles.

Does this answer your question?

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de

> ----- Original Message -----
> From: Stevens, Weston J weston.j.stev...@boeing.com
> To: ganglia-general@lists.sourceforge.net
> Sent: Mon, September 20, 2010 8:20:21 PM
> Subject: [Ganglia-general] Does Ganglia measure itself?
>
> For instance, if gmetad and gmond are using a few percent of CPU, would this show up on the CPU usage graph? Or does it ignore itself and only count everything else?
>
> Thanks
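To make "low-level counters provided by the OS" concrete, here is a rough sketch of deriving a CPU percentage from two samples of the system-wide counters in Linux's /proc/stat. It is an illustration under that assumption, not gmond's implementation, and the function names are invented. Because the counters are system-wide, the cycles of whatever process is sampling them are included.

```python
def read_cpu_counters(path="/proc/stat"):
    """Return the aggregate 'cpu' line of /proc/stat as a list of ints (Linux only)."""
    with open(path) as f:
        for line in f:
            if line.startswith("cpu "):
                return [int(x) for x in line.split()[1:]]
    raise RuntimeError("no aggregate cpu line found")


def busy_percent(before, after):
    """Percentage of non-idle CPU time between two counter samples.

    Field order follows /proc/stat: user, nice, system, idle, iowait,
    irq, softirq, steal. Later (guest) fields are skipped because they
    are already counted in user/nice.
    """
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total == 0:
        return 0.0
    idle = deltas[3] + (deltas[4] if len(deltas) > 4 else 0)  # idle + iowait
    return 100.0 * (total - idle) / total
```

Taking one sample, sleeping a few seconds, taking a second sample and passing both to busy_percent gives the busy share for that window, with no way to exclude any individual process.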
Re: [Ganglia-general] Multiple clusters with unicast
Jonathan,

I do it this way:

- run the gmonds of each cluster on a dedicated port (one port per cluster)
- let them send their messages to a dedicated aggregator gmond for each cluster
- let gmetad query those aggregators on their dedicated ports

If you want to have one host in different clusters, you can run two gmonds on that host, on different ports. I never did that, but it should work.

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de

> ----- Original Message -----
> From: Jonathan Weiss j...@innerewut.de
> To: ganglia-general@lists.sourceforge.net
> Sent: Fri, July 16, 2010 11:50:08 AM
> Subject: [Ganglia-general] Multiple clusters with unicast
>
> Cheers,
>
> I'm using Ganglia with unicast on EC2 (so there is no chance for multicast). I have a typical web app with load balancers, app servers and database servers.
>
> Everything is working fine as one Ganglia cluster with unicast, by having all local gmonds use udp_send to send to one monitoring server running gmond and gmetad.
>
> My problem now is that I would like to list the different roles in my cluster in Ganglia, so that I get a CPU overview for all app servers separated from the CPU report for the DB servers.
>
> I've tried doing this by setting a different cluster name in the local gmonds. But it looks like whatever cluster name I have in the gmond of the monitoring server overrides this, so I end up having only one cluster.
>
> Is there a way of doing this without having gmetad query all gmonds?
>
> BTW, can one host be in multiple clusters? So if I have a server that is an app server and a memcached server, could I have it listed in both clusters?
>
> Regards,
> Jonathan
>
> --
> Jonathan Weiss
> http://blog.innerewut.de
> http://twitter.com/jweiss
Re: [Ganglia-general] bytes_in (and bytes_out): instantaneous or averaged?
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de

> ----- Original Message -----
> From: David Barnes david.g.bar...@gmail.com
> To: ganglia-general@lists.sourceforge.net
> Sent: Wed, July 7, 2010 1:59:13 AM
> Subject: [Ganglia-general] bytes_in (and bytes_out): instantaneous or averaged?
>
> Hi all,
>
> I am planning to use historical archives of ganglia data for our cluster to document its utilisation history and guide our next upgrade. I would like to understand the bytes_in and bytes_out metrics a bit better.
>
> Are they instantaneous, or average, measurements? I.e. say my gmond polling time is 5 seconds, and the following happens, with nothing else of significance going on:
>
> @ t = 0 second, gmond polls metrics (poll0)
> @ t = 1 second, 1Mbyte transferred in (effectively instantly)
> @ t = 3s, 5Mbyte transferred out (effectively instantly)
> @ t = 5s, gmond polls metrics (poll1)
>
> What is going to be stored in bytes_in and bytes_out for poll1? Will it be the *average* (integrated) throughput:
>
> bytes_in: 1Mbyte / 5s = 200kByte/s = bytes_in = 20
> bytes_out: 5Mbyte / 5s = 1000kBytes/s = bytes_out = 100
>
> Or will it be the instantaneous throughput measured at the time of poll1, i.e. both bytes_in and bytes_out = 0 because there is no instantaneous activity?
>
> Another way of asking the same question: is it valid to deduce long-term (aggregate) data transfer volumes from the rates expressed by bytes_in and bytes_out?
>
> Thanks very much in advance - David Barnes.
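The "average" interpretation in the question can be worked through in a few lines. This is only a sketch of that interpretation, assuming the rate is derived from the difference between two readings of a cumulative byte counter; the archived thread contains no reply confirming which behaviour gmond implements, and the function names are invented.

```python
def average_rate(counter_before, counter_after, seconds):
    """Average bytes per second between two readings of a cumulative counter."""
    return (counter_after - counter_before) / seconds


def volume_from_rates(rates, interval_seconds):
    """Total bytes implied by a series of average rates, one per polling interval."""
    return sum(rate * interval_seconds for rate in rates)
```

Under that assumption, 1 Mbyte arriving anywhere inside a 5 second window yields 200 kByte/s for that window, and multiplying each stored rate by its interval recovers the transferred volume. With instantaneous sampling, neither statement would hold.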
Re: [Ganglia-general] gmetad xml output is incomplete sometimes
Hi Miguel,

good to know that age hasn't stopped my memory from working :-) Maybe this calls for documentation.

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de

> ----- Original Message -----
> From: Miguel A. Díaz Corchero miguelangel.d...@ciemat.es
> To: Martin Knoblauch kn...@knobisoft.de
> Cc: Bernard Li bern...@vanhpc.org; ganglia-general@lists.sourceforge.net
> Sent: Tue, June 29, 2010 8:17:51 AM
> Subject: Re: [Ganglia-general] gmetad xml output is incomplete sometimes
>
> Thanks Martin. Your solution solves my problem.
>
> On Mon, 28-06-2010 at 03:15 -0700, Martin Knoblauch wrote:
>> Hi Miguel,
>>
>> just to rule that out: check the data_source lines in your gmetad.conf to make sure that gmetad is not querying its own XML port. That could result in incomplete/broken XML. And yes, we have seen it before :-)
>>
>> Cheers
>> Martin
Re: [Ganglia-general] gmetad xml output is incomplete sometimes
Hi Miguel,

just to rule that out: check the data_source lines in your gmetad.conf to make sure that gmetad is not querying its own XML port. That could result in incomplete/broken XML. And yes, we have seen it before :-)

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de

> ----- Original Message -----
> From: Miguel A. Díaz Corchero miguelangel.d...@ciemat.es
> To: Bernard Li bern...@vanhpc.org
> Cc: ganglia-general@lists.sourceforge.net
> Sent: Mon, June 28, 2010 8:27:50 AM
> Subject: Re: [Ganglia-general] gmetad xml output is incomplete sometimes
>
> Hi Bernard.
>
> Right now I'm only monitoring 5 hosts:
> - 2 of 5 are switches and only have 3 metrics. For those I'm using 3 gmetric calls every minute.
> - 3 of 5 are hosts with the default metrics and default time values.
>
> The problem appears in both cases: switches and hosts.
>
> Looking at gmetad in debug mode, I noticed 3 events (updating, writing, clearing). Maybe those events are related to my problem (perhaps the clearing event).
>
> Thanks,
> Miguel.
>
> On Fri, 25-06-2010 at 10:57 -0700, Bernard Li wrote:
>> Hi Miguel:
>>
>> How many hosts and metrics are you monitoring with your gmetad?
>>
>> Cheers,
>> Bernard
>>
>> 2010/6/25 Miguel A. Díaz Corchero miguelangel.d...@ciemat.es:
>>> Hi.
>>>
>>> I'm getting the XML output from gmetad and saving it in a file. Sometimes the output XML has more machines than at other times. For example, at 2 p.m. the XML output is
>>>
>>> <grid>
>>>   <cluster1>
>>>     <host 1>
>>>     <host 2>
>>>     <host 3>
>>>   </cluster1>
>>> </grid>
>>>
>>> One minute later, the XML output is (for example)
>>>
>>> <grid>
>>>   <cluster1>
>>>     <host 1>
>>>   </cluster1>
>>> </grid>
>>>
>>> But another minute later, the XML output is (for example)
>>>
>>> <grid>
>>>   <cluster1>
>>>     <host 1>
>>>     <host 2>
>>>     <host 3>
>>>   </cluster1>
>>> </grid>
>>>
>>> I have checked that the hosts were running and that they were ok. I think gmetad only shows updated data, but I'm not sure. Do you know why gmetad occasionally shows only part of the data and not all of it?
>>>
>>> Regards
>>> Miguel.
Re: [Ganglia-general] Ganglia Cluster grouping issues...
> From: Nitin Bharadwaj west.ni...@gmail.com
> To: Ofer Inbar c...@a.org
> Cc: ganglia-general@lists.sourceforge.net
> Sent: Wed, March 10, 2010 10:30:32 AM
> Subject: Re: [Ganglia-general] Ganglia Cluster grouping issues...
>
> Kool! I did just that, plus one additional thing (when wiping out the RRDs didn't help at all): additional lines in gmond.conf for cluster-B
>
> trusted_hosts = IP Address of gmetad
> all_trusted = on
>
> Now, IT WORKS!! THANKS A LOT FOLKS!! REALLY APPRECIATE YOUR PATIENCE AND TIME!! :-)

Good that it works. Did you have to make similar changes for cluster-H? I am a bit surprised that those lines are necessary. What is the IP address of your gmetad host?

Cheers
Martin
Re: [Ganglia-general] Ganglia Cluster grouping issues...
> ----- Original Message -----
> From: Nitin Bharadwaj nitin.bharad...@mkhoj.com
> To: ganglia-general@lists.sourceforge.net
> Sent: Tue, March 9, 2010 10:03:03 AM
> Subject: [Ganglia-general] Ganglia Cluster grouping issues...
>
> Hi,
>
> I have a scenario as below (it might be a silly one, I'm not used to the Ganglia configs yet):
>
> I have hosts h1-h7, which need to go under Cluster-H in Ganglia, and similarly b1-b3, which need to go under Cluster-B.
>
> Now, here is what my gmond.conf (for both host groups) and gmetad.conf look like:
>
> h1-h7 gmond.conf:
> name Cluster-H
> (remaining default)
>
> b1-b3 gmond.conf:
> name Cluster-B
> (remaining default)
>
> gmetad.conf:
> data_source Cluster-B b1 b2 b3
> data_source Cluster-H h1 h2 h3 h4 h5 h6 h7
>
> Now, whatever I do, I see all these 10 hosts (h1-h7 and b1-b3) under both Cluster-H and Cluster-B. How do I get this resolved?
>
> Any help will be greatly appreciated.
>
> Thanks,
> Nitin

Hi Nitin,

your scenario is not silly at all. My guess is that all of your hosts operate their gmonds on the same UDP channel. You need to use different ports for b1-b3 and h1-h7. Let's say you change b1-b3 to port 9649 (in gmond.conf); your gmetad configuration should then look like:

data_source Cluster-B b1:9649 b2:9649 b3:9649
data_source Cluster-H h1 h2 h3 h4 h5 h6 h7

Btw, it is sufficient to name just one of the hosts on the data_source line. The others are only queried if the first one fails.

Cheers
Martin
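A quick sanity check for this kind of setup is to write down which port each cluster's gmonds use and look for collisions. The helper below is a hypothetical sketch (the name and the input mapping are invented for illustration); it knows nothing about Ganglia itself and only groups cluster names by port.

```python
from collections import defaultdict


def shared_ports(cluster_ports):
    """Return {port: [cluster names]} for every port used by more than one cluster."""
    by_port = defaultdict(list)
    for cluster, port in cluster_ports.items():
        by_port[port].append(cluster)
    return {port: sorted(names) for port, names in by_port.items() if len(names) > 1}
```

With both clusters left on the default port, shared_ports({"Cluster-B": 8649, "Cluster-H": 8649}) reports the collision; after moving Cluster-B to 9649, the result is empty.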
Re: [Ganglia-general] web front end is receiving cut-off XML
From: Maes, Richard rm...@ciena.com
To: Bernard Li bern...@vanhpc.org
Cc: ganglia-general@lists.sourceforge.net
Sent: Wed, March 3, 2010 1:04:42 AM
Subject: Re: [Ganglia-general] web front end is receiving cut-off XML

> Bernard, my bad for the poor information. I’m using ports 8649 and 8651.
>
> From gmetad.conf on my concentrator:
>
> data_source wagrid waxgridqm.ciena.com:8651

Methinks the port above should be 8649. As it is now, gmetad is querying itself.

> gridname wagrid
> xml_port 8651
>
> From my gmond.conf file that I use across all clients and my concentrator:
>
> /* Feel free to specify as many udp_send_channels as you like. Gmond
>    used to only support having a single channel */
> udp_send_channel {
>   host = waxgridqm.ciena.com
>   port = 8649
>   ttl = 1
> }
>
> /* You can specify as many udp_recv_channels as you like as well. */
> udp_recv_channel {
>   port = 8649
> }
>
> /* You can specify as many tcp_accept_channels as you like to share
>    an xml description of the state of the cluster */
> tcp_accept_channel {
>   port = 8649
> }
>
> From: Bernard Li [mailto:bern...@vanhpc.org]
> Sent: Tuesday, March 02, 2010 12:47 PM
> To: Maes, Richard
> Cc: ganglia-general@lists.sourceforge.net
> Subject: Re: [Ganglia-general] web front end is receiving cut-off XML
>
> Hi Richard:
>
> On Fri, Feb 26, 2010 at 10:44 AM, Maes, Richard rm...@ciena.com wrote:
>> I have been having a problem with my web front end 3.1.2 where many of my hosts do or don’t show up in the web GUI.
>
> What OS are you running?
>
>> If I do a telnet localhost 8650 or 8651, I get full uncorrupted XML output with a message at the bottom that says “Connection closed by foreign host.”
>
> Did you mean 8651 and 8652? 8650 is not a standard Ganglia port.
>
> Can you please post the data_source line in your gmetad.conf file?
>
> One thing you can do to troubleshoot the problem is lower the number of hosts in your cluster and see if the situation changes. Also, try to see if you can isolate a host that could potentially be causing this issue.
>
> Cheers,
> Bernard
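Editor's sketch: the "gmetad is querying itself" mistake can be caught with a small check on gmetad.conf. This is only an illustration under stated assumptions: the function name and the `local_names` argument are hypothetical, the parser handles just the `data_source` and `xml_port` directives in their common forms, and it does not cope with IPv6 addresses.

```python
import shlex

GMOND_DEFAULT_PORT = 8649        # used when a data_source entry names no port
GMETAD_DEFAULT_XML_PORT = 8651   # used when xml_port is not set


def self_queries(conf_text, local_names):
    """Return (source, host, port) entries that point at gmetad's own xml_port.

    `local_names` is the set of names/addresses of the machine gmetad runs on.
    """
    xml_port = GMETAD_DEFAULT_XML_PORT
    entries = []
    for raw in conf_text.splitlines():
        fields = shlex.split(raw, comments=True)  # handles "quoted names" and '#'
        if not fields:
            continue
        if fields[0] == "xml_port" and len(fields) > 1:
            xml_port = int(fields[1])
        elif fields[0] == "data_source" and len(fields) > 2:
            name, rest = fields[1], fields[2:]
            if rest[0].isdigit():                 # optional polling interval
                rest = rest[1:]
            for item in rest:
                host, _, port = item.partition(":")
                entries.append((name, host, int(port) if port else GMOND_DEFAULT_PORT))
    return [e for e in entries if e[2] == xml_port and e[1] in local_names]


conf = """
data_source wagrid waxgridqm.ciena.com:8651
gridname wagrid
xml_port 8651
"""
print(self_queries(conf, {"waxgridqm.ciena.com"}))
```

For the configuration quoted above this flags the single `wagrid` entry. Changing the port to 8649, so that gmetad polls the local gmond instead, makes the list empty.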
Re: [Ganglia-general] gmond memory leaks
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de

----- Original Message ----
From: Scott Dworkis s...@mylife.com
To: Martin Knoblauch kn...@knobisoft.de
Cc: ganglia-general@lists.sourceforge.net
Sent: Wed, March 3, 2010 5:21:32 AM
Subject: Re: [Ganglia-general] gmond memory leaks

> finally had some time to do a few attempts at valgrind... so far it doesn't seem to be telling me much... the numbers it reports are in the megabyte and not gigabyte range that i'm seeing. after a couple hours of valgrind i see:
>
> ==31952== LEAK SUMMARY:
> ==31952==    definitely lost: 532 bytes in 23 blocks.
> ==31952==    indirectly lost: 271 bytes in 16 blocks.
> ==31952==      possibly lost: 13,872 bytes in 30 blocks.
> ==31952==    still reachable: 1,626,182 bytes in 2,188 blocks.
> ==31952==         suppressed: 0 bytes in 0 blocks.
> ==31952== Reachable blocks (those to which a pointer was found) are not shown.
> ==31952== To see them, rerun with: --leak-check=full --show-reachable=yes
>
> this doesn't grow much even after valgrinding overnight
>
> ==24957== LEAK SUMMARY:
> ==24957==    definitely lost: 2,404 bytes in 179 blocks.
> ==24957==    indirectly lost: 271 bytes in 16 blocks.
> ==24957==      possibly lost: 13,872 bytes in 30 blocks.
> ==24957==    still reachable: 1,626,182 bytes in 2,188 blocks.
> ==24957==         suppressed: 0 bytes in 0 blocks.
>
> in fact most of these numbers are identical, so they must be fixed losses in terms of valgrind accounting.

Did you try --leak-check=full --show-reachable=yes? I believe that is supposed to show all allocations. Might be a bit of output, but as far as I can see you are able to reproduce early.

> this does not really reflect the growth of my gmond process (running under valgrind here, so reported as memcheck), which i tracked with 5 minute samples of top for an hour, shows a linear leak of over 1GB during that period:
>
> (s...@admin3:16:43:/home/admin/monitoring/scripts) while [ 1 ];do top -n 1 | grep mem;sleep 300;done
> 24957 nobody 20 0 5623m 3.5g 3648 R  80 11.1 121:49.98 memcheck
> 24957 nobody 20 0 5753m 3.6g 3648 R  76 11.4 126:43.25 memcheck
> 24957 nobody 20 0 5948m 3.7g 3652 R 101 11.8 131:36.26 memcheck
> 24957 nobody 20 0 6108m 3.8g 3652 R  99 12.1 136:29.35 memcheck
> 24957 nobody 20 0 6267m 3.9g 3652 R  97 12.4 141:17.02 memcheck
> 24957 nobody 20 0 6436m 4.0g 3652 R  97 12.7 146:07.58 memcheck
> 24957 nobody 20 0 6547m 4.1g 3652 R  63 13.0 150:56.74 memcheck
> 24957 nobody 20 0 6707m 4.2g 3652 R  99 13.3 155:47.88 memcheck
> 24957 nobody 20 0 6917m 4.3g 3652 R  99 13.7 160:40.30 memcheck
> 24957 nobody 20 0 7055m 4.4g 3652 R  97 14.0 165:28.40 memcheck
> 24957 nobody 20 0 7201m 4.5g 3652 R 101 14.3 170:20.32 memcheck
> 24957 nobody 20 0 7340m 4.6g 3652 R  99 14.6 175:08.75 memcheck
>
> if i understand valgrind right, it's only orphaned data that's counted as lost... perhaps some structure is not orphaned but bloating?
>
> one other accidental observation, i have a job that generates 70k metrics every 5 minutes (a few dozen for every port on each of our switches)... these are all spoof ip metrics. this job had been accidentally disabled for a few days and i noticed that the leak virtually stopped. i can play some more with various parameters of this job and see if i find anything more... could be the spoof thing is coincidental but Rick Cobb also mentioned his leak seemed to be spoof related. i'll also see if sending heartbeats for the spoof ips helps anything.

spoofing might indeed be a hint.

Martin

> -scott
>
>> Message: 2
>> Date: Thu, 18 Feb 2010 07:15:33 -0800 (PST)
>> From: Martin Knoblauch
>> Subject: Re: [Ganglia-general] gmond memory leaks
>> To: Scott Dworkis
>> Cc: ganglia-general@lists.sourceforge.net
>> Message-ID: 880015.28351...@web113306.mail.gq1.yahoo.com
>> Content-Type: text/plain; charset=us-ascii
>>
>> ----- Original Message ----
>> From: Scott Dworkis
>> To: Martin Knoblauch
>> Cc: ganglia-general@lists.sourceforge.net
>> Sent: Wed, February 17, 2010 8:32:32 PM
>> Subject: Re: [Ganglia-general] gmond memory leaks
>>
>>> 3.1.2 on gentoo (that solaris must be a sourceforge ad?). i have zero experience with valgrind... i'll have a look but a smidge of guidance would be appreciated. :)
>>
>> Just get valgrind and run the leaking gmond under its control. gmond should be configured to not run in background. After some time interrupt it and you will get a report of valgrind's findings. For example, a simple program leaking 8x1MB will produce:
>>
>> [mknob...@l6g0223j ~]$ valgrind ./memeat
>> ==13647== Memcheck, a memory error detector.
>> ==13647== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
>> ==13647== Using LibVEX rev 1658, a library for dynamic binary translation.
>> ==13647== Copyright (C) 2004-2006, and GNU GPL'd
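Editor's sketch: the "linear leak" in the `top` samples above can be quantified with a least-squares slope. The numbers are the RES column from the twelve quoted samples, taken five minutes apart; the function name is made up for the illustration. Note that the process was running under valgrind, so absolute sizes are inflated compared to a plain gmond.

```python
# RES column (GB) from the twelve 5-minute `top` samples quoted above.
res_gb = [3.5, 3.6, 3.7, 3.8, 3.9, 4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6]
INTERVAL_MIN = 5


def growth_per_hour(samples, interval_min):
    """Least-squares slope of evenly spaced samples, converted to units per hour."""
    n = len(samples)
    xs = [i * interval_min for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var * 60


rate = growth_per_hour(res_gb, INTERVAL_MIN)
print(f"resident set grows by about {rate:.2f} GB per hour")
```

The resident size rises by 0.1 GB per sample, which works out to roughly 1.2 GB per hour and is consistent with the "over 1GB" observed in the post.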
Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.7 ready for testing
----- Original Message ----
From: Daniel Pocock dan...@pocock.com.au
To: kn...@knobisoft.de
Cc: ganglia-develop...@lists.sourceforge.net; ganglia-general@lists.sourceforge.net
Sent: Tue, March 2, 2010 12:23:32 PM
Subject: Re: [Ganglia-developers] [Ganglia-general] Ganglia 3.1.7 ready for testing

> Thanks to those who provided feedback - any objections to making 3.1.7 generally available?
>
> I would like to make it GA within the next 1-2 days now.

unless there is a [severe] regression compared to 3.1.2 - just let it escape. You know, the perfect is the enemy of the good.

Cheers
Martin

> Michael Perzl wrote:
>> I have successfully compiled and tested 3.1.7 on
>> - AIX 5.1 ML04
>> - AIX 5.3 ML00
>> - AIX 5.3 TL07
>> - AIX 6.1 TL03
>>
>> Regards,
>> Michael
>>
>> On 02/22/2010 12:15 PM, Daniel Pocock wrote:
>>> Just a reminder - any feedback is welcome, or feel free to discuss 3.1.7 on IRC
>>>
>>> It would be good to have positive confirmation of which platforms this has been tested on, so far, I have tested
>>> - Debian lenny,
>>> - RHEL3/4/5,
>>> - CentOS 5,
>>> - Solaris 8 and
>>> - Cygwin.
>>> and Brad has done some testing on SLES10
>>>
>>> Regards,
>>> Daniel
>>>
>>> Daniel Pocock wrote:
>>>> I've tagged 3.1.7 and built a tarball:
>>>>
>>>> http://ganglia.info/testing/ganglia-3.1.7.tar.gz
>>>>
>>>> The md5sum for 3.1.7 is: 6aa5e2109c2cc8007a6def0799cf1b4c
>>>>
>>>> Since 3.1.6, only two things have changed and may need to be tested again by those who tested 3.1.6:
>>>> - the build system (support for commas in CFLAGS)
>>>> - the multicpu module - percentages reported differently
>>>>
>>>> This is not confirmation that the release is in GA status - a further notification will be sent when the testing period has elapsed without any serious defect.
>>>>
>>>> Users are invited to test the tarball and submit feedback.
>>>>
>>>> Please do not commit on branches/monitor-core-3.1 until after 3.1.7 goes GA, in case further tweaks are needed to facilitate a successful release.
>>>>
>>>> Below are the release notes from the STATUS file. Other documentation has also changed since 3.1.2 and should be reviewed:
>>>>
>>>> GANGLIA 3.1 STATUS: -*-text-*-
>>>> Last modified at [$Date: 2010-02-17 11:01:08 + (Wed, 17 Feb 2010) $]
>>>>
>>>> The current version of this file can be found at:
>>>>
>>>> * http://ganglia.svn.sourceforge.net/svnroot/ganglia/branches/monitor-core-3.1/STATUS
>>>>
>>>> Release history:
>>>>
>>>> 3.1.7           : Tagged: Feb 17, 2010
>>>> 3.1.6           : Tagged: Feb 4, 2010 (not released for GA)
>>>> 3.1.5(hargrave) : Tagged: Nov 24, 2009 (not released for GA)
>>>> 3.1.4(hargrave) : Tagged: Oct 26, 2009 (not released for GA)
>>>> 3.1.3(avenger)  : Tagged: Sep 19, 2009 (not released for GA)
>>>> 3.1.2(langley)  : Released: Feb 17, 2009
>>>> 3.1.1(wien)     : Released: Sep 10, 2008
>>>> 3.1.0(amelia)   : Released: Jul 30, 2008
>>>>
>>>> Contributors looking for a mission:
>>>>
>>>> * Just do an egrep on TODO, XXX or FIXME in the source.
>>>> * Review the bug database at: http://bugzilla.ganglia.info/
>>>> * Open bugs in the bug database.
>>>> * Implement a feature from the wishlist at: http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_wish-list
>>>>
>>>> CURRENT RELEASE NOTES:
>>>> (Please update this area with a brief description of bug fixes and enhancements that have been backported for the current release)
>>>>
>>>> Note: 3.1.3, 3.1.4, 3.1.5 and 3.1.6 never became GA, therefore, the release notes for all of them are combined below.
>>>>
>>>> 3.1.7:
>>>>
>>>> * Fix build support for RHEL5/issue with commas in CFLAGS
>>>> * multicpu module: show CPU utilization as a value between 0-100% for each core
>>>>
>>>> 3.1.6:
>>>>
>>>> * Merge commit 1966 from trunk to fix contrib/removespikes.pl
>>>> * Bootstrapping with Debian 5.0 (lenny) versions of autotools for this and future releases.
>>>>   http://www.mail-archive.com/ganglia-develop...@lists.sourceforge.net/msg05352.html
>>>>   http://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg04688.html
>>>> * Require user to explicitly specify sysconfdir when building from source, due to the fact that the old behavior was not consistent with the documented behavior.
>>>> * Configuration files and scripts are now created during the install phase rather than during configure. This allows values such as @sysconfdir@ to be used in the template configuration files.
>>>> * Abolish the use of release names - only release numbers will be used to distinguish versions in future
>>>> * libmetrics: workaround system header conflict in DFBSD= 2.4 (BUG245)
>>>> * Use PCRE regex matching to configure metrics using the name_match directive
>>>> * rrdcached support
>>>> * gmetad now uses apr and the sleep intervals between polls are randomized in a way that supports shorter polling intervals
>>>> * FreeBSD support: fixes for crashes
Re: [Ganglia-general] replaced a host, new host not seen
----- Original Message ----
From: Rick Cobb rc...@quantcast.com
To: Cameron Spitzer cspit...@nvidia.com
Cc: ganglia-general@lists.sourceforge.net
Sent: Sat, February 27, 2010 4:03:05 AM
Subject: Re: [Ganglia-general] replaced a host, new host not seen

> Well, one cause of the confusion is your /etc/ganglia/gmetad.conf data_source entry. It should *only* have the address of gmonds that collect all metrics for a cluster, and only one of your gmonds is doing that.

Correct, listing gmonds that do not have all the information is the way to disaster.

> The Ganglia architecture can be very confusing. A 'gmond' has 3 tasks, and all but one of yours are only doing one of them:
>
> * Measure things about the local host and send them to the 'udp_send_channel'.

Which, in the case of multicast, means sending to every gmond that cares (is listening). In the case of unicast, it sends to *all* udp_send_channels. This is what I usually do: have two servers acting as headnodes for the monitoring. All monitoring clients have two udp_send_channels, sending their data to the two headnodes. I call these gmonds collectors, as they collect the data in the first place. And I made a mistake in my reply :-(

> * Receive measurements from any gmond (even itself) or gmetric on the 'udp_recv_channel' and put them in a local datastructure, which is basically a set (hash) of hosts with a set of current metrics per host. This is the step that resolves addresses to names.
>
> * Answer requests from gmetad for the whole cluster's metrics. (It does this on the tcp_accept_channel). Gmond just serializes the whole metrics datastructure into an XML document as the reply.

In my usual setup, these two functionalities reside on the headnode gmonds, which I call aggregators.

> If you have all your gmonds sending to one unicast address, only one of your gmonds *has* all the metrics for that cluster. That's what Martin called "designated as a collector". In that case, your data_source line should only

Actually I wanted to write "aggregator" for these gmonds.

> include that gmond (host). Adding the others can only cause problems -- if the first gmond fails, your gmetad will contact the second one in the list, and that won't actually have any metrics on it, since no one (including itself) is sending it any. All your nodes will (gradually as timeouts expire) appear to be down.
>
> 'gmond' will expire hosts if your gmond.conf has a non-zero 'host_dmax' entry (see http://linux.die.net/man/5/gmond.conf, among others).
>
> 'gmetad' is an entirely different beast from gmond; sometimes I think it was written by a completely different team. It polls your gmonds, writes the numeric metric values to RRDtool files, and responds to queries for (subsets of) metrics so front-ends can present them. It has *no* relationship with your udp_send_channel or udp_recv_channel; also, it has almost no (AFAIK) relationship to your network infrastructure -- it doesn't reverse-lookup addresses, for example. On the other hand, it does combine all the metrics for a cluster into a long-term in-memory data structure, and then combines those into a single 'grid-level' datastructure. In gmetad, metrics (including a last-heard-from metric ('RECORDED') for a host) can expire, but hosts just go 'down'; they never go away.
>
> So: if you haven't set a host_dmax, you have to stop gmetad, restart every gmond that the gmetad could talk to (i.e., everything on the data_source line), start gmetad.
>
> In your case, there's only one gmond that gmetad should talk to, so simplify your life by removing the rest from your data_source line. I'd set host_dmax, too, but that's a matter of taste.
>
> -- ReC
>
> On Feb 26, 2010, at 12:22 PM, Cameron Spitzer wrote:
>> I was able to remove the dead host (that isn't really dead) from the overview display. I had to kill all gmonds everywhere, and the gmetad. Then I removed the rrd files for the dead host from gmetad's rrds directory, and the rrd directory itself. Then I removed the dead host's IP address from gmetad.conf. Then I brought up all the gmonds (except the dead one) and then the gmetad.
>>
>> Apparently, these steps will have to be added to our failover procedure.
>>
>> Martin Knoblauch wrote:
>>> ...
>>> Also, just to better understand the situation, what is the exact setup? Is one of the gmonds designated as a collector? Or do all gmonds carry all metrics from all hosts? Which gmond is queried by gmetad (snippet from config file)? You should telnet/nc to that gmond and check whether it has current metrics from B.
>>
>> I don't know what "designated as a collector" means.

s/collector/aggregator/ and see above.

>> Nor do I know how to control which gmonds carry all metrics from which hosts. There is only one udp_send_channel in gmond.conf, and the host
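Editor's sketch: the host_dmax behaviour mentioned in this thread can be modelled in a few lines. This is a simplified model, not gmond's implementation. It relies only on the documented meaning of host_dmax (a value in seconds, where 0 means a host is never removed); the function and variable names are made up.

```python
def surviving_hosts(last_heard, now, host_dmax):
    """Return the hosts gmond would still list.

    last_heard: host name -> timestamp (seconds) of the last packet received.
    host_dmax:  0 keeps every host forever; otherwise a host is dropped once
                nothing has been heard from it for more than host_dmax seconds.
    """
    if host_dmax == 0:
        return dict(last_heard)
    return {h: t for h, t in last_heard.items() if now - t <= host_dmax}


# Hypothetical state: old-B was replaced and stopped reporting a day ago.
state = {"A": 99_990, "old-B": 13_600, "C": 99_985}

print(sorted(surviving_hosts(state, now=100_000, host_dmax=0)))
print(sorted(surviving_hosts(state, now=100_000, host_dmax=3600)))
```

With host_dmax at 0 the stale entry stays until the daemons are restarted, which matches the manual cleanup described above. With a non-zero value it ages out on its own.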
Re: [Ganglia-general] replaced a host, new host not seen
From: Ramon Bastiaans ramon.bastia...@sara.nl
To: Cameron Spitzer cspit...@nvidia.com
Cc: ganglia-general@lists.sourceforge.net
Sent: Fri, February 26, 2010 9:14:58 AM
Subject: Re: [Ganglia-general] replaced a host, new host not seen

> On 02/26/2010 02:46 AM, Cameron Spitzer wrote:
>> Bernard Li wrote:
>>> Same hostname too I presume? On gmetad, your hosts show up with hostnames, correct?
>>
>> Yes, same hostname.
>
> Is it perhaps showing up in the gmetad/web by its IP address instead of its hostname? That might indicate a DNS/hostname issue.
>
> Also make sure the newly replaced gmond host is not set to mute in the gmond.conf
>
>> Telnet from the master to the new host gives an XML document, same as the old one.
>>
>>> What I would test is telnet (or nc) from master to _another_ host and make sure that it has metrics from the new host.
>>
>> I don't understand that at all. Host A is running gmetad. Host B (gmond) is not getting graphed, even though it sends XML. Hosts C through W are working fine. How would telnet from A to C tell me what's wrong with B?
>
> When using multicast, all other gmonds contain the information of the other gmonds. Since you are using unicast that is not the case here.
>
>> Why would host C know anything about host B? Should any gmond host have information about all the other gmond hosts? In any case, the telnet output is the same from B and from C. There is no reference to any hosts in it.
>>
>>> Are you using multicast (default) or unicast?
>>
>> Unicast.
>
> Is the route from gmond host B to gmetad host A set correctly? Perhaps the gmond traffic is getting sent over the wrong interface.
>
> When in doubt I tend to use tcpdump myself to verify the traffic is getting sent.

Also, just to better understand the situation, what is the exact setup? Is one of the gmonds designated as a collector? Or do all gmonds carry all metrics from all hosts? Which gmond is queried by gmetad (snippet from config file)? You should telnet/nc to that gmond and check whether it has current metrics from B.

Cheers
Martin
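Editor's sketch: the telnet/nc check suggested above can also be scripted. The sketch reads the XML that a gmond sends on its tcp_accept_channel and reports, per HOST element, how long ago that gmond last heard from the host (the REPORTED attribute). Function names and the sample document are made up for the illustration, and the sample is trimmed to the attributes the code uses.

```python
import socket
import time
import xml.etree.ElementTree as ET


def fetch_xml(host, port=8649, timeout=5.0):
    """Read the full XML dump a gmond sends on its tcp_accept_channel."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:          # gmond closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks)


def host_ages(xml_bytes, now=None):
    """Map host name -> seconds since the gmond last heard from that host."""
    now = time.time() if now is None else now
    root = ET.fromstring(xml_bytes)
    return {h.get("NAME"): now - int(h.get("REPORTED")) for h in root.iter("HOST")}


# Trimmed, made-up sample of what an aggregating gmond might return.
sample = (
    b'<GANGLIA_XML VERSION="3.1.2" SOURCE="gmond">'
    b'<CLUSTER NAME="demo" LOCALTIME="1000">'
    b'<HOST NAME="A" IP="10.0.0.1" REPORTED="990"/>'
    b'<HOST NAME="C" IP="10.0.0.3" REPORTED="400"/>'
    b'</CLUSTER></GANGLIA_XML>'
)
ages = host_ages(sample, now=1000)
print(ages, "B" in ages)
```

Against a live system one would call something like `host_ages(fetch_xml("aggregator-host"))`, where the host name is a placeholder for whichever gmond appears on the data_source line. A host that is missing from the result, or whose age keeps growing, is not reaching that gmond.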
Re: [Ganglia-general] gmond memory leaks
----- Original Message ----
From: Scott Dworkis s...@mylife.com
To: Martin Knoblauch kn...@knobisoft.de
Cc: ganglia-general@lists.sourceforge.net
Sent: Wed, February 17, 2010 8:32:32 PM
Subject: Re: [Ganglia-general] gmond memory leaks

> 3.1.2 on gentoo (that solaris must be a sourceforge ad?). i have zero experience with valgrind... i'll have a look but a smidge of guidance would be appreciated. :)

Just get valgrind and run the leaking gmond under its control. gmond should be configured to not run in background. After some time interrupt it and you will get a report of valgrind's findings. For example, a simple program leaking 8x1MB will produce:

[mknob...@l6g0223j ~]$ valgrind ./memeat
==13647== Memcheck, a memory error detector.
==13647== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
==13647== Using LibVEX rev 1658, a library for dynamic binary translation.
==13647== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
==13647== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
==13647== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
==13647== For more details, rerun with: -v
==13647==
^C
==13647==
==13647== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 5 from 1)
==13647== malloc/free: in use at exit: 8,000,000 bytes in 8 blocks.
==13647== malloc/free: 8 allocs, 0 frees, 8,000,000 bytes allocated.
==13647== For counts of detected errors, rerun with: -v
==13647== searching for pointers to 8 not-freed blocks.
==13647== checked 66,440 bytes.
==13647==
==13647== LEAK SUMMARY:
==13647==    definitely lost: 8,000,000 bytes in 8 blocks.
==13647==      possibly lost: 0 bytes in 0 blocks.
==13647==    still reachable: 0 bytes in 0 blocks.
==13647==         suppressed: 0 bytes in 0 blocks.
==13647== Use --leak-check=full to see details of leaked memory.

If you use --leak-check=full, it will tell you where the leaking memory was allocated. gmond needs to be compiled with debug info (-g).

A few questions.

- What is your setup? I assume quite a few hosts monitoring (collectors) metrics and one aggregating the results.
- Which of the gmonds leak? The collectors, the aggregator or both?

Cheers
Martin

> yeah 150k metrics is a lot... i have an interest in scaling this thing. i'll post another thread about things i've done to scale so far that seem to be working well.
>
> On Wed, 17 Feb 2010, Martin Knoblauch wrote:
>> Hi Scott,
>>
>> which version of Ganglia and which operating environment do you have (guessing Solaris from your signature :-)? Any chance that you could run valgrind or equivalent on your setup? 10GB/day is a lot, as is 150k metrics.
>>
>> Cheers
>> Martin
>> --
>> Martin Knoblauch
>> email: k n o b i AT knobisoft DOT de
>> www: http://www.knobisoft.de
>>
>> ----- Original Message ----
>> From: Scott Dworkis
>> To: ganglia-general@lists.sourceforge.net
>> Sent: Wed, February 17, 2010 3:08:26 AM
>> Subject: [Ganglia-general] gmond memory leaks
>>
>>> (sorry if this is a repost... i tried previously without having first subscribed to the list, and fear i got lost somewhere along the moderation path)
>>>
>>> hi all - i am seeing gmond leak about 10GB/day on about 150k metrics collected. it seemed like things worsened when i added dmax to all my custom metrics, but maybe it was always bad. is this a known issue? sorry if it is already known... i couldn't see that there was a good way to search the forums or if there is a bug tracker to search.
>>>
>>> -scott
>>>
>>> --
>>> SOLARIS 10 is the OS for Data Centers - provides features such as DTrace, Predictive Self Healing and Award Winning ZFS. Get Solaris 10 NOW
>>> http://p.sf.net/sfu/solaris-dev2dev
>>> ___
>>> Ganglia-general mailing list
>>> Ganglia-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/ganglia-general
[Ganglia-general] Fw: any workaround for the bogus spikes problem?
forgot the list ...

----- Forwarded Message ----
From: Martin Knoblauch kn...@knobisoft.de
To: Cameron Spitzer cspit...@nvidia.com
Sent: Wed, February 3, 2010 11:48:10 AM
Subject: Re: [Ganglia-general] any workaround for the bogus spikes problem?

From: Cameron Spitzer
To: kn...@knobisoft.de
Cc: ganglia-general@lists.sourceforge.net
Sent: Tue, February 2, 2010 6:49:52 PM
Subject: Re: [Ganglia-general] any workaround for the bogus spikes problem?

> Martin Knoblauch wrote:
>>> We're trying to use Ganglia to monitor some HP DL580-G5 machines. We're using a 64-bit linux-2.6.16.
>>
>> which version of Ganglia?
>
> ganglia-3.1.2
>
>>> The network traffic information is polluted with phantom 20 PB traffic spikes.
>
> I tried lowering the silliness threshold from 1e13 and 1e8 to 4.0e9 and 3.0e6, and I cranked the collect_every on that group from 40 (seconds?) to 5. Now I get exabyte peaks instead of petabyte peaks.
>
>> what kind of NIC do you have (1GB, 10 GB)? Which hardware and driver is loaded? What is the average network throughput you see?
>
> It's the 1 Gbps NIC on the server motherboard, BCM5708 Rev 12. dmesg says, Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v1.5.5b (January 31, 2007).

BCM sounds familiar. Which distro are you using, which kernel?

>>> I found an ifdef for REMOVE_BOGUS_SPIKES in libmetrics/linux/metrics.c Defining it has no effect.
>>
>> Maybe you can add some debugging output and check whether that stuff is triggered at all. Maybe the thresholds are not good anymore.
>
> Some hints about how to do that would help. I've tried adding err_msg() calls and I can't find where the messages go. They're not in any of the syslog channels. I don't understand the structure of libmetrics/linux/metrics.c well enough to guess where it would make sense to open a new log file.

If daemonized, messages go to syslog. If run in foreground, they go to stderr. Just try running the gmond with -d 1 in foreground. You should already get some output in the overflow case.

>> And btw. that code does not *remove* bogus spikes from the RRD database. It just is supposed to prevent their generation.
>
> I realize that. With each hack to libmetrics/linux/metrics.c, I've been stopping gmetad and removing all the corrupted rrd files. I don't know how to edit an rrd file.

The contrib directory in trunk has the actual removespikes.pl file from the RRD source repository. Useful for updating databases that you do not want to throw away.

>>> Can anyone tell me the unit of measure which applies to l_bin and l_bout in that file? Is it bytes per second, bytes per collect_every, bytes per time_threshold?
>>
>> Not completely sure.
>
> It would be really great if the authors of libmetrics/linux/metrics.c would document it.

Looking at the code, it is per second:

    /*
    ** Compute timediff. Check for bogus delta-t
    */
    float t = timediff(proc_net_dev.last_read,stamp);
    if ( t < proc_net_dev.thresh) {
        err_msg("update_ifdata(%s) - Dubious delta-t: %f", caller, t);
        return;
    }
    stamp = proc_net_dev.last_read;

    /*
    ** Compute rates in local variables
    */
    l_bin = l_bytes_in / t;
    l_bout = l_bytes_out / t;
    l_pin = l_pkts_in / t;
    l_pout = l_pkts_out / t;

Cheers
Martin

--
The Planet: dedicated and managed hosting, cloud storage, colocation
Stay online with enterprise data centers and the best network in the business
Choose flexible plans and management services without long-term contracts
Personal 24x7 support from experience hosting pros just a phone call away.
http://p.sf.net/sfu/theplanet-com
___
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
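Editor's sketch: the per-second computation quoted above, together with the kind of sanity checks discussed in this thread, might look roughly as follows. This is not the libmetrics logic. The function name, the `min_dt` and `max_rate` parameters and their values are assumptions made for the illustration, and a counter that goes backwards is simply discarded here rather than corrected for wrap-around.

```python
def counter_rate(prev_count, prev_time, count, now, min_dt=0.0001, max_rate=None):
    """Return a per-second rate from two counter readings, or None to discard.

    min_dt:   readings closer together than this are treated as a dubious
              delta-t (mirrors the check in the quoted C fragment).
    max_rate: optional ceiling, e.g. what the NIC can physically deliver;
              anything above it is treated as a bogus spike.
    """
    dt = now - prev_time
    if dt < min_dt:
        return None
    rate = (count - prev_count) / dt
    if rate < 0:
        return None                  # counter wrapped or was reset
    if max_rate is not None and rate > max_rate:
        return None                  # faster than the link allows
    return rate


GIGABIT_BYTES_PER_S = 1e9 / 8        # 125 MB/s ceiling for a 1 Gbit/s NIC

print(counter_rate(0, 0.0, 1_000_000, 10.0, max_rate=GIGABIT_BYTES_PER_S))
print(counter_rate(0, 0.0, 2 * 10**16, 10.0, max_rate=GIGABIT_BYTES_PER_S))
```

The first call yields 100000.0 bytes per second. The second reading implies petabytes per second on a gigabit link, so it is dropped instead of being written to the RRD file.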
Re: [Ganglia-general] any workaround for the bogus spikes problem?
----- Original Message ----
From: Cameron Spitzer cspit...@nvidia.com
To: ganglia-general@lists.sourceforge.net
Sent: Tue, February 2, 2010 12:41:46 AM
Subject: [Ganglia-general] any workaround for the bogus spikes problem?

Hi Cameron,

> We're trying to use Ganglia to monitor some HP DL580-G5 machines. We're using a 64-bit linux-2.6.16.

which version of Ganglia?

> The network traffic information is polluted with phantom 20 PB traffic spikes.

what kind of NIC do you have (1GB, 10 GB)? Which hardware and driver is loaded? What is the average network throughput you see?

> I found an ifdef for REMOVE_BOGUS_SPIKES in libmetrics/linux/metrics.c Defining it has no effect. I see in the archive this problem has been around for years. Has anyone solved this problem?

I am kind of surprised that it does not help. When I wrote that hack a few years ago for 3.0.X, it worked perfectly. I was fighting a driver bug that caused spurious overruns of the driver counters.

Maybe you can add some debugging output and check whether that stuff is triggered at all. Maybe the thresholds are not good anymore.

And btw. that code does not *remove* bogus spikes from the RRD database. It just is supposed to prevent their generation.

> Can anyone tell me the unit of measure which applies to l_bin and l_bout in that file? Is it bytes per second, bytes per collect_every, bytes per time_threshold?

Not completely sure.

Cheers
Martin
Re: [Ganglia-general] Line width in small report graphs
+1, looks really better

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de

----- Original Message ----
From: Bernard Li bern...@vanhpc.org
To: Jesse Becker haw...@gmail.com
Cc: Ganglia Mailing List ganglia-general@lists.sourceforge.net
Sent: Wednesday, September 16, 2009 8:31:23 PM
Subject: Re: [Ganglia-general] Line width in small report graphs

> +1 as long as there isn't a compelling reason otherwise ;-)
>
> Cheers,
> Bernard
>
> On Wed, Sep 16, 2009 at 11:25 AM, Jesse Becker wrote:
>> Right now, the {load,packet,network}_report graphs are all hard-coded to use LINE2 for several of the metrics. This looks quite nice on the larger graph sizes (i.e. 'medium' and 'large'), but doesn't look quite so good on smaller sizes. I'd like to change this to LINE1, but only for the small graph sizes.
>>
>> Now, with LINE2: http://bayimg.com/PAdpBAACo
>> Proposed, with LINE1: http://bayimg.com/paDpdAAcO
>>
>> Comments?
>>
>> --
>> Jesse Becker
>>
>> --
>> Come build with us! The BlackBerry® Developer Conference in SF, CA is the only developer event you need to attend this year. Jumpstart your developing skills, take BlackBerry mobile applications to market and stay ahead of the curve. Join us from November 9-12, 2009. Register now! http://p.sf.net/sfu/devconf
>> ___
>> Ganglia-general mailing list
>> Ganglia-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] gmetric fails when disk is unwriteable?
----- Original Message ----
From: Carlo Marcelo Arenas Belon [EMAIL PROTECTED]
To: Ofer Inbar [EMAIL PROTECTED]
Cc: ganglia-general@lists.sourceforge.net
Sent: Tuesday, November 25, 2008 9:49:22 AM
Subject: Re: [Ganglia-general] gmetric fails when disk is unwriteable?

> On Fri, Nov 21, 2008 at 11:33:05PM -0500, Ofer Inbar wrote:
>> What's the dependency that causes gmetric to require that the filesystem the CWD is on be writeable?
>
> as explained by Brad it is not the CWD that needs to be writeable but a TMPDIR (which for root can also be the current directory) and that is detected by APR.
>
> Recent Linux (since around kernel 2.4.16) requires a ramdrive mounted in /dev/shm, so one way to work around this problem is to define:
>
> TMPDIR=/dev/shm

Is TMPDIR only used for the include file handler, or also for other stuff? I just want to be sure that we do not fill memory with lots of unexpected data.

Cheers
Martin

--
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100url=/
___
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] crazy network graph spikes
>> There is currently no resolution to this issue. 3.1.x does not fix this problem, however you could work around it by doing this:
>
> It could be nice to have the option to supply max values so any data bigger than that max gets discarded :-) for specific metrics, of course :-)

Definitely. That would be a useful addition.

Martin
Re: [Ganglia-general] Question about the bytes_in and bytes_out reports
----- Original Message ----
From: Bryan Duxbury [EMAIL PROTECTED]
To: ganglia-general@lists.sourceforge.net
Sent: Thursday, September 25, 2008 3:47:48 AM
Subject: [Ganglia-general] Question about the bytes_in and bytes_out reports

> Hey all,
>
> I'm a new user of Ganglia. Right now I'm running it on a 7-machine cluster of CentOS 5 boxes. Everything appears to be working pretty well, except for the bytes_in and bytes_out graph. It appears to always be zero, no matter how much traffic there is. I think I read in some mailing list thread somewhere that this has to do with having gigabit ethernet on the machines. Is this a known issue? Does anyone know what the proper path to a fix would be?
>
> Thanks,
> Bryan Duxbury

Hi Bryan,

in short, Gigabit NICs should not cause such problems. We need some more information:

- which version of Ganglia are you running? 3.1.x or 3.0.x?
- are the pkts_in/pkts_out graphs showing anything useful?

Thanks
Martin
Re: [Ganglia-general] Anyone experience petabyte peaks in network metric in ganglia 3.x.y ?
----- Original Message -----
From: Witham, Timothy D [EMAIL PROTECTED]
To: Escobio, Roger [EMAIL PROTECTED]; ganglia-general@lists.sourceforge.net
Sent: Tuesday, September 9, 2008 9:42:34 PM
Subject: Re: [Ganglia-general] Anyone experience petabyte peaks in network metric in ganglia 3.x.y ?

>> I am testing ganglia in a cluster of linux but we are getting these confusing peaks in the bytes/s and in the packets/s (image attached)
> I have been able to minimize this significantly by using code from svn trunk and building with
>     make CPPFLAGS=-DREMOVE_BOGUS_SPIKES
> IMHO, that should be the default.

Hi Tim,

the problem is that with NICs faster than 1000 Mbit, the naturally occurring wrap-arounds will come too frequently (especially for the byte counters), will trigger the remove mechanism and really mess up the data. The better solution would be to bring the networking counters in the Linux kernel to 64-bit (they are 32-bit right now). Then we would not have to care about natural wrap-around for a few years. I once proposed this change, but it was not greeted with much enthusiasm :-( Therefore I #ifdef-ed my check, especially as the effect really seems to be a very NIC-specific bug.

Escobio - what NICs are in the systems in question (all the same?). As I understand, you are using some 2.6.9 kernel?

Cheers
Martin
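The wrap-around handling discussed above can be illustrated with a small sketch. This is not the Ganglia code: the function names are hypothetical, and the "discard" policy is only an assumption about what a spike-removal check might do. It shows why a modular difference recovers one natural wrap of a 32-bit counter, and why more than one wrap between two samples cannot be detected at all.

```python
COUNTER_BITS = 32  # assumption: a 32-bit byte counter, as described in the message


def counter_delta(current, previous, bits=COUNTER_BITS):
    """Bytes counted between two samples, assuming at most one wrap-around."""
    return (current - previous) % (1 << bits)


def delta_or_discard(current, previous):
    """Alternative policy: treat any backwards step as bogus and drop the sample."""
    if current < previous:
        return None
    return current - previous


# One natural wrap: the counter was 1000 below the 32-bit limit and is now at 1000.
previous = (1 << 32) - 1000
print(counter_delta(1000, previous))      # 2000
print(delta_or_discard(1000, previous))   # None (sample thrown away)

# Two wraps between samples look exactly like a small delta; the extra 2**32 bytes are lost.
true_bytes = (1 << 32) + 2000
observed = (previous + true_bytes) % (1 << 32)
print(counter_delta(observed, previous))  # 2000
```

The faster the NIC, the more often the discard policy fires on perfectly valid samples, which matches the concern raised in the reply.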
Re: [Ganglia-general] Anyone experience petabyte peaks in network metric in ganglia 3.x.y ?
-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de

----- Original Message -----
From: Escobio, Roger [EMAIL PROTECTED]
To: ganglia-general@lists.sourceforge.net
Sent: Wednesday, September 10, 2008 2:40:27 PM
Subject: Re: [Ganglia-general] Anyone experience petabyte peaks in network metric in ganglia 3.x.y ?

>> -----Original Message-----
>> From: Martin Knoblauch [mailto:[EMAIL PROTECTED]
>> Sent: September 10, 2008 6:55 AM
>> To: Witham, Timothy D; Escobio, Roger [CMB-IT]; ganglia-general@lists.sourceforge.net
>> Subject: Re: [Ganglia-general] Anyone experience petabyte peaks in network metric in ganglia 3.x.y ?
>>
>>> ----- Original Message -----
>>> From: Witham, Timothy D
>>> To: Escobio, Roger; ganglia-general@lists.sourceforge.net
>>> Sent: Tuesday, September 9, 2008 9:42:34 PM
>>> Subject: Re: [Ganglia-general] Anyone experience petabyte peaks in network metric in ganglia 3.x.y ?
>>>
>>>> I am testing ganglia in a cluster of linux but we are getting these confusing peaks in the bytes/s and in the packets/s (image attached)
>>> I have been able to minimize this significantly by using code from svn trunk and building with
>>>     make CPPFLAGS=-DREMOVE_BOGUS_SPIKES
>>> IMHO, that should be the default.
>>
>> Hi Tim, the problem is that with NICs faster than 1000 Mbit, the naturally occurring wrap-arounds will come too frequently (especially for the byte counters), will trigger the remove mechanism and really mess up the data. The better solution would be to bring the networking counters in the Linux kernel to 64-bit (they are 32-bit right now). Then we would not have to care about natural wrap-around for a few years. I once proposed this change, but it was not greeted with much enthusiasm :-( Therefore I #ifdef-ed my check, especially as the effect really seems to be a very NIC-specific bug.
>> Escobio - what NICs are in the systems in question (all the same?). As I understand, you are using some 2.6.9 kernel?
>
> You are right, we have been seeing these random peaks in HP servers with:
>   Broadcom Corporation NetXtreme II BCM5708S Gigabit Ethernet
>   Broadcom Corporation NetXtreme II BCM5706 Gigabit Ethernet

Hi Escobio,

I observed the problem on 2.6.9-42.ELsmp and BCM5708 Gigabit Ethernet (rev 11) NICs with the bnx2 drivers. The problem is some weird bug when DMAing the counters. Solved in the 2.6.17 timeframe IIRC. The fix might even have been backported to RHEL4Ux, where x > 4.

> Running 2.6.9 (redhat kernel :-) ). Kernel 2.4.9 does not see the effect, right?

Not sure whether those NICs were supported in the stone age :-)

> How good would it be to have a maxvalue for bytes/s in the definition of the metrics? Then, if the counter's diff gives more than that, just discard that read. I know that this will not solve the packets/s peak, but it could be a safe check before adding the values to the stats.
> I created a patch against linux/metrics.c (3.1.1 version) to add the counterdiff function found in *bsd/metrics.c. Are you interested in it? Just let me know and I'll send it to the list.

Yes please. I would definitely like to have a look at your patch.

Cheers
Martin
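The max-value check proposed in this exchange could look roughly like the sketch below. It is not the patch mentioned in the message, which is not shown in this thread: `guarded_rate` and `GIGABIT_MAX_BYTES_PER_S` are hypothetical names, and the limit is simply derived from the nominal link speed.

```python
# Assumption: the ceiling for a 1 Gbit/s link is its nominal speed in bytes per second.
GIGABIT_MAX_BYTES_PER_S = 1_000_000_000 / 8  # 125,000,000 bytes/s


def guarded_rate(current, previous, interval_s, max_rate):
    """Return the rate in units per second, or None if the reading should be discarded."""
    if interval_s <= 0 or current < previous:
        return None  # unusable interval or counter went backwards
    rate = (current - previous) / interval_s
    if rate > max_rate:
        return None  # physically impossible for this link, so drop the reading
    return rate


# A plausible reading: 1,000,000 bytes in 10 seconds.
print(guarded_rate(2_000_000, 1_000_000, 10, GIGABIT_MAX_BYTES_PER_S))  # 100000.0

# A bogus jump in the counter is rejected instead of producing a petabyte-range spike.
print(guarded_rate(10**18, 0, 10, GIGABIT_MAX_BYTES_PER_S))             # None
```

As noted in the message, this only helps where a sensible ceiling exists; packets/s has no equally obvious maximum.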
Re: [Ganglia-general] [Ganglia-developers] [ANNOUNCEMENT] Ganglia 3.1.0 tarball ready fortesting...
Hi Craig,

basically it is summing up all network interfaces with the exception of loX and the bonding interfaces (at least for Linux). Per-interface sampling is planned for some future release (not the upcoming 3.1.0).

Cheers
Martin
-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de

----- Original Message -----
From: Craig Simpson [EMAIL PROTECTED]
To: ganglia-general@lists.sourceforge.net
Sent: Tuesday, July 29, 2008 8:12:47 PM
Subject: Re: [Ganglia-general] [Ganglia-developers] [ANNOUNCEMENT] Ganglia 3.1.0 tarball ready fortesting...

> Pardon my uncertainty about the default checks in /etc/gmond.conf. For the network stuff, what interface is it binding to? How does it figure that out? On my cluster I have several interfaces and am doing NIC bonding on Linux. So really I would want to bind that to an alias.
> Thanks! Craig
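The summing described in the reply can be sketched as follows. This is an illustration only, not the libmetrics implementation: it parses text in the layout of Linux `/proc/net/dev`, and the function name, the sample data and the prefix-based filter (`lo`, `bond`) are assumptions made for the example.

```python
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:     100       1    0    0    0     0          0         0      100       1    0    0    0     0       0          0
  eth0:    1000      10    0    0    0     0          0         0     2000      20    0    0    0     0       0          0
  eth1:    3000      30    0    0    0     0          0         0     4000      40    0    0    0     0       0          0
 bond0:    4000      40    0    0    0     0          0         0     6000      60    0    0    0     0       0          0
"""


def sum_interfaces(proc_net_dev_text, skip_prefixes=("lo", "bond")):
    """Add up byte and packet counters over all interfaces except the skipped ones."""
    totals = {"bytes_in": 0, "pkts_in": 0, "bytes_out": 0, "pkts_out": 0}
    for line in proc_net_dev_text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        if name.strip().startswith(skip_prefixes):
            continue  # loopback and bonding masters would double-count traffic
        fields = data.split()
        if len(fields) < 10:
            continue
        totals["bytes_in"] += int(fields[0])   # receive bytes
        totals["pkts_in"] += int(fields[1])    # receive packets
        totals["bytes_out"] += int(fields[8])  # transmit bytes
        totals["pkts_out"] += int(fields[9])   # transmit packets
    return totals


print(sum_interfaces(SAMPLE))
# {'bytes_in': 4000, 'pkts_in': 40, 'bytes_out': 6000, 'pkts_out': 60}
```

Skipping the bonding master avoids counting the same traffic twice, since the slave interfaces (eth0 and eth1 in the sample) already carry it.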
Re: [Ganglia-general] Is there any APIs or DB data I can use to getmetrics?
Hi Igor,

unless you want to rewrite gmetad completely, this is the way to query the database. Basically port 8651 gives you everything, while 8652 allows you to do specific queries. Not sure where/whether the query mechanism is actually documented outside the gmetad sources. You can have a look at how the web frontend uses port 8652.

Cheers
Martin
-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de

----- Original Message -----
From: Igor Rosenberg [EMAIL PROTECTED]
To: Hu, Wenzhong [EMAIL PROTECTED]
Cc: ganglia-general@lists.sourceforge.net
Sent: Wednesday, May 7, 2008 9:54:57 AM
Subject: Re: [Ganglia-general] Is there any APIs or DB data I can use to getmetrics?

Hi,

Well, I looked for a way to make sure ganglia was working. The doc suggests polling these interfaces with telnet. Then I understood this was only opening a socket. I decided to make my own in Java when I couldn't find any existing example. But I'm not sure it's the best way. I am quite certain there must be a way to perform database queries directly.

Best
Igor

-----Original Message-----
From: Hu, Wenzhong [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 07, 2008 5:01
To: Igor Rosenberg
Cc: ganglia-general@lists.sourceforge.net
Subject: RE: [Ganglia-general] Is there any APIs or DB data I can use to getmetrics?

Thanks Igor,

How did you find out this method? It's quite amazing. I will try it on other versions if I have time. And maybe somebody somewhere can try on other versions also, hopefully :)

Regards,
Stephen

-----Original Message-----
From: Igor Rosenberg [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 06, 2008 9:57 PM
To: Hu, Wenzhong [CMB-IT]
Cc: ganglia-general@lists.sourceforge.net
Subject: RE: [Ganglia-general] Is there any APIs or DB data I can use to getmetrics?

Hello,

I've also come upon the same need, and have resorted (for lack of information) to polling gmetad directly. My solution works for version 3.0.6, I've never tested any other.

You can connect a socket to ports 8651 and 8652 of the machine running gmetad (I don't know what the difference between the two ports is). You receive an XML file of the last status monitored. The schema of the result is provided within the answer. I've attached sample output to this mail (one Grid containing one cluster containing one machine).

To test the gmetad output yourself, see it by running

    telnet ip 8651

where ip is the IP of the machine running gmetad.

If you speak Java, you may use ganglia in your programs by modifying the following code snippet:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.Socket;
    import java.net.UnknownHostException;

    /**
     * Get a reader on the Ganglia output, which you can then parse with your preferred XML parser.
     * @see http://www.mail-archive.com/[EMAIL PROTECTED]/msg 03642.html
     */
    protected BufferedReader openGangliaSocket() throws UnknownHostException, IOException {
        String gangliaHost = "192.168.1.2";
        int gangliaPort = 8651;
        // Another poll string can be something matching /GRIDNAME/MACHINENAME/METRIC
        String socketCall = "";
        System.out.println("Polling socket " + gangliaHost + ":" + gangliaPort + ", cmd = " + socketCall);
        Socket gangliaSocket = new Socket(gangliaHost, gangliaPort);
        // Auto-flushing writer, so the poll string is sent immediately
        PrintWriter gangliaWriter = new PrintWriter(gangliaSocket.getOutputStream(), true);
        gangliaWriter.println(socketCall);
        BufferedReader gangliaReader = new BufferedReader(new InputStreamReader(gangliaSocket.getInputStream()));
        return gangliaReader;
    }

Hope that helps somebody somewhere :)
Igor

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Hu, Wenzhong
Sent: Monday, May 05, 2008 15:04
To: Carlo Marcelo Arenas Belon
Cc: ganglia-general@lists.sourceforge.net
Subject: Re: [Ganglia-general] Is there any APIs or DB data I can use to getmetrics?

Hi Carlo,

Your explanation is very clear. Now I know where I should start. Thanks very much indeed.

Stephen

-----Original Message-----
From: Carlo Marcelo Arenas Belon [mailto:[EMAIL PROTECTED]
Sent: Monday, May 05, 2008 7:30 PM
To: Hu, Wenzhong [CMB-IT]
Cc: Ron Wellnitz; ganglia-general@lists.sourceforge.net
Subject: Re: [Ganglia-general] Is there any APIs or DB data I can use to get metrics?

On Mon, May 05, 2008 at 06:11:51PM +0800, Hu, Wenzhong wrote:
> What I need is the rrdtool schema or something for Ganglia :)

rrdtool is a time series database, so there is technically no such thing as a schema (like you would expect on a relational database), as each metric is stored in an independent file (of fixed size and continuously doing summarizations), and the cluster is represented by a directory tree on disk. the definition of which and how many buckets (known as RRAs) to have for each metric
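For readers who want to try the socket polling discussed in this thread without Java, here is a minimal Python sketch. It assumes what the thread describes: gmetad dumps XML on port 8651 as soon as a client connects, and port 8652 expects a one-line query first. The function names, the sample XML and the element and attribute names used in it (`CLUSTER`, `HOST`, `METRIC`, `NAME`, `VAL`) are assumptions to be checked against real output from your own gmetad.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmetad_xml(host, port=8651, query=None, timeout=10.0):
    """Read the complete XML answer from gmetad; send `query` first if given (port 8652)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        if query is not None:
            sock.sendall(query.encode("ascii") + b"\n")
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:  # gmetad closes the connection when the dump is complete
                break
            chunks.append(data)
    return b"".join(chunks)


def parse_metrics(xml_data):
    """Map (cluster, host, metric) to the reported value, kept as a string."""
    root = ET.fromstring(xml_data)
    metrics = {}
    for cluster in root.iter("CLUSTER"):
        for host in cluster.iter("HOST"):
            for metric in host.iter("METRIC"):
                key = (cluster.get("NAME"), host.get("NAME"), metric.get("NAME"))
                metrics[key] = metric.get("VAL")
    return metrics


# Hypothetical, heavily shortened answer used only to exercise the parser.
SAMPLE_XML = """
<GANGLIA_XML VERSION="3.0.6" SOURCE="gmetad">
  <GRID NAME="mygrid">
    <CLUSTER NAME="mycluster">
      <HOST NAME="node1" IP="192.168.1.10">
        <METRIC NAME="load_one" VAL="0.25" TYPE="float" UNITS=""/>
        <METRIC NAME="bytes_in" VAL="1234.5" TYPE="float" UNITS="bytes/sec"/>
      </HOST>
    </CLUSTER>
  </GRID>
</GANGLIA_XML>
"""

print(parse_metrics(SAMPLE_XML)[("mycluster", "node1", "load_one")])  # 0.25
```

Against a live daemon the call would be something like `parse_metrics(fetch_gmetad_xml("192.168.1.2"))`, with the address replaced by the machine running gmetad.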
Re: [Ganglia-general] Need a script to remove spikes from network RRDs
----- Original Message -----
From: Martin Knoblauch [EMAIL PROTECTED]
To: john allspaw [EMAIL PROTECTED]; [EMAIL PROTECTED]
Cc: ganglia general ganglia-general@lists.sourceforge.net
Sent: Wednesday, February 27, 2008 8:55:26 AM
Subject: Re: [Ganglia-general] Need a script to remove spikes from network RRDs

> ----- Original Message -----
> From: john allspaw
> To: Martin Knoblauch; [EMAIL PROTECTED]
> Cc: ganglia general
> Sent: Tuesday, February 26, 2008 7:38:07 PM
> Subject: Re: [Ganglia-general] Need a script to remove spikes from network RRDs
>
>> Here is what comes with rrdtool, I've used it with some success...
>> http://oss.oetiker.ch/rrdtool/pub/contrib/removespikes.tar.gz
>> -john
>
> Cool. Almost what I need. It seems to be a bit too smart for my purpose, but making things stupid is easy :-)

Hi John,

after adding an option/mode to remove based on value instead of bin-distribution, the tool did exactly what I needed. I have pushed back my changes to the rrd people. Thanks a lot.

For the meeting: Should we contact the author and ask whether we can put the script into the distribution under cool-stuff?

Cheers
Martin
Re: [Ganglia-general] Need a script to remove spikes from network RRDs
----- Original Message -----
From: aurbain [EMAIL PROTECTED]
To: Martin Knoblauch [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; ganglia general ganglia-general@lists.sourceforge.net
Sent: Wednesday, February 27, 2008 5:11:48 PM
Subject: Re: [Ganglia-general] Need a script to remove spikes from network RRDs

> Thanks for the info Martin. So it's not a rollover issue after all. By the way, this issue also lives in rhel4u4 32 bit with bnx2 version 1.4.43f

Interesting. From my reading only the 64-bit version was affected. Anyway, I have a fix which just throws away any samples where an overflow, correct or bogus, occurs. That is definitely fine in 64-bit land. Even at full speed, a 1 Gbit NIC would overflow only after 5000 years. Nothing that I worry about much :-) Even 5 years for a future 1 Tbit NIC is not that bad... But in 32-bit, a 1 Gbit NIC could overflow every 40 seconds. And that is very short.

Cheers
Martin

> Martin Knoblauch wrote:
>> ----- Original Message -----
>> From: aurbain
>> To: Martin Knoblauch
>> Cc: [EMAIL PROTECTED]; ganglia general
>> Sent: Tuesday, February 26, 2008 8:25:13 PM
>> Subject: Re: [Ganglia-general] Need a script to remove spikes from network RRDs
>>
>> Happens only on 64-bit systems. Now, my fix kills the generation of the spikes, but my RRD database is now tainted for another 12 months.
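The overflow times quoted in this message can be reproduced with a few lines of arithmetic. The sketch assumes a byte counter, a link running continuously at its nominal speed, and decimal units (1 Gbit/s = 10^9 bit/s); the function name is made up for the example.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def seconds_to_wrap(counter_bits, link_bits_per_second):
    """Time until a byte counter of the given width wraps at full link speed."""
    bytes_per_second = link_bits_per_second / 8
    return 2 ** counter_bits / bytes_per_second


wrap_32_gbit = seconds_to_wrap(32, 1e9)                      # in seconds
wrap_64_gbit = seconds_to_wrap(64, 1e9) / SECONDS_PER_YEAR   # in years
wrap_64_tbit = seconds_to_wrap(64, 1e12) / SECONDS_PER_YEAR  # in years

print(f"32-bit counter, 1 Gbit/s: {wrap_32_gbit:.1f} s")      # about 34.4 s
print(f"64-bit counter, 1 Gbit/s: {wrap_64_gbit:.0f} years")  # about 4676 years
print(f"64-bit counter, 1 Tbit/s: {wrap_64_tbit:.1f} years")  # about 4.7 years
```

The results are of the same order as the rounded figures in the message (40 seconds, 5000 years, 5 years).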
Re: [Ganglia-general] [Ganglia-developers] Moving all built-in metrics to metric modules...
Hi Brad,

that seems to be a pretty useful move. Seems it is time that I really start looking closely at 3.1.x

Cheers
Martin
Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de

----- Original Message -----
From: Brad Nicholes [EMAIL PROTECTED]
To: [EMAIL PROTECTED]; ganglia-general@lists.sourceforge.net
Sent: Tuesday, December 18, 2007 11:44:45 PM
Subject: [Ganglia-developers] Moving all built-in metrics to metric modules...

I just committed a rather substantial patch to Ganglia 3.1.0 trunk which will affect the way that gmond 3.1.x is deployed. I am posting this to both the developer list and the general list so that all will be aware of the changes and why they are important. The primary purpose for the patch was to remove all of the built-in metrics out of the gmond binary and allow them to be built as loadable modules. The following is a more detailed list of what has changed. Hopefully from a user perspective, gmond will continue to work as it has in the past. But going forward, it will be much more flexible with regards to the core set of metrics.

* All built-in metrics have been removed from the gmond binary
  - A new set of core metric modules has been created that represents the same set of metrics that gmond has always gathered. These new core modules are mod_cpu.so, mod_disk.so, mod_load.so, mod_mem.so, mod_net.so, mod_proc.so and mod_sys.so. Each of these modules is basically a wrapper around the metric functions that exist in libmetrics. Being wrappers, they still make the same metric function calls as have always been made. And since libmetrics contains all of the platform specific metric code, the metric function calls made by the core modules will continue to do the right thing for all of the platforms that have been previously supported.
  - There is also an extra module called core_metrics which contains the heartbeat, location and gexec metrics. Even though this module could be dynamically loaded in the same manner as the others, it is always statically linked simply because gmond would not be able to function properly without these metrics, so there is no real reason to allow these metrics to be dynamically loaded.
  - Some additional configuration has been added to the gmond.conf file. Because the core metrics are now implemented as modules, this requires a module configuration block that instructs gmond to load each module. A set of module blocks has been added to the default gmond.conf file.

* All metric specific metadata definitions have been removed from protocol.x
  - With the refactoring of the XDR data and removal of the built-in metrics, there is no longer any need for XDR to have intimate knowledge of the core metrics. Therefore the metric structure array and enum have been removed and are now part of the core metric modules themselves.

* --enable-static-build statically links the core metric modules
  - Building gmond statically will statically link not only APR, expat and libconfuse, it will also statically link all of the core metric modules into the gmond binary. The result should be a gmond binary that looks and feels just like the old 3.0.x statically linked gmond binary. The one exception is that a module statement is still required in the gmond.conf file. The difference between the module configuration block for dynamically loaded modules and the module blocks for statically linked modules is whether or not a path to the .so is included. The configure script and makefiles have been modified to detect --enable-static-build and build the default gmond.conf file appropriately.

* --enable-static-build + --enable-python statically links the python module
  - One of the downsides of building gmond 3.1.x statically was that doing so would disable all of the dynamically loadable module capability. The reason for this is the need for both gmond and the pluggable modules to dynamically link with libapr1. However, if both --enable-static-build and --enable-python are specified during configure, a gmond binary will be built with mod_python statically linked. This provides gmond with the ability to continue to load and run python metric modules in the same manner as the non-static build. In other words, even though statically linking gmond will disable pluggable C interface modules, python pluggable modules will still continue to work.

* All metrics carry a group designation
  - Now that all metrics have been implemented as loadable modules, the metrics have also been assigned to groups. The XML that is produced by gmond and gmetad will carry a tag that defines which group each metric belongs to. This will allow the web front
Re: [Ganglia-general] Overriding hostname
--- Andy Brody [EMAIL PROTECTED] wrote:
> I'd also really like this functionality. A slightly different but related problem: it's been tremendously annoying that gmond on the head node doesn't know that data coming from different interfaces of a multihomed machine is really just one machine. Having each gmond pass some unique per-host identifier other than the IP address would be great.
> -Andy Brody
>
> Richard Mohr wrote:
>> On Thu, 2007-09-20 at 05:44 +0100, richard grevis wrote:
>>> There have been discussions earlier about getting each gmond to send a hostname rather than using the source address and reverse DNSing it on the headnode.
>> I would definitely like this functionality.

Me too.

Martin
-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Bad Network data
Ian,

long day :-(

Thanks
Martin

--- Ian Cunningham [EMAIL PROTECTED] wrote:
> Martin, I think bnx2 is the kernel module for the NIC. B.N.X. meaning Broadcom NetXtreme.
> Cheers, Ian
>
> Martin Knoblauch wrote:
>> Hi Jeff, could you provide me with the output from:
>>   ifconfig -a
>>   netstat -i
>>   cat /proc/net/dev
>> And what is bnx?
>> Thanks Martin
>>
>> --- Jeff Blasius [EMAIL PROTECTED] wrote:
>>> Hello Martin, Here is some more information regarding the setup. Thank You! -jeff
>>> 06:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708S Gigabit Ethernet (rev 11)
>>> 0d:00.0 PCI bridge: Intel Corporation 6702PXH PCI Express-to-PCI Bridge A (rev 09)
>>> Linux c001 2.6.9-42.ELsmp #1 SMP Tue Aug 15 10:35:26 BST 2006 x86_64 x86_64 x86_64 GNU/Linux
>>> Red Hat Enterprise Linux WS release 4 (Nahant Update 4)
>>> [EMAIL PROTECTED] ~]# dmesg |grep eth0
>>> divert: allocating divert_blk for eth0
>>> eth0: Broadcom NetXtreme II BCM5708 1000Base-SX (B1) PCI-X 64-bit 133MHz found at mem f400, IRQ 11, node addr 0015c5f7cc3e
>>> bnx2: eth0: using MSI
>>> eth0: no IPv6 routers present
>>> [EMAIL PROTECTED] ~]# dmesg |grep eth1
>>> divert: allocating divert_blk for eth1
>>> eth1: Broadcom NetXtreme II BCM5708 1000Base-SX (B1) PCI-X 64-bit 133MHz found at mem f800, IRQ 11, node addr 0015c5f7cc3c
>>> bnx2: eth1: using MSI
>>> bnx2: eth1 NIC Link is Up, 1000 Mbps full duplex
>>> eth1: no IPv6 routers present
>>>
>>> On 4/23/07, Martin Knoblauch [EMAIL PROTECTED] wrote:
>>>> Hi Jeff, what kind of nodes and networking? We have known problems with AIX and Gigabit due to overruns in the byte_in/out code.
>>>> Cheers Martin
>>>>
>>>> --- Jeff Blasius [EMAIL PROTECTED] wrote:
>>>>> Hello! On one of our clusters, ganglia seems to be reporting erroneous network information. See: http://research.yale.edu/hpc/net.jpeg Notice the Pb range spikes? Unfortunately this happens randomly, at least once an hour, on single nodes, which makes any real network information from the cluster Network plot disappear. This is gmond/gmetad version 3.0.3-1, which is running just fine on most of the clusters in our grid.
>>>>> Any ideas? The only unique network setup here is that eth0 and eth1 are both up, but only eth1 has a connection to the switch.
>>>>> Thank You, jeff
>>>>> -- Jeff Blasius / [EMAIL PROTECTED] Phone: (203)432-9940 51 Prospect Rm. 011 High Performance Computing (HPC) UNIX Systems Administrator, WorkStation Support (WSS) Yale University Information Technology Services (ITS)

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Bad Network data
Hi Jeff,

what kind of nodes and networking? We have known problems with AIX and Gigabit due to overruns in the byte_in/out code.

Cheers
Martin

--- Jeff Blasius [EMAIL PROTECTED] wrote:
> Hello! On one of our clusters, ganglia seems to be reporting erroneous network information. See: http://research.yale.edu/hpc/net.jpeg Notice the Pb range spikes? Unfortunately this happens randomly, at least once an hour, on single nodes, which makes any real network information from the cluster Network plot disappear. This is gmond/gmetad version 3.0.3-1, which is running just fine on most of the clusters in our grid.
> Any ideas? The only unique network setup here is that eth0 and eth1 are both up, but only eth1 has a connection to the switch.
> Thank You, jeff
> -- Jeff Blasius / [EMAIL PROTECTED] Phone: (203)432-9940 51 Prospect Rm. 011 High Performance Computing (HPC) UNIX Systems Administrator, WorkStation Support (WSS) Yale University Information Technology Services (ITS)

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Bad Network data
Hi Jeff,

could you provide me with the output from:

  ifconfig -a
  netstat -i
  cat /proc/net/dev

And what is bnx?

Thanks
Martin

--- Jeff Blasius [EMAIL PROTECTED] wrote:
> Hello Martin, Here is some more information regarding the setup. Thank You! -jeff
> 06:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708S Gigabit Ethernet (rev 11)
> 0d:00.0 PCI bridge: Intel Corporation 6702PXH PCI Express-to-PCI Bridge A (rev 09)
> Linux c001 2.6.9-42.ELsmp #1 SMP Tue Aug 15 10:35:26 BST 2006 x86_64 x86_64 x86_64 GNU/Linux
> Red Hat Enterprise Linux WS release 4 (Nahant Update 4)
> [EMAIL PROTECTED] ~]# dmesg |grep eth0
> divert: allocating divert_blk for eth0
> eth0: Broadcom NetXtreme II BCM5708 1000Base-SX (B1) PCI-X 64-bit 133MHz found at mem f400, IRQ 11, node addr 0015c5f7cc3e
> bnx2: eth0: using MSI
> eth0: no IPv6 routers present
> [EMAIL PROTECTED] ~]# dmesg |grep eth1
> divert: allocating divert_blk for eth1
> eth1: Broadcom NetXtreme II BCM5708 1000Base-SX (B1) PCI-X 64-bit 133MHz found at mem f800, IRQ 11, node addr 0015c5f7cc3c
> bnx2: eth1: using MSI
> bnx2: eth1 NIC Link is Up, 1000 Mbps full duplex
> eth1: no IPv6 routers present
>
> On 4/23/07, Martin Knoblauch [EMAIL PROTECTED] wrote:
>> Hi Jeff, what kind of nodes and networking? We have known problems with AIX and Gigabit due to overruns in the byte_in/out code.
>> Cheers Martin
>>
>> --- Jeff Blasius [EMAIL PROTECTED] wrote:
>>> Hello! On one of our clusters, ganglia seems to be reporting erroneous network information. See: http://research.yale.edu/hpc/net.jpeg Notice the Pb range spikes? Unfortunately this happens randomly, at least once an hour, on single nodes, which makes any real network information from the cluster Network plot disappear. This is gmond/gmetad version 3.0.3-1, which is running just fine on most of the clusters in our grid.
>>> Any ideas? The only unique network setup here is that eth0 and eth1 are both up, but only eth1 has a connection to the switch.
>>> Thank You, jeff
>>> -- Jeff Blasius / [EMAIL PROTECTED] Phone: (203)432-9940 51 Prospect Rm. 011 High Performance Computing (HPC) UNIX Systems Administrator, WorkStation Support (WSS) Yale University Information Technology Services (ITS)

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Help! I have a petabyte/s network
David,

good catch. I will have to look at it for a bit.

Cheers
Martin

--- David Wong [EMAIL PROTECTED] wrote:
> I don't write much code nowadays, so I'm going to need a lot of help with this. I dug through the ganglia code and I found this interesting tidbit in libmetrics/aix/metrics.c which may be indicative of the problem. There's an assignment from cur_ninfo.ibytes to cur_net_stat.ibytes, but the types of the two variables are different. net_stat::ibytes is a double:
>
>     struct net_stat{
>         double ipackets;
>         double opackets;
>         double ibytes;
>         double obytes;
>     } cur_net_stat;
>
> and we have *ninfo declared here:
>
>     perfstat_netinterface_total_t ninfo[2],*last_ninfo, *cur_ninfo ;
>
> libperfstat.h has perfstat_netinterface_total_t::ibytes as u_longlong_t. Does this code try to do what I think it is doing, i.e. assign an unsigned 64 bit integer to a signed 64bit integer? I'm willing to test the code if someone who's more adept at coding and building will take on the challenge. It looks to me that the type mismatch will have to be fixed in a few places, such as CALC_NETSTAT, and we'll have to add an unsigned long long to g_val_t too. Those are the ones I can see so far.
>
> David Wong Senior Systems Engineer Management Dynamics, Inc. Phone: 201-804-6127 [EMAIL PROTECTED]
>
> -----Original Message-----
> From: Martin Knoblauch [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 28, 2007 12:00 PM
> To: David Wong; ganglia-general@lists.sourceforge.net
> Subject: Re: [Ganglia-general] Help! I have a petabyte/s network
>
> David, as far as I remember, the AIX metrics code had an overflow/wrap-around problem prior to 3.0.4. Maybe the fixes are not thorough enough. The packets/sec are of course less affected.
> Cheers Martin
>
> --- David Wong [EMAIL PROTECTED] wrote:
>> Ganglia is reporting that I'm pushing up to 200 Petabytes/s through my network. Nobody tell the network admin! I'm running Ganglia 3.0.4 with the Power5 add-ons on AIX 5.3. Bytes in and out statistics generally appear to have the right values. However at random times, I get spikes in the petabytes/s range. Here's a dump of the bytes_in database. At first, I suspected perhaps these coincide with some counters getting reset, but they don't occur at regular intervals.
>>
>> <!-- 2007-03-27 20:42:00 GMT / 1175028120 --> <row><v> 1.9268390706e+05 </v></row>
>> <!-- 2007-03-27 20:48:00 GMT / 1175028480 --> <row><v> 1.5833184975e+05 </v></row>
>> <!-- 2007-03-27 20:54:00 GMT / 1175028840 --> <row><v> 1.6838302753e+05 </v></row>
>> <!-- 2007-03-27 21:00:00 GMT / 1175029200 --> <row><v> 1.3766069592e+05 </v></row>
>> <!-- 2007-03-27 21:06:00 GMT / 1175029560 --> <row><v> 2.1711888414e+05 </v></row>
>> <!-- 2007-03-27 21:12:00 GMT / 1175029920 --> <row><v> 4.9959709273e+16 </v></row>
>> <!-- 2007-03-27 21:18:00 GMT / 1175030280 --> <row><v> 1.7401339783e+05 </v></row>
>> <!-- 2007-03-27 21:24:00 GMT / 1175030640 --> <row><v> 2.0955720861e+05 </v></row>
>> <!-- 2007-03-27 21:30:00 GMT / 1175031000 --> <row><v> 1.9032255300e+05 </v></row>
>> <!-- 2007-03-27 21:36:00 GMT / 1175031360 --> <row><v> 1.9162727036e+05 </v></row>
>> <!-- 2007-03-27 21:42:00 GMT / 1175031720 --> <row><v> 1.2703790825e+05 </v></row>
>>
>> Can anyone shed light on what might be happening? Any pointers for debugging?
>>
>> David Wong Senior Systems Engineer Management Dynamics, Inc. Phone: 201-804-6127 [EMAIL PROTECTED]
>
> -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] gmetad patch to contact random data_source hosts
Tim,

your diff command looks a bit surprising to me. The revision number looks like CVS, and we have been on SVN for quite some time. Which version of Ganglia have you checked out?

Cheers
Martin

--- Witham, Timothy D [EMAIL PROTECTED] wrote:

> Hi,
>
> I just had a situation where the first host in a gmetad data_source accepts the connection but offers no data, like this:
>
>   poll() timeout for [clustername] data source after 0 bytes read
>
> Gmetad always tries the sources in order and so it just keeps getting stuck on this first one, and losing the data for the entire cluster.
>
> Here is a quick patch that tries random hosts from the list instead, and solved my problem. It is not careful to make sure it tried them all, but if it fails it will just try again next time. If someone wants to fix it to try all the sources in a random order, that would be fine.
>
> Perhaps this could be included in the next release unless someone knows a good reason to always try the sources in order.
>
> Thanks!
>
> -8<-
> diff -c -r1.1.1.1 data_thread.c
> *** data_thread.c 19 Mar 2007 18:52:32 - 1.1.1.1
> --- data_thread.c 28 Mar 2007 18:12:08 -
> ***************
> *** 18,24 ****
>   void *
>   data_thread ( void *arg )
>   {
> !    int i, sleep_time, bytes_read, rval;
>      data_source_list_t *d = (data_source_list_t *)arg;
>      g_inet_addr *addr;
>      g_tcp_socket *sock=0;
> --- 18,24 ----
>   void *
>   data_thread ( void *arg )
>   {
> !    int i, j, sleep_time, bytes_read, rval;
>      data_source_list_t *d = (data_source_list_t *)arg;
>      g_inet_addr *addr;
>      g_tcp_socket *sock=0;
> ***************
> *** 60,75 ****
>         if(d->last_good_index >= 0)
>           sock = g_tcp_socket_new ( d->sources[d->last_good_index] );
>
> !       /* If there was no good connection last time or the above connect failed then try each host in the list. */
>         if(!sock)
>           {
> !           for(i=0; i < d->num_sources; i++)
>               {
> !               /* Find first viable source in list. */
> !               sock = g_tcp_socket_new ( d->sources[i] );
>                 if( sock )
>                   {
> !                   d->last_good_index = i;
>                     break;
>                   }
>               }
> --- 60,80 ----
>         if(d->last_good_index >= 0)
>           sock = g_tcp_socket_new ( d->sources[d->last_good_index] );
>
> !       /* If there was no good connection last time or the above
> !          connect failed then try random hosts in the list.  We try
> !          random ones in case someone is accepting the connection
> !          but refusing to provide any data; we don't want to get
> !          stuck with a non-working host. */
>         if(!sock)
>           {
> !           for(i=0; i < d->num_sources * 2; i++)
>               {
> !               /* Find random viable source in list. */
> !               j = d->num_sources * (rand() / (RAND_MAX - 1.0));
> !               sock = g_tcp_socket_new ( d->sources[j] );
>                 if( sock )
>                   {
> !                   d->last_good_index = j;
>                     break;
>                   }
>               }
> -8<-
>
> --
> [EMAIL PROTECTED]; I don't speak for Intel or anyone.

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
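Tim's note that someone could "try all the sources in a random order" might be sketched as follows. This is an illustration in Python for brevity (gmetad itself is C) and is not taken from the Ganglia sources: `pick_source` and the `connect` callback are hypothetical names, with `connect` standing in for g_tcp_socket_new and returning None on failure. One detail the sketch sidesteps: as quoted, `d->num_sources * (rand() / (RAND_MAX - 1.0))` can evaluate to num_sources itself when rand() returns RAND_MAX - 1 or RAND_MAX, which would index one past the end of the list.

```python
import random


def pick_source(sources, connect, last_good_index=None, rng=random):
    """Return (index, connection) for the first source that connects.

    `connect` is a hypothetical stand-in for g_tcp_socket_new: it takes a
    source and returns a connection object, or None on failure.
    Returns (None, None) if no source could be reached.
    """
    # Prefer the host that worked last time, as the original code does.
    if last_good_index is not None:
        conn = connect(sources[last_good_index])
        if conn is not None:
            return last_good_index, conn

    # Otherwise visit every remaining source exactly once, in random order.
    order = list(range(len(sources)))
    rng.shuffle(order)
    for idx in order:
        if idx == last_good_index:
            continue  # already tried above
        conn = connect(sources[idx])
        if conn is not None:
            return idx, conn
    return None, None
```

For the failure Tim describes (a host that accepts the connection but sends no data), the caller would presumably also have to forget last_good_index after a poll() timeout; otherwise the same host is retried first on every round.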
Re: [Ganglia-general] Gmetad and web frontend on different machines.
Richard,

depending on the cluster size, writing the RRDs via NFS might turn out to be a huge bottleneck.

Cheers
Martin

--- [EMAIL PROTECTED] wrote:

> Saundrya,
>
> It sort of looks like you can, but actually you can't. gmetad writes to rrd databases as local files, and the web frontend and PHP read rrd databases as local files (actually it invokes rrdtool itself). I imagine you could separate the two using NFS filesystems, but I have not tried this.
>
> kind regards,
> Richard Grevis
> Production Architecture
> Barclays Capital, Canary Wharf, London, E14 4BB
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of saundrya mishra
> Sent: 29 March 2007 14:30
> To: ganglia-general@lists.sourceforge.net
> Subject: [Ganglia-general] Gmetad and web frontend on different machines.
>
> Hi There,
>
> I am new to Ganglia. Can we have gmetad and the web frontend for a cluster running on two different machines? If yes, then how is it possible, since I read in the configuration file of the web frontend that the RRDTool databases need to be local to be read?
>
> Greetings,
> Saundrya.

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
Re: [Ganglia-general] Ganglia custom Round-Robin archives RRA
Hi,

the definition in gmetad.conf is only used for new RRD files. There are two options:

- throw your data away
- modify the old data.

If you look at bugzilla #33 you will find an attached script that should do what you want. It is not in the sources because I am lazy and the licensing is not clear yet.

Cheers
Martin

--- CASTRO Paulo Edgar [EMAIL PROTECTED] wrote:

> Hi all.
>
> We have been testing ganglia here, implemented on about 250 machines. By the way, good job on the tool guys.
>
> We've been peeking at the conf files, namely gmetad.conf, and we found this commented option about custom Round-Robin archives. The thing is, we wanted to be able to have an RRA of our own which could aggregate all the 5 minute PDPs for a whole year. See what I mean ;) So we wouldn't lose granularity while reading directly from the rrd files.
>
> We tried adding this to the gmetad.conf
>
>   RRAs "RRA:AVERAGE:0.5:1:105408"
>
> 105408 being the number of 5 minute intervals in a year. But we still haven't noticed any change, nor have the rrd files grown enough to accommodate the new RRA.
>
> How can we manage to do this? Do we need to start the whole collection process again, erasing the previous data and files? Will it work with this new option? Is this syntax for the conf file correct?
>
> Tkx in advance,
> PECastro

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
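The row count in Paulo's RRA can be checked with a little arithmetic: 105408 is 366 days times 288 five-minute intervals per day. The sketch below (Python, hypothetical helper name `rra_definition`) builds the RRA string from a base step, a row resolution and a retention period. It also illustrates a caveat that rests on an assumption: the third RRA field counts base steps per row, so "1" only means five minutes if the RRD step itself is 300 seconds. If the files were created with a 15 second step (believed to be the gmetad default, but worth confirming with `rrdtool info` on an existing file), a five-minute row would need 20 steps.

```python
def rra_definition(step_seconds, row_seconds, retention_seconds,
                   cf="AVERAGE", xff=0.5):
    """Build an RRA string for rows of `row_seconds` kept for `retention_seconds`.

    step_seconds is the base step of the RRD file (an assumption the caller
    must supply; it is not stored in the RRA string itself).
    """
    if row_seconds % step_seconds != 0:
        raise ValueError("row resolution must be a multiple of the RRD step")
    steps_per_row = row_seconds // step_seconds
    rows = retention_seconds // row_seconds
    return f"RRA:{cf}:{xff}:{steps_per_row}:{rows}"


YEAR = 366 * 24 * 3600  # a leap year, which is where 105408 comes from

# The RRA from the mail: correct if the RRD step is 300 seconds.
print(rra_definition(300, 300, YEAR))
# The same retention if the RRD step is 15 seconds (assumed default).
print(rra_definition(15, 300, YEAR))
```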
Re: [Ganglia-general] Help! I have a petabyte/s network
David,

after some looking at CALC_NETSTAT I see no *type* problems here:

  #define CALC_NETSTAT(type) (double) ((cur_ninfo->type<last_ninfo->type)? -1:(cur_ninfo->type - last_ninfo->type)/timediff)

cur_ninfo->type and last_ninfo->type are of the same type, and the macro will just return a double float of either -1 or a positive rate.

It would be interesting to see the values of cur_ninfo->type, last_ninfo->type and timediff when you observe the petabyte performance. Can you add some debug statements around lines 873-876?

Cheers
Martin

--- David Wong [EMAIL PROTECTED] wrote:

> I don't write much code nowadays, so I'm going to need a lot of help with this.
>
> I dug through the ganglia code and I found this interesting tidbit in libmetrics/aix/metrics.c which may be indicative of the problem. There's an assignment from cur_ninfo.ibytes to cur_net_stat.ibytes, but the types of the two variables are different. net_stat::ibytes is a double:
>
>   struct net_stat{
>     double ipackets;
>     double opackets;
>     double ibytes;
>     double obytes;
>   } cur_net_stat;
>
> and we have *ninfo declared here:
>
>   perfstat_netinterface_total_t ninfo[2],*last_ninfo, *cur_ninfo ;
>
> libperfstat.h has perfstat_netinterface_total_t::ibytes as u_longlong_t.
>
> Does this code try to do what I think it is doing, i.e. assign an unsigned 64 bit integer to a signed 64bit integer?
>
> I'm willing to test the code if someone who's more adept at coding and building will take on the challenge. It looks to me that the type mismatch will have to be fixed in a few places, such as CALC_NETSTAT, and we'll have to add an unsigned long long to g_val_t too. Those are the ones I can see so far.
>
> David Wong
> Senior Systems Engineer
> Management Dynamics, Inc.
> Phone: 201-804-6127
> [EMAIL PROTECTED]
>
> -----Original Message-----
> From: Martin Knoblauch [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 28, 2007 12:00 PM
> To: David Wong; ganglia-general@lists.sourceforge.net
> Subject: Re: [Ganglia-general] Help! I have a petabyte/s network
>
> David,
>
> as far as I remember, the AIX metrics code had an overflow/wrap-around problem prior to 3.0.4. Maybe the fixes are not thorough enough. The packets/sec are of course less affected.
>
> Cheers
> Martin
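To make the behaviour of the macro concrete, here is a small Python model of it, plus a deliberately unguarded variant. It is an illustration only, not a diagnosis of the AIX spikes: the function names are made up, `timediff` is assumed to be a floating-point number of seconds, and the 64-bit mask simply imitates unsigned long long arithmetic. The point is that a counter that goes backwards produces -1 with the guard, and a value on the order of 2^64 (about 1.8e19) without it.

```python
U64_MASK = (1 << 64) - 1  # imitates u_longlong_t wrap-around


def calc_netstat(cur, last, timediff):
    """Model of CALC_NETSTAT as quoted: -1 if the counter went backwards,
    otherwise the rate (counter delta divided by elapsed seconds)."""
    if cur < last:
        return -1.0
    return float(cur - last) / timediff


def unguarded_rate(cur, last, timediff):
    """Hypothetical variant without the guard: the unsigned subtraction
    wraps around modulo 2**64 before the division."""
    return float((cur - last) & U64_MASK) / timediff


# Normal case: 600 bytes in 60 seconds.
print(calc_netstat(1000, 400, 60.0))
# Counter reset or wrap: the guard reports -1 ...
print(calc_netstat(5, 10, 60.0))
# ... while an unguarded unsigned subtraction yields an absurdly large rate.
print(unguarded_rate(5, 10, 60.0))
```

Printing cur, last and timediff at the moment of a spike, as Martin asks, would show which of these cases (if any) is actually occurring.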
Re: [Ganglia-general] Help! I have a petabyte/s network
David,

as far as I remember, the AIX metrics code had an overflow/wrap-around problem prior to 3.0.4. Maybe the fixes are not thorough enough. The packets/sec are of course less affected.

Cheers
Martin

--- David Wong [EMAIL PROTECTED] wrote:

> Ganglia is reporting that I'm pushing up to 200 Petabytes/s through my network. Nobody tell the network admin!
>
> I'm running Ganglia 3.0.4 with the Power5 add-ons on AIX5.3
>
> Bytes in and out statistics generally appear to have the right values. However at random times, I get spikes in the petabytes/s range. Here's a dump of the bytes_in database. At first, I suspected perhaps these coincide with some counters getting reset, but they don't occur at regular intervals.
>
> <!-- 2007-03-27 20:42:00 GMT / 1175028120 --> <row><v> 1.9268390706e+05 </v></row>
> <!-- 2007-03-27 20:48:00 GMT / 1175028480 --> <row><v> 1.5833184975e+05 </v></row>
> <!-- 2007-03-27 20:54:00 GMT / 1175028840 --> <row><v> 1.6838302753e+05 </v></row>
> <!-- 2007-03-27 21:00:00 GMT / 1175029200 --> <row><v> 1.3766069592e+05 </v></row>
> <!-- 2007-03-27 21:06:00 GMT / 1175029560 --> <row><v> 2.1711888414e+05 </v></row>
> <!-- 2007-03-27 21:12:00 GMT / 1175029920 --> <row><v> 4.9959709273e+16 </v></row>
> <!-- 2007-03-27 21:18:00 GMT / 1175030280 --> <row><v> 1.7401339783e+05 </v></row>
> <!-- 2007-03-27 21:24:00 GMT / 1175030640 --> <row><v> 2.0955720861e+05 </v></row>
> <!-- 2007-03-27 21:30:00 GMT / 1175031000 --> <row><v> 1.9032255300e+05 </v></row>
> <!-- 2007-03-27 21:36:00 GMT / 1175031360 --> <row><v> 1.9162727036e+05 </v></row>
> <!-- 2007-03-27 21:42:00 GMT / 1175031720 --> <row><v> 1.2703790825e+05 </v></row>
>
> Can anyone shed light on what might be happening? Any pointers for debugging?
>
> David Wong
> Senior Systems Engineer
> Management Dynamics, Inc.
> Phone: 201-804-6127
> [EMAIL PROTECTED]

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
Re: [Ganglia-general] mcast_ttl in 3.0 gmond.conf
--- Ian Cunningham [EMAIL PROTECTED] wrote:

> Gil,
>
> Gilad Raphaelli wrote:
>> Hello,
>>
>> I'm having a problem increasing gmond's multicast packet ttl. I've tried putting mcast_ttl on a line of its own and inside the global { } and udp_send_channel {} directives and always get gmond.conf parsing errors when trying to start gmond-3.0.4. Any pointers on where mcast_ttl can be set?
>>
>> The error message is:
>>
>>   gmond.conf:200: no such option 'mcast_ttl'
>>
>> Finally, mcast_ttl doesn't appear in gmond -t - has this functionality been removed altogether?
>>
>> Thanks,
>> Gil
>
> I no longer use multicast so I am not sure it works, but from looking at the source code, it looks like it was changed to 'ttl' under 'udp_send_channel'.

which is even correctly documented in the shipping tarball. We should update the stuff on the web page though ...

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
Re: [Ganglia-general] PBS Queue visualisation
Adam,

look at the report/compound graphs in web/graph.php. They should basically do what you want.

Cheers
Martin

--- Adam Gray [EMAIL PROTECTED] wrote:

> I'm running ganglia on a cluster managed with OpenPBS. I have made a few extra metrics for monitoring CPU temp and batch system jobs on each node.
>
> I was wondering how I could go about making a sort of cluster queue usage graph. Each queue would pile on top of each other the number of nodes it is using. E.g. if queue1 was using 24 of 124 available nodes, and queue2 was using 96, there would be a section at the bottom 20% and a different colored section on the next 75%, and the top 5% would be empty.

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
Re: [Ganglia-general] XML error: no element found at 1
Ashutok,

you need to do a query if you use port 8562 (the web interface does). What happens if you do telnet localhost 8561? That should give you the complete gmetad XML stream.

Is the rrdroot directory writable to the owner of the gmetad process? It should belong to e.g. nobody. This is a common mistake.

cheers
Martin

--- Ashutosh Mahajan [EMAIL PROTECTED] wrote:

> hello everyone,
>
> We are having problems installing ganglia version 3.0.4 with rrdtool-1.2.15. we can successfully do make, make install. gstat -a also seems to work. telnet localhost 8649 seems to throw out correct XML file. However, gmetad seems to be having some problems.
>
> telnet localhost 8652 seems to hang forever with the message:
>
>   Trying 127.0.0.1...
>   Connected to localhost.
>   Escape character is '^]'.
>
> if i access ganglia through the web, i get this message after a long long time:
>
>   There was an error collecting ganglia data (192.168.1.1:8652): XML error: no element found at 1
>
> rrd_rootdir also remains empty. what could be wrong? i can provide more details if necessary.
>
> thanks in advance.
>
> --
> Regards
> Ashutosh
> www.lehigh.edu/~asm4

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
Re: [Ganglia-general] Two similar linux hosts provides different metrics
Vitaly,

in this case try to run gmond with a debug level higher than 2. Maybe this sheds some light on it. Or, you could add debug statements to the proc_run_func and proc_total_func code.

But: first of all show us the output of cat /proc/loadavg on both nodes.

cheers
Martin

--- Vitaly Karasik [EMAIL PROTECTED] wrote:

> It seems like we have different numbers in gmond:
>
> <HOST NAME="5.5.5.5" IP="5.5.5.5" REPORTED="1168934873" TN="2" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1166534354">
> ..
> <METRIC NAME="proc_total" VAL="185" TYPE="uint32" UNITS="" TN="229" TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> ..
> <METRIC NAME="proc_run" VAL="0" TYPE="uint32" UNITS="" TN="229" TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
>
> <HOST NAME="5.5.5.6" IP="5.5.5.6" REPORTED="1168934871" TN="3" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1166534349">
> <METRIC NAME="proc_run" VAL="15" TYPE="uint32" UNITS="" TN="68" TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> <METRIC NAME="proc_total" VAL="439" TYPE="uint32" UNITS="" TN="68" TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
>
> Thanks,
> Vitaly
>
> -----Original Message-----
> From: Martin Knoblauch [mailto:[EMAIL PROTECTED]
> Sent: Monday, January 15, 2007 12:30 PM
> To: Vitaly Karasik; ganglia-general@lists.sourceforge.net
> Subject: RE: [Ganglia-general] Two similar linux hosts provides different metrics
>
> Hi Vitaly,
>
> where do you see the invalid numbers:
>
> a) in the gmond XML Stream (telnet/nc to the gmond XML port)
> b) in the XML Stream from gmetad (telnet/nc to the gmetad XML port)
> c) only in the web-frontend
>
> Cheers
> Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
Re: [Ganglia-general] Two similar linux hosts provides different metrics
Vitaly,

gmond on Linux just interprets the fourth field of /proc/loadavg. The number in front of the slash is the number of running processes, the number following the slash is the total number of processes.

Cheers
Martin

--- Vitaly Karasik [EMAIL PROTECTED] wrote:

> .5:
> cat /proc/loadavg
> 0.04 0.06 0.01 1/185 10512
>
> .6:
> cat /proc/loadavg
> 1.03 1.01 1.00 2/441 19965
>
> Oops! I think I'm starting to understand - the number of processes on both machines is the same, but the number of threads is different. Probably gmond counts threads, not processes:
>
> .5:
> ps -ef|wc 64
> ps -efm|wc 187
>
> .6:
> ps -ef|wc 62
> ps -efm|wc 441
>
> -----Original Message-----
> From: Martin Knoblauch [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, January 16, 2007 11:59 AM
> To: Vitaly Karasik; [EMAIL PROTECTED]; ganglia-general@lists.sourceforge.net
> Subject: RE: [Ganglia-general] Two similar linux hosts provides different metrics
>
> Vitaly,
>
> in this case try to run gmond with a debug level higher than 2. Maybe this sheds some light on it. Or, you could add debug statements to the proc_run_func and proc_total_func code.
>
> But: first of all show us the output of cat /proc/loadavg on both nodes.
>
> cheers
> Martin
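The fourth field of /proc/loadavg that Martin describes can be pulled apart with a few lines of Python. This is a sketch of the file format only, not of gmond's C implementation; `parse_loadavg` and the dictionary keys are illustrative names borrowed from the metric names in the thread. Consistent with Vitaly's observation, the kernel counts scheduling entities here, so threads are included in both numbers.

```python
def parse_loadavg(line):
    """Split one line of /proc/loadavg into its components.

    Format: three load averages, "running/total" scheduling entities,
    and the most recently assigned PID.
    """
    fields = line.split()
    running, total = fields[3].split("/")
    return {
        "load_one": float(fields[0]),
        "load_five": float(fields[1]),
        "load_fifteen": float(fields[2]),
        "proc_run": int(running),
        "proc_total": int(total),
        "last_pid": int(fields[4]),
    }


# The two samples posted in the thread.
print(parse_loadavg("0.04 0.06 0.01 1/185 10512"))
print(parse_loadavg("1.03 1.01 1.00 2/441 19965"))
```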
Re: [Ganglia-general] XML error: no element found at 1
Hi Ashutosh,

sorry for the wrong port. I meant of course 8651.

You could try to run gmetad with a high debug level. This could help to track down the problem. Also, could you please post the gmetad.conf file?

Cheers
Martin

--- Ashutosh Mahajan [EMAIL PROTECTED] wrote:

> Quoting Martin Knoblauch [EMAIL PROTECTED]:
>
>> Ashutok,
>>
>> you need to do a query if you use port 8562 (the web interface does). What happens if you do telnet localhost 8561? That should give you the complete gmetad XML stream.
>
> thanks for the prompt reply. you meant 8651, rather than 8561?
>
>   [EMAIL PROTECTED] ~]$ telnet localhost 8651
>   Trying 127.0.0.1...
>   Connected to localhost.
>   Escape character is '^]'.
>
> seems to hang forever there.
>
>> Is the rrdroot directory writable to the owner of the gmetad process? It should belong to e.g. nobody. This is a common mistake.
>
> yeah. it is writable.
>
>> cheers
>> Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
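The telnet check discussed in this thread can also be scripted. The sketch below assumes gmetad's XML port behaves as described above: on connect (8651 after Martin's correction) it sends the whole XML document and then closes the connection. The function name `fetch_xml_dump` is made up for illustration. With a timeout set, a gmetad that accepts the connection but never sends anything shows up as a timeout exception instead of an indefinite hang.

```python
import socket


def fetch_xml_dump(host="localhost", port=8651, timeout=10.0):
    """Read everything the server sends until it closes the connection.

    Raises a timeout error (socket.timeout / TimeoutError) if the server
    accepts the connection but stays silent for `timeout` seconds.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # empty read: the peer closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


if __name__ == "__main__":
    xml = fetch_xml_dump()
    print(len(xml), "bytes received")
    print(xml[:200])
```

The same function pointed at port 8649 would perform the gmond check that already works in Ashutosh's setup, which gives a quick comparison between the two daemons.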
Re: [Ganglia-general] Two similar linux hosts provides different metrics
Hi Vitaly,

where do you see the invalid numbers:

a) in the gmond XML Stream (telnet/nc to the gmond XML port)
b) in the XML Stream from gmetad (telnet/nc to the gmetad XML port)
c) only in the web-frontend

Cheers
Martin

--- Vitaly Karasik [EMAIL PROTECTED] wrote:

> NON-BUSY HOST:
> # ps axl|wc
>      61     862    5865
> # uptime
>  08:54:55 up 204 days, 2:00, 1 user, load average: 0.00, 0.00, 0.00
>
> BUSY HOST
> ]# ps axl|wc
>      62     877    5977
> ]# uptime
>  08:55:18 up 31 days, 16:30, 1 user, load average: 0.04, 0.01, 0.00
>
> -----Original Message-----
> From: Martin Knoblauch [mailto:[EMAIL PROTECTED]
> Sent: Thursday, January 11, 2007 10:54 AM
> To: Vitaly Karasik; ganglia-general@lists.sourceforge.net
> Subject: Re: [Ganglia-general] Two similar linux hosts provides different metrics
>
> Hi Vitaly,
>
> what does ps axl show on both hosts, as that is basically what gmond looks at? If it is already different there, the problem is not ganglia related. (OK, I see you already checked ...)
>
> What are the load averages according to uptime?
>
> Cheers
> Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
Re: [Ganglia-general] Two similar linux hosts provides different metrics
Hi Vitaly,

what does ps axl show on both hosts, as that is basically what gmond looks at? If it is already different there, the problem is not ganglia related. (OK, I see you already checked ...)

What are the load averages according to uptime?

Cheers
Martin

--- Vitaly Karasik [EMAIL PROTECTED] wrote:

> Hi,
>
> I have a weird problem - two linux hosts with similar configuration provide very different metrics about the number of running processes - one shows about 2, and the second about 20-40 (I speak about the concentrated load graph at top right.)
>
> proc_total is different too - 171 vs. 350 (BTW, ps -ef |wc == 61 on both boxes)
>
> Both machines are RHEL3 kernel 2.4.21-37.ELsmp with ganglia-gmond-3.0.3-1 installed from RPM.
>
> Any ideas?
>
> Thanks,
> Vitaly

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
Re: [Ganglia-general] Windows port issues
--- Vladimir Vuksan [EMAIL PROTECTED] wrote:

> matt massie wrote:
>> you need to install the cygwin sunrpc package which is not installed by default during the cygwin install...
>
> That was it. I still wasn't able to compile 3.0.4 (xdr_create? can't be found) however 3.0.3 compiles with no problem.

could you be more specific on the error message? Is it compile time, or link time? There is no such thing as xdr_create. Maybe xdrmem_create.

> Who is the person that packaged it initially since 3.0.3 corrects the Wait CPU issue ie. instead of showing 100% idle shows 100% Wait CPU. Also it may be nice to include gmetric.

Hmm. What package are you referring to? There is no official windows (cygwin) binary distribution.

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
Re: [Ganglia-general] Compatibility mode for gmetad?
--- Jason Faulkner [EMAIL PROTECTED] wrote: Martin Knoblauch wrote: --- Jason Faulkner [EMAIL PROTECTED] wrote: I'm curious about how possible or difficult it would be to make gmetad backwards compatible -- i.e. where I could leave my 2.5.x gmond installations alone, and install 3.x gmetad on my main server (and be able to collect stats despite having a heterogeneous 2.5.x and 3.x environment). This would allow me to (hopefully) live-migrate my ganglia install up to the new version. -- Jason Faulkner Systems Manager Broadwick Corporation (919) 459-2509 Hi Jason, although we bumped the major number in the 2.5.x - 3.0 transition, we took care to not introduce incompatible changes to the core metrics framework. In short, I see no reason why a 3.0.4 gmetad should not be able to query 2.5.x gmond data. It should even be possible to have a 3.0.4 gmond listen to older gmonds. Of course, you are limited to multicast until you have replaced all gmonds. Jan 3 23:12:07 intranet1 ./gmetad[25006]: RRD_update (/var/lib/ganglia/rrds/Dev Login Servers/__SummaryInfo__/part_max_used.rrd): illegal attempt to update using time 1167883927 when last update time is 1167883927 (minimum one second step) I've been receiving repeated errors like this attempting to use a 3.0.x gmetad with a 2.5.7 gmond. The times are synced perfectly to a local NTP server, so I'm sure that's not the issue. Not an NTP issue, you are most likely right. The message tells that the current timestamp for the metrics in question did not change from the previous invocation of the call. Does this only happen on part_max_used, or are other metrics showing up as well? part_max_used is likely changeing very slow, this might be an indicator. also interesting to note that in your example the metrics is not a host, but a summary metrics. Does it prevent useful operation of the 3.0.x gmetad together with 2.5.7 gmonds? Or is it just annoying? 
Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
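The RRD_update error quoted above says that an update was attempted with a timestamp that is not later than the last one stored (minimum one second step). As a minimal sketch of that rule, assuming a hypothetical helper that remembers the last timestamp per RRD file, a caller could skip such updates instead of triggering the error. The class and method names are made up for illustration; this is not how gmetad handles the case.

```python
class RrdUpdateGuard:
    """Remembers the last timestamp written per RRD file (hypothetical helper)."""

    def __init__(self):
        self._last_update = {}

    def should_update(self, rrd_path, timestamp):
        """Return True only if timestamp is strictly later than the last accepted one."""
        last = self._last_update.get(rrd_path)
        if last is not None and timestamp <= last:
            # Same or older second: an update here would be rejected
            # with the "illegal attempt to update" message.
            return False
        self._last_update[rrd_path] = timestamp
        return True


guard = RrdUpdateGuard()
print(guard.should_update("part_max_used.rrd", 1167883927))  # first write for this file
print(guard.should_update("part_max_used.rrd", 1167883927))  # same second again
```

Whether dropping the duplicate is acceptable depends on the metric; for a slowly changing value like part_max_used it would probably lose nothing.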
Re: [Ganglia-general] Windows port issues
--- Vladimir [EMAIL PROTECTED] wrote: Martin Knoblauch wrote: could you be more specific on the error message? Is it compile time, or link time? There is no such thing as xdr_create. Maybe xdrmem_create. Sorry I should have been more precise. It is a linking error. Here is the log gmond.o: In function `Ganglia_collection_group_send': /ganglia-3.0.4/gmond/gmond.c:1633: undefined reference to `_xdrmem_create' gmond.o: In function `main': /ganglia-3.0.4/gmond/gmond.c:897: undefined reference to `_xdrmem_create' /ganglia-3.0.4/gmond/gmond.c:828: undefined reference to `_xdr_free' /ganglia-3.0.4/gmond/gmond.c:912: undefined reference to `_xdr_free' ../lib/.libs/libganglia.a(libgmond.o): In function `Ganglia_gmetric_send': /ganglia-3.0.4/lib/libgmond.c:695: undefined reference to `_xdrmem_create' ../lib/.libs/libganglia.a(libgmond.o): In function `Ganglia_gmetric_send_spoof': /ganglia-3.0.4/lib/libgmond.c:748: undefined reference to `_xdrmem_create' ../lib/.libs/libganglia.a(protocol_xdr.o): In function `xdr_Ganglia_value_types': /ganglia-3.0.4/lib/protocol_xdr.c:13: undefined reference to `_xdr_enum' ../lib/.libs/libganglia.a(protocol_xdr.o): In function `xdr_Ganglia_gmetric_message': /ganglia-3.0.4/lib/protocol_xdr.c:23: undefined reference to `_xdr_string' /ganglia-3.0.4/lib/protocol_xdr.c:25: undefined reference to `_xdr_string' /ganglia-3.0.4/lib/protocol_xdr.c:27: undefined reference to `_xdr_string' /ganglia-3.0.4/lib/protocol_xdr.c:29: undefined reference to `_xdr_string' /ganglia-3.0.4/lib/protocol_xdr.c:31: undefined reference to `_xdr_u_int' /ganglia-3.0.4/lib/protocol_xdr.c:33: undefined reference to `_xdr_u_int' /ganglia-3.0.4/lib/protocol_xdr.c:35: undefined reference to `_xdr_u_int' ../lib/.libs/libganglia.a(protocol_xdr.o): In function `xdr_Ganglia_spoof_header': /ganglia-3.0.4/lib/protocol_xdr.c:45: undefined reference to `_xdr_string' /ganglia-3.0.4/lib/protocol_xdr.c:47: undefined reference to `_xdr_string' ../lib/.libs/libganglia.a(protocol_xdr.o): In 
function `xdr_Ganglia_message_formats': /ganglia-3.0.4/lib/protocol_xdr.c:69: undefined reference to `_xdr_enum' ../lib/.libs/libganglia.a(protocol_xdr.o): In function `xdr_Ganglia_message': /ganglia-3.0.4/lib/protocol_xdr.c:116: undefined reference to `_xdr_u_int' /ganglia-3.0.4/lib/protocol_xdr.c:124: undefined reference to `_xdr_string' /ganglia-3.0.4/lib/protocol_xdr.c:151: undefined reference to `_xdr_float' /ganglia-3.0.4/lib/protocol_xdr.c:156: undefined reference to `_xdr_double' /ganglia-3.0.4/lib/protocol_xdr.c:95: undefined reference to `_xdr_u_short' ../lib/.libs/libganglia.a(protocol_xdr.o): In function `xdr_Ganglia_25metric': /ganglia-3.0.4/lib/protocol_xdr.c:170: undefined reference to `_xdr_int' /ganglia-3.0.4/lib/protocol_xdr.c:172: undefined reference to `_xdr_string' /ganglia-3.0.4/lib/protocol_xdr.c:174: undefined reference to `_xdr_int' /ganglia-3.0.4/lib/protocol_xdr.c:178: undefined reference to `_xdr_string' /ganglia-3.0.4/lib/protocol_xdr.c:180: undefined reference to `_xdr_string' /ganglia-3.0.4/lib/protocol_xdr.c:182: undefined reference to `_xdr_string' /ganglia-3.0.4/lib/protocol_xdr.c:184: undefined reference to `_xdr_int' collect2: ld returned 1 exit status make[3]: *** [gmond.exe] Error 1 make[2]: *** [all-recursive] Error 1 make[1]: *** [all-recursive] Error 1 make: *** [all] Error 2 OK, seems ld is unable to find all of the xdr functions. Maybe someone removed a library from the library list. Although under Linux those functions are in libc. Hmm. What package are you refering to? There is no official windows (cygwin) binary distribution. Perhaps it is unofficial but it is on SourceForge e.g. http://downloads.sourceforge.net/ganglia/ganglia-3.0.0-setup.exe?modtime=1107790662big_mirror=0 Ah. I forgot about this one. And I do not recall who donated the work. I am adding the developers list. Apparently, the installer was never updated after the initial release. 
Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Compatibility mode for gmetad?
--- Jason Faulkner [EMAIL PROTECTED] wrote: I'm curious about how possible or difficult it would be to make gmetad backwards compatible -- i.e. where I could leave my 2.5.x gmond installations alone, and install 3.x gmetad on my main server (and be able to collect stats despite having a heterogeneous 2.5.x and 3.x environment). This would allow me to (hopefully) live-migrate my ganglia install up to the new version. -- Jason Faulkner Systems Manager Broadwick Corporation (919) 459-2509 Hi Jason, although we bumped the major number in the 2.5.x - 3.0 transition, we took care to not introduce incompatible changes to the core metrics framework. In short, I see no reason why a 3.0.4 gmetad should not be able to query 2.5.x gmond data. It should even be possible to have a 3.0.4 gmond listen to older gmonds. Of course, you are limited to multicast until you have replaced all gmonds. Just try it out. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Ganglia+OpenBSD?
--- Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote:
> On Tue, Dec 26, 2006 at 02:38:01PM -0500, Jason Faulkner wrote:
>> Ooops -- sent first email directly to Martin instead of list.
>> Martin Knoblauch wrote:
>>> Jason, apparently configure fails to realize that you are on OpenBSD, which is not supported currently. The unknown part is telling.
>> I thought that might be the case.
>>> In order to support OpenBSD one needs to fix the recognition process in configure and add OpenBSD-specific metrics code to libmetrics.
>> I'm confused though, according to this page: http://sourceforge.net/projects/ganglia/ ganglia runs on all openbsd platforms. I was going on the, apparently false, presumption that this meant the libmetrics code already existed for openbsd.
> not in 3.0.4, but I have a rough version that will hopefully be merged for 3.0.5 and that so far compiles and works (not all metrics though) on the hosts I have to test: OpenBSD 3.7 (i386), OpenBSD 4.0 (i386 and amd64)
>> IANAP, but if there's anything I can do to help get this working on OpenBSD, let me know.
> what versions/arch are you interested in? would you be able to deploy test snapshots of ganglia on them?
> Carlo
Carlo, I see no problem with adding OpenBSD support in 3.0.5. Just go on and check it in once you are satisfied with your stuff. Just out of curiosity: how similar are the BSD flavours? We already have NetBSD and FreeBSD support in.
Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] gmond problem on SLES 10 x64 with floats
Hi Ludovic, do you happen to have some strange/unusual setting of your locale (LANG variable and friends) when you start the gmond executable? The output definitely looks broken. Could you please file a bug on bugzilla? Cheers Martin
--- Ludovic Drolez [EMAIL PROTECTED] wrote:
> Hi ! I installed the official Ganglia RPM on a SLES 10 x64. My graphs are really strange, and the percentage values show random characters. I've just found that the problem is in gmond, which sends random strings in the XML dialog. I've tried to recompile gmond, but I still have the same problem. Here's some of the strace output:
> =
> accept(6, {sa_family=AF_INET, sin_port=htons(43998), sin_addr=inet_addr(127.0.0.1)}, [17179869200]) = 9
> write(9, ?xml version=\1.0\ encoding=\ISO-8859-1\ standalone=\yes\?\n!DOCTYPE GANGLIA_XML [\n !ELEMENT G..., 2328) = 2328
> write(9, GANGLIA_XML VERSION=\3.0.3\ SOURCE=\gmond\\n, 45) = 45
> write(9, CLUSTER NAME=\cluster\ LOCALTIME=\1166087533\ OWNER=\unspecified\ LATLONG=\unspecified\ URL=\unspe..., 108) = 108
> write(9, HOST NAME=\master.localdomain\ IP=\192.168.0.106\ REPORTED=\1166087527\ TN=\5\ TMAX=\20\ DMAX=\0\ ..., 150) = 150
> write(9, METRIC NAME=\disk_total\ VAL=\1A.\332\326\260\ TYPE=\double\ UNITS=\GB\ TN=\1500\ TMAX=\1200\ DMAX=\0\ SLOP..., 125) = 125
> write(9, METRIC NAME=\cpu_speed\ VAL=\2993\ TYPE=\uint32\ UNITS=\MHz\ TN=\300\ TMAX=\1200\ DMAX=\0\ SLOPE=\..., 122) = 122
> write(9, METRIC NAME=\part_max_used\ VAL=\7y.\n\ TYPE=\float\ UNITS=\\ TN=\60\ TMAX=\180\ DMAX=\0\ SLOPE=\bo..., 120) = 120
> write(9, METRIC NAME=\swap_total\ VAL=\4194296\ TYPE=\uint32\ UNITS=\KB\ TN=\300\ TMAX=\1200\ DMAX=\0\ SLOP..., 125) = 125
> write(9, METRIC NAME=\os_name\ VAL=\Linux\ TYPE=\string\ UNITS=\\ TN=\300\ TMAX=\1200\ DMAX=\0\ SLOPE=\zero..., 118) = 118
> write(9, METRIC NAME=\cpu_user\ VAL=\2.F\ TYPE=\float\ UNITS=\%\ TN=\20\ TMAX=\90\ DMAX=\0\ SLOPE=\both\ SO..., 114) = 114
> write(9, METRIC NAME=\cpu_system\ VAL=\3.0\ TYPE=\float\ UNITS=\%\ TN=\20\ TMAX=\90\ DMAX=\0\ SLOPE=\both\ ..., 116) = 116
> =
> As you can see, there's garbage for disk_total, part_max_used, cpu_user... So all values of type float or double are not properly converted. The SLES runs under Qemu. I've also added some printfs in the host_metric_value and here's what I get: on the left the float converted by apr_* and on the right the printf(%f) !!!
> VALUE =2.G= =2.343750=
> VALUE =2.G= =2.343750=
> VALUE =9.Ö= =93.487236=
> VALUE =0.6o= =0.64=
> VALUE =0.1;= =0.119600=
> VALUE =0.00= =0.000311=
> VALUE =0.0= =0.00=
> VALUE =0.0= =0.00=
> VALUE =9.ê= =95.312500=
> VALUE =0.9= =0.94=
> VALUE =0.4Y= =0.42=
> VALUE =0.1;= =0.113054=
> VALUE =0.00= =0.000536=
> Any ideas ? Cheers,
> -- Ludovic DROLEZ Linbox / FreeALter Soft www.linbox.com www.linbox.org
-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] [Ganglia-developers] Correct counting of CPUs, Cores, Siblings (bz #84)
Hi Jarod, thanks. Your and Ben's input was really useful for detecting patterns in 2.6 based configurations. What I now need is the output from 2.4 based configs. Only multi-core and/or HT-enabled systems actually. Thanks and have a Good New Year 2007 Martin
--- Jarod Wilson [EMAIL PROTECTED] wrote:
> On Friday 22 December 2006 11:05, Martin Knoblauch wrote:
>> Hi Folks, in order to fix bz#84 for Linux, I would like to collect some data from different system configurations. Could you please create the file cpu.grep and execute the cat/grep chain below. Please report the results together with uname -a output and which distro you are running.
>> # more cpu.grep
>> processor
>> vendor
>> model name
>> physical id
>> siblings
>> core id
>> cpu cores
>> # cat /proc/cpuinfo | grep -f cpu.grep
> Here's the data from my Fedora Core 6 workstation in the office, since it's fairly interesting for this specific topic. It's a dual-socket, dual-core Xeon system with hyperthreading turned on, so two sockets, four cores, eight logical cpus...
> Linux xavier.boston.redhat.com 2.6.18-1.2849.fc6 #1 SMP Fri Nov 10 12:34:46 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
> processor : 0 vendor_id : GenuineIntel model name : Intel(R) Xeon(TM) CPU 3.00GHz physical id : 0 siblings: 4 core id : 0 cpu cores : 2
> processor : 1 vendor_id : GenuineIntel model name : Intel(R) Xeon(TM) CPU 3.00GHz physical id : 1 siblings: 4 core id : 0 cpu cores : 2
> processor : 2 vendor_id : GenuineIntel model name : Intel(R) Xeon(TM) CPU 3.00GHz physical id : 0 siblings: 4 core id : 1 cpu cores : 2
> processor : 3 vendor_id : GenuineIntel model name : Intel(R) Xeon(TM) CPU 3.00GHz physical id : 1 siblings: 4 core id : 1 cpu cores : 2
> processor : 4 vendor_id : GenuineIntel model name : Intel(R) Xeon(TM) CPU 3.00GHz physical id : 0 siblings: 4 core id : 0 cpu cores : 2
> processor : 5 vendor_id : GenuineIntel model name : Intel(R) Xeon(TM) CPU 3.00GHz physical id : 1 siblings: 4 core id : 0 cpu cores : 2
> processor : 6 vendor_id : GenuineIntel model name : Intel(R) Xeon(TM) CPU 3.00GHz physical id : 0 siblings: 4 core id : 1 cpu cores : 2
> processor : 7 vendor_id : GenuineIntel model name : Intel(R) Xeon(TM) CPU 3.00GHz physical id : 1 siblings: 4 core id : 1 cpu cores : 2
> -- Jarod Wilson [EMAIL PROTECTED]
-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
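The cpuinfo fields collected above are enough to tell sockets, cores and logical CPUs apart. The following is only a sketch of one way to count them, not the fix that went into libmetrics for bz #84: sockets are the distinct "physical id" values, cores are the distinct (physical id, core id) pairs, and logical CPUs are the "processor" entries. The function name is made up, and the code assumes "physical id" appears before "core id" within each processor block, as in Jarod's output.

```python
def count_cpus(cpuinfo_text):
    """Return (sockets, cores, logical_cpus) from /proc/cpuinfo-style text."""
    logical = 0
    sockets = set()
    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or line without a key/value pair
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
            physical_id = None  # a new processor block starts here
        elif key == "physical id":
            physical_id = value
            sockets.add(value)
        elif key == "core id":
            cores.add((physical_id, value))
    return len(sockets), len(cores), logical


if __name__ == "__main__":
    with open("/proc/cpuinfo") as handle:
        print(count_cpus(handle.read()))
```

On kernels that do not report "physical id" or "core id" the first two counts come back as 0, which is one reason the 2.4 data requested above matters.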
Re: [Ganglia-general] Ganglia+OpenBSD?
--- Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote:
> On Wed, Dec 27, 2006 at 12:38:00AM -0800, Martin Knoblauch wrote:
>> I see no problem to add OpenBSD support in 3.0.5. Just go on and check it in once you are satisfied with your stuff.
> checked it in already in revision 697.
Saw it.
>> Just out of curiosity: how similar are the BSD flavours? We already have NetBSD and FreeBSD support in.
> I used NetBSD as a base for my port (as it is the closest), sadly they are not so similar as to just work with the other source, as you can see by the diff.
Understood. Btw. you should check the use of the strings NetBSD / FreeBSD in your patch :-)
> DragonflyBSD will most likely be closer to FreeBSD and the same for MacOS X (AKA Darwin), but I have no interest in adding those yet (DragonFlyBSD could be an interesting option for clusters, but I've heard of no one using it in a cluster yet).
You realize that we already have a Darwin port, although I do not know the quality/completeness of the metrics code.
Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.0.4 released
--- Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote:
> On Mon, Dec 25, 2006 at 02:32:30AM -0800, Martin Knoblauch wrote:
>> Ho ho ho, Santa just released version 3.0.4 of Ganglia. This is mainly a bugfix release. See the ChangeLog in the tarball for a complete list of changes.
> thanks Santa, and I got to be the first kid that went to the sourceforge tree for the nicely wrapped package :) which was far nicer than that Wii that Matt is probably still waiting to get a hold of.
> since I was running tests on the last SVN anyway, I got some more platforms where gmond/gmetric (and therefore libmetrics) were tested (*):
> * Gentoo Linux 2006.1 (amd64), Fedora Core 6 (i386)
> * Solaris 9 (sparc), Solaris 10 (i386, amd64 and sparc)
> * NetBSD 2.0.2 (i386), NetBSD 3.0 (i386), NetBSD 3.1 (i386, amd64)
> * FreeBSD 6.1 (amd64)
Hi Carlo, thanks for the feedback. Could you just tell us which toolchains were used on the non-Linux platforms? Especially which compiler?
Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Ganglia+OpenBSD?
Jason, apparently configure fails to realize that you are on OpenBSD, which is not supported currently. The unknown part is telling. In order to support OpenBSD one needs to fix the recognition process in configure and add OpenBSD-specific metrics code to libmetrics. So I am afraid that it is not as easy as you believe. Btw. what is the output of config/config.guess? Cheers Martin --- Jason Faulkner [EMAIL PROTECTED] wrote: Anybody have even a direction to point me in? I'm at my wits end. Jason Faulkner wrote: I've been trying all morning (about 5 hours now, heh) to get Ganglia 3.0.3 to compile on OpenBSD to no avail. Here's the error it spits at me: ./configure --prefix=/opt ran without a hitch, but when I said make... /bin/sh ../libtool --tag=CC --mode=link /usr/bin/gcc -I.. -I. -I../srclib/expat/lib/ -I../srclib/libmetrics/ -I../srclib/apr/include/ -I../srclib/apr/include/arch/unix/ -I../srclib/confuse/src -g -O2 -Wall-o libganglia.la -rpath /opt/lib -version-info 0:0:0 -release 3.0.3 -export-dynamic become_a_nobody.lo debug_msg.lo daemon_init.lo file.lo dotconf.lo error.lo ganglia.lo hash.lo inetaddr.lo llist.lo my_inet_ntop.lo rdwr.lo readdir.lo tcp.lo protocol_xdr.lo apr_net.lo libgmond.lo -lkvm -lresolv -lpthread *** Warning: linker path does not have real file for library -lresolv. *** I have the capability to make that library automatically link in when *** you link to this library. But I can only do this if you have a *** shared version of the library, which you do not appear to have *** because I did check the linker path looking for a file starting *** with libresolv and none of the candidates passed a file format test *** using a regex pattern. Last file checked: /usr/lib//libresolv.a *** The inter-library dependencies that have been dropped here will be *** automatically added whenever a program is linked with this library *** or is declared to -dlopen it. 
/usr/bin/gcc -shared -fPIC -DPIC -o .libs/libganglia-3.0.3.so.0.0 .libs/become_a_nobody.o .libs/debug_msg.o .libs/daemon_init.o .libs/file.o .libs/dotconf.o .libs/error.o .libs/ganglia.o .libs/hash.o .libs/inetaddr.o .libs/llist.o .libs/my_inet_ntop.o .libs/rdwr.o .libs/readdir.o .libs/tcp.o .libs/protocol_xdr.o .libs/apr_net.o .libs/libgmond.o -lkvm -lpthread (cd .libs rm -f libganglia.so.0.0 ln -s libganglia-3.0.3.so.0.0 libganglia.so.0.0) ar cru .libs/libganglia.a become_a_nobody.o debug_msg.o daemon_init.o file.o dotconf.o error.o ganglia.o hash.o inetaddr.o llist.o my_inet_ntop.o rdwr.o readdir.o tcp.o protocol_xdr.o apr_net.o libgmond.o ranlib .libs/libganglia.a creating libganglia.la (cd .libs rm -f libganglia.la ln -s ../libganglia.la libganglia.la) Making all in srclib Making all in libmetrics make all-recursive Making all in unknown /bin/sh: cd: /usr/src/ganglia-3.0.3/srclib/libmetrics/unknown - No such file or directory *** Error code 1 Stop in /usr/src/ganglia-3.0.3/srclib/libmetrics (line 342 of Makefile). *** Error code 1 Stop in /usr/src/ganglia-3.0.3/srclib/libmetrics (line 204 of Makefile). *** Error code 1 Stop in /usr/src/ganglia-3.0.3/srclib (line 243 of Makefile). *** Error code 1 Stop in /usr/src/ganglia-3.0.3 (line 332 of Makefile). *** Error code 1 Stop in /usr/src/ganglia-3.0.3 (line 214 of Makefile). This is on OpenBSD 3.8. -- Jason Faulkner Systems Manager Broadwick Corporation (919) 459-2509 - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Ganglia+OpenBSD?
--- Jason Faulkner [EMAIL PROTECTED] wrote:
> http://j.oldos.org/configguess.txt
> I feel less than smart. You wanted this, didn't you: :-)
> [EMAIL PROTECTED]:/usr/src/ganglia-3.0.3/config$ ./config.guess
> i386-unknown-openbsd3.8
Guess this explains the unknown. But from the other follow-ups there seems to be hope for you.
Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
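To illustrate why an unrecognised platform ends up in a directory called "unknown", here is a rough sketch of how a build script might map the OS part of a config.guess triplet to a per-platform metrics directory. This is not Ganglia's actual configure logic; the table of prefixes and the function name are assumptions made up for this example.

```python
# Hypothetical table: OS prefix in the triplet -> metrics source directory.
# The entries are illustrative and not copied from Ganglia's configure script.
KNOWN_OS_PREFIXES = {
    "linux": "linux",
    "solaris": "solaris",
    "freebsd": "freebsd",
    "netbsd": "netbsd",
    "darwin": "darwin",
}


def metrics_dir(triplet):
    """Pick a metrics directory for a cpu-vendor-os triplet, else 'unknown'."""
    parts = triplet.split("-", 2)
    if len(parts) != 3:
        return "unknown"
    os_name = parts[2]
    for prefix, directory in KNOWN_OS_PREFIXES.items():
        if os_name.startswith(prefix):
            return directory
    # No rule matched, so the build falls back to a directory that
    # does not exist, just like the failing "cd .../unknown" in the log.
    return "unknown"


print(metrics_dir("i386-unknown-openbsd3.8"))
```

With a table like this, supporting OpenBSD means adding one recognition rule plus the platform-specific metrics code behind it, which matches the two steps described earlier in the thread.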
[Ganglia-general] New/Last Snapshot for 3.0.4
Hi, please have a look at the 2nd 3.0.4 snapshot located at: http://www.knobisoft.de/ganglia/ganglia-3.0.4.200609241751.tar.gz
This snapshot contains the following changes compared to the last one:
- fixup of the corrupted JPG images
- move libmetrics to top-level in order to prepare removal of external sources in 3.1
- fix a stray debug message going to STDOUT instead of STDERR
- fix two stupid HP-UX syntax errors reported ages ago
The full list of changes is in the ChangeLog. There has not been a lot of feedback since the first snapshot. If nothing serious comes up during the next week, I will push out 3.0.4.
Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Problem with metrics
--- Ben Hartshorne [EMAIL PROTECTED] wrote:
> On Tue, Sep 19, 2006 at 03:11:26PM +0200, Rafal Masztalerz wrote:
>> Hi I added some new metrics for my ganglia software using the gmetric command. When I run the webpage without parameters: http://computer/ganglia/ everything seems to be ok and I can choose my new metrics. But when I try to do other things on this page, for example, when I choose some metric (bytes_out), then my new metrics are missing on the new/refreshed page. http://computer/ganglia/?m=bytes_outr=hours=descendingc=comph=sh=1hc=4
> Rafal, be careful that your metric only sends numbers. In some versions of ganglia, if your script that reports the gmetric accidentally sends letters instead, Bad Things(tm) happen. I wrote a script to parse the output of 'who' to count the number of logged in users, but I did it wrong. Occasionally it got a word instead of a number. This caused unexplained metric-loss throughout my ganglia installation. A newer version of gmetric fixed this problem, but it is a good place to start looking. I'm sorry, but I don't remember what versions are affected.
> -ben
> -- Ben Hartshorne email: [EMAIL PROTECTED] http://ben.hartshorne.net
The fix for the gmetric bug went in on 25-Jan-2006. So, it should be in 3.0.3.
Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
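Ben's advice can be turned into a small guard in whatever script feeds gmetric. The sketch below only shows the check itself; the function name is hypothetical, and how the value is then handed to gmetric is left out on purpose.

```python
import math


def is_numeric_metric(value):
    """Return True if value parses as a finite number, False otherwise."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return False
    # float() also accepts "nan" and "inf", which are not useful samples.
    return math.isfinite(number)


# A script counting logged-in users would check its result before reporting it.
for candidate in ["3", "0.25", "pts/0", ""]:
    print(repr(candidate), is_numeric_metric(candidate))
```

Skipping a sample when the check fails is usually safer than sending a placeholder, since a missing point only leaves a gap in the graph.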
Re: [Ganglia-general] First Snapshot for 3.0.4
--- Bernard Li [EMAIL PROTECTED] wrote:
>> It is the first release after moving from CVS to SVN. Changes compared to 3.0.3 are:
>> - Fix bz #110 by allowing higher sampling rates for cpu/net/load/mem in Linux/Cygwin. Likely needs similar changes in other platforms.
>> - Add Yemi's Host-Spoofing patch (bz #99)
>> - Fix bz #77 (Diskless NFS Root not treated correctly)
>> - Compile fixes for IRIX (bz #73/79)
>> - Fix locking problems in gmetad (bz #56)
>> - Fix incorrect writing of RRDs (bz #105)
>> - Increases the number of rows in newly created RRAs (bz #33)
>> - Better handling of bonding interfaces in Linux (bz #102/104)
>> - Fix for network metrics overrun by Andreas Schoenfeld in AIX
>> - SVN related cleanups in distribution targets
>> - Take some of the proposed AIX changes from Michael Perzl. The real stuff will come in 3.1.x
> I would also add:
> - Better RPM support for SUSE Linux 10.0/10.1 x86 and x86_64
> Cheers, Bernard
Oops. Sorry. Yes, the list is not necessarily complete. I should also have mentioned the generated ChangeLog, which gives some more info.
Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] monitoring
Nagios? Cheers Martin
--- Dirk Roessler [EMAIL PROTECTED] wrote:
> Does someone know of an easy to install and easy to use solution for monitoring and sending email notifications of down nodes and health state on a Linux HPC cluster?
> Dirk
-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Ganglia scaling testing?
-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Obtaining Immediate Interval Data From Ganglia
Correct. The code below limits the sampling rate for the cpu*, load*, mem* and net* graphs. Setting the thresholds to 0 will give you 1 second accuracy. Or nice furry graphs as Richard said (actually the furriness is what the original authors wanted to prevent :-). Personally I doubt that sampling load* and mem* at that rate makes sense. cpu* and net* may make sense. Richard, yes please file a report. Unfortunately I spoke too soon when I mentioned that we should get rid of the intervals altogether. The reason is that we need to compute differences for the cpu* and net* metrics (they are rates after all). If we want to have sub-second sampling rates, we need to use gettimeofday instead of time.
--- [EMAIL PROTECTED] wrote:
> If you do want to do fast polling on the Linux or cygwin gmond, I found some hardwired code in there which effectively limits the polling rate for some metrics no matter what you put in the config files. (Sorry martin, have not raised a bug report yet). Anyway: the code below is in the cygwin and linux metric.c files.
>
>   typedef struct {
>     uint32_t last_read;
>     uint32_t thresh;
>     char *name;
>     char buffer[BUFFSIZE];
>   } timely_file;
>
>   timely_file proc_stat    = { 0, 15, "/proc/stat" };
>   timely_file proc_loadavg = { 0, 15, "/proc/loadavg" };
>   timely_file proc_meminfo = { 0, 30, "/proc/meminfo" };
>   timely_file proc_net_dev = { 0, 30, "/proc/net/dev" };
>
>   char *update_file(timely_file *tf)
>   {
>     int now, rval;
>     now = time(0);
>     if (now - tf->last_read > tf->thresh) {
>       rval = slurpfile(tf->name, tf->buffer, BUFFSIZE);
>       if (rval == SYNAPSE_FAILURE) {
>         err_msg("update_file() got an error from slurpfile() reading %s", tf->name);
>         return (char *)SYNAPSE_FAILURE;
>       }
>       else
>         tf->last_read = now;
>     }
>     return tf->buffer;
>   }
>
> I have set those timeout values to zero, which works well and gives me nice spiky furry graphs.
> - richard
-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
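The throttling idea in update_file, and the point about needing a finer clock for sub-second rates, can be sketched as follows. This is an illustration in Python rather than a patch for metric.c: the class name, the injectable reader and clock parameters are assumptions added so the behaviour can be tried without /proc, and time.monotonic stands in for the gettimeofday call mentioned above.

```python
import time


class TimelyFile:
    """Caches a file's contents and re-reads it at most once per thresh seconds."""

    def __init__(self, path, thresh, reader=None, clock=time.monotonic):
        self.path = path
        self.thresh = thresh      # seconds; may be fractional, e.g. 0.5
        self.last_read = None     # clock value of the last successful read
        self.buffer = ""
        self._reader = reader or self._read_file
        self._clock = clock

    @staticmethod
    def _read_file(path):
        with open(path) as handle:
            return handle.read()

    def update(self):
        """Return the cached contents, refreshing them if they are too old."""
        now = self._clock()
        if self.last_read is None or now - self.last_read > self.thresh:
            self.buffer = self._reader(self.path)
            self.last_read = now
        return self.buffer


if __name__ == "__main__":
    proc_stat = TimelyFile("/proc/stat", 0.5)
    print(proc_stat.update()[:60])
```

Because the clock returns fractional seconds, a threshold below one second behaves as expected, whereas the integer time(0) comparison in the quoted code cannot refresh more than once per second even with a threshold of 0.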
Re: [Ganglia-general] Troubles linking: Linux (SUSE 9.3) on Itanium (ia64, Altix)
On a RedHat-ish distro you would need to check that the RPMs for libpng *and* libpng-devel are installed. Not sure about SuSE though. Martin
--- Ryurick Marius Hristev [EMAIL PROTECTED] wrote:
> Hello, I was trying to compile the ganglia package (rpm version) on the following system: SuSE 9.3 (Linux) running on Itaniums (ia64, SGI Altix) and I am getting this error:
> gcc -O0 -I../lib -I../gmond -I../srclib/expat/lib/ -g -O2 -Wall -D_REENTRANT -o gmetad gmetad.o cmdline.o data_thread.o server.o process_xml.o rrd_helpers.o conf.o type_hash.o xml_hash.o cleanup.o ../lib/.libs/libganglia.a /usr/lib/librrd.a -lpng -lz -lm ../srclib/expat/lib/.libs/libexpat.a -ldl -lresolv -lnsl -lpthread
> /usr/lib/gcc-lib/ia64-suse-linux/3.3.3/../../../../ia64-suse-linux/bin/ld: cannot find -lpng
> but I do have a /usr/lib/libpng.so.3
> Are there any known quirks with respect to my OS/Distro and CPU/Machine? (I am new to this one, apologies if I missed something obvious). TIA Cheers,
> -- Ryurick M. Hristev -- Systems Administrator (Unix) University of Queensland -- ITS Dept. mailto: [EMAIL PROTECTED] the greatest hacking experience: hack your own mind -- me
-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] changed ip
Hi Toney, my first guess would be that
a) you are using multicast,
b) your default gateway goes via eth0, and
c) your compute nodes are on the 192.168.180.x network.
After the change the MC packets are still expected via eth0, but come in from eth1. Try adding this from the documentation:
  mcast_if=eth1
to your head node's gmond.conf, and run
  route add -host 239.2.11.71 dev eth1
Hope this helps Martin
--- toney samuel [EMAIL PROTECTED] wrote:
> I have a 4 node cluster. My head node has two gigabit cards and an infiniband card. My cluster IP configuration is
>   eth0 192.168.180.17/255.255.252.0
>   ipoib0 192.168.0.1/255.255.255.0
> I installed ganglia with this configuration and it was working properly. Later I changed my network configuration to this:
>   eth0 192.168.1.1/255.255.255.0
>   eth1 192.168.180.17/255.255.252.0
>   ipoib0 192.168.0.1/255.255.255.0
> Now I can't see any information in my web page. Please advise how to resolve this issue. Regards.
-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] not showing all hosts
--- Ian Cunningham [EMAIL PROTECTED] wrote:
> Solution B: increase the Time To Live or ttl on the gmond multicast packets. This assumes that multicast packets can get from one vlan to the other. The configuration option used to be available in the 2.x codebase, but I don't see it in 3.0.x code. I think it would be mcast_ttl but I can't say if that will work or not.
It is ttl in the udp_send_channel section. It will be used if mcast_join is set.
Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
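For readers unfamiliar with what the ttl setting controls: on common IP stacks multicast datagrams are sent with a TTL of 1 by default, so they are not forwarded beyond the local subnet, and each router hop decrements the value. The sketch below shows the underlying socket option using Python's standard socket module; it is not gmond code, and the function name is made up for illustration.

```python
import socket


def make_mcast_sender(ttl):
    """Create a UDP socket whose outgoing multicast packets carry the given TTL."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # IP_MULTICAST_TTL limits how many router hops a multicast packet may cross.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock


sender = make_mcast_sender(4)
print(sender.getsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL))
sender.close()
```

Raising the TTL only helps if the routers between the VLANs actually forward multicast, which is the assumption Ian states above.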
Re: [Ganglia-general] Ganglia History
Adam, do you still have those error messages? And: which version of the web-frontend are you using? We fixed quite a few of the php messages in 3.0.3. Martin
--- Adam Brust [EMAIL PROTECTED] wrote:
> At the beginning of the month, ganglia/php were producing massive amounts of httpd errors which filled up my / partition, causing the machine to crash... Since then, I believe my ganglia history has been affected... I tried to restore from the three tar files located in /var/lib/ganglia/archives/ and each one only had about a week's worth of history... I was able to restore from an earlier backup, which has my previous history, although now I am missing roughly these last three weeks. Also, I'm not certain if the problem is corrected now... I don't know if I'll lose this history again upon a reboot.
> -adam
> Martin Knoblauch wrote:
>> Adam, that sounds OK. Do you see any messages in either /var/log/messages or in your webserver's log files? Martin
>> --- Adam Brust [EMAIL PROTECTED] wrote:
>>> Ian, Thanks for your reply. My rrd files appear to be in the default /var/lib/ganglia directory, I could not find any other instances of them. gmetad is running as nobody and the rrds are owned by nobody... do you know if that's the correct user/permissions? thanks, adam
>>> Ian Cunningham wrote:
>>>> Look at where gmetad is storing the rrd files now. You can find it in your gmetad.conf under rrd_rootdir. Maybe you didn't specify it for
-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] New issue with hosts reporting
Hi Mark, have you configured a tcp_accept_channel for each of your two clusters' master gmonds? Then you may need to define an acl for your gmetad server. Something like: tcp_accept_channel { port = 8649 acl { default = deny access { ip = ip-of-the-gmetad-server mask = 32 action = allow } } } Cheers Martin --- Mark Haney [EMAIL PROTECTED] wrote: David Zaltron wrote: Probably you have a gmond configuration on each node that multicasts the cluster status to every node. For example, if you have a configuration like this in the nodes: cluster { name = dummy_cluster } udp_send_channel { mcast_join = 239.2.11.71 port = 8649 } udp_recv_channel { mcast_join = 239.2.11.71 port = 8649 bind = 239.2.11.71 } This means that every node knows it belongs to the dummy_cluster, and every gmond can return the status of the entire cluster when asked on the default TCP port 8649, because it knows about every other node (they all talk to each other on the same multicast channel). You can solve this by unicasting the traffic to the node itself: udp_send_channel { host = hostname or 127.0.0.1 port = 8649 } udp_recv_channel { port = 8649 } In this way you can simulate a cluster of a single node, in reality monitoring the single node. Okay, I did that and that /sort of/ fixed it, except that now I do not see the nodes in my web interface. Keep in mind the web interface is running on a completely separate box that is neither newton nor winterstar. So, how do I get the node showing up in the web interface now? (And David, I apologize for sending to you and not the list, my fingers got ahead of me today.) -- Fere libenter homines id quod volunt credunt. Mark Haney Sr. Systems Administrator ERC Broadband (828) 350-2415 -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
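The two suggestions in this thread (unicast to the node itself, plus an acl that admits the gmetad server) can be combined into one per-node gmond.conf. The following is only a sketch under stated assumptions: 192.0.2.10 is a hypothetical placeholder for the address of the gmetad server, and the cluster name is the dummy name used in the quoted example.

```
/* gmond.conf on each monitored node: illustrative sketch */
cluster {
  name = "dummy_cluster"     /* example name from the thread */
}

/* Send metrics only to this node's own gmond, not to a multicast group. */
udp_send_channel {
  host = 127.0.0.1
  port = 8649
}

udp_recv_channel {
  port = 8649
}

/* Let only the gmetad server read the XML state over TCP. */
tcp_accept_channel {
  port = 8649
  acl {
    default = "deny"
    access {
      ip = 192.0.2.10        /* hypothetical address of the gmetad server */
      mask = 32
      action = "allow"
    }
  }
}
```

With this layout each gmond only knows about its own host, so the gmetad on the separate web-frontend box would have to poll every node it should display.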
Re: [Ganglia-general] Ganglia History
Adam, that is unexpected. The RRDs are supposed to keep one year (the default) of history. Martin --- Adam Brust [EMAIL PROTECTED] wrote: I recently had to reboot the Front End of my cluster... upon the reboot, my Ganglia history is gone... Ganglia is only keeping data from the time of the reboot... it was nearly a year's worth of history... can anyone offer any suggestions? thanks, -adam -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Ganglia 3.0.3 compilation on AIX 5.2
Hi Knut, there is supposed to be a README.AIX file in the 3.0.3 distribution. It explains a few things. Basically, building with xlc is not supported; there are a few hints on how to do it under 2). And you absolutely need to build non-shared, which is explained under 1). That is most likely where your core dump comes from. Cheers Martin --- Knut Hellebø [EMAIL PROTECTED] wrote: Regards, I'm trying to compile Ganglia 3.0.3 on an AIX 5.2 box using the native IBM compiler and have encountered two problems when compiling and one fatal problem when running gmond. Compilation problems: 1. The compilation breaks on the file ./srclib/confuse/src/lexer.c at line 786, which stems from the lex file lexer.l line 82: #line 82 lexer.l cfg->line++; /* keep track of line number */ YY_BREAK saying undeclared identifier cfg. I put a cfg_t *cfg; declaration in at line 696 and then the compilation proceeds. 2. Also, I need to use the -qcpluscmt switch, which allows C++ comment style, or else the compilation bombs in gmond.c. 3. Running gmond always crashes with a SIGSEGV. The trace shows that the crash occurs when opening the /etc/gmond.conf file. A dbx session on the core file shows the crash seems to be related to the parser file fix I did in section 1 above. Here's the backtrace: (dbx) where cfg_yylex() at 0x1000af28 cfg_parse_internal() at 0x1000821c cfg_parse_fp() at 0x1000a5a0 cfg_parse() at 0x1000a684 Ganglia_gmond_config_create() at 0x10006d58 process_configuration_file() at 0x100036dc main() at 0x14b4 What's up here? -- Knut Hellebø, Hydro IS Partner ESI (Unix) Team, E-mail: [EMAIL PROTECTED] | DAMN GOOD COFFEE !! (and hot too) Dale Cooper, FBI -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de