Hi Weston:

On Fri, Oct 15, 2010 at 9:18 AM, Stevens, Weston J
<[email protected]> wrote:

> I realize I should probably be a lot clearer.
>
> Our environment/setup is Ganglia 3.1.7 and RRDTool version 1.2.23 on a 3-node
> physical cluster running CentOS release 5.3 (Final) on an x86 64-bit
> architecture. All 3 nodes in our cluster share the same RAID 10 disks for
> RAID 1 pairs cciss/c0d0 and cciss/c0d1, I believe.
>
> It would seem the more metrics we add, the more problems we get. Ganglia may 
> be scalable for hundreds of nodes, but perhaps not hundreds of additional 
> custom metrics (all written in C)?
>
> After adding dozens of metrics we encounter more and more problems, e.g.
> graphs (RRDs) missing, sporadic recording of data or data not being collected
> at all, and so forth. The head node never seems to have these problems,
> however, just the worker nodes. Thus this may mean it's a networking issue.
> The built-in default metrics are not exempt from these problems: while they
> always work fine by themselves, they can be "messed with" by the new metrics,
> if that's the appropriate term.
>
> I wrote a script that reboots Ganglia until ALL the RRDs get created, 
> otherwise gmetad will not always create them all on a single boot (sometimes 
> it will). This seems to help a lot in kickstarting things and encouraging 
> things to work, but I still seem to encounter problems where metrics are not 
> getting collected from the worker nodes even with all the RRDs reporting for 
> duty.
>
> Perhaps the traditional gmetric cron job is the thing to use? I mean, this 
> new way of collecting custom metrics may still be unreliable?

Thanks for providing more information on your problem.

According to a blog post, Facebook is reportedly tracking 5 million
metrics with Ganglia.  I don't know how many hosts they have or what
their polling intervals are, but that scale at least seems to be doable:

http://linuxsysadminblog.com/2010/09/a-day-in-the-life-of-facebook-operations

Granted, I do not know how they are injecting metrics.  They could be
using the traditional gmetric, Python DSO or C DSO.

I can tell you that the Python DSO is by far the more popular choice
with our users, at least for users running Ganglia 3.1.
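
For reference, here is a minimal sketch of what a Python DSO metric
module looks like, in case it helps to compare against your C modules.
The metric name, group and the random data source are made up for
illustration; only the metric_init / metric_cleanup / call_back shape
follows the gmond Python module interface.

```python
import random  # stand-in data source; a real module would read a real value


def read_queue_depth(name):
    """Callback that gmond invokes with the metric name; returns the value."""
    # Hypothetical measurement: replace with a real reading.
    return random.randint(0, 100)


def metric_init(params):
    """Called once when gmond loads the module; returns metric descriptors."""
    descriptor = {
        "name": "example_queue_depth",   # hypothetical metric name
        "call_back": read_queue_depth,
        "time_max": 90,
        "value_type": "uint",
        "units": "items",
        "slope": "both",
        "format": "%u",
        "description": "Hypothetical queue depth",
        "groups": "example",
    }
    return [descriptor]


def metric_cleanup():
    """Called when gmond shuts down; nothing to release in this sketch."""
    pass


if __name__ == "__main__":
    # Lets the module be exercised by hand, outside of gmond.
    for d in metric_init({}):
        print(d["name"], d["call_back"](d["name"]))
```

Running the file directly is a quick way to check that the callbacks
return sane values before gmond ever loads it.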

I'd like to investigate this further, so could you please tell us how
many C modules each host has and the total number of metrics your
gmetad is tracking?  It would also help with troubleshooting if you
disabled all the default metrics and ran only the C modules, to see
how many you can run before things become unstable.  Once I have that
data I will try to reproduce this.  If you could make your C modules
available, that would be helpful as well.  One last thing: try running
strace against gmond and/or running gmond in debug mode; perhaps that
will give us more insight into the problem.
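
One rough way to get the metric counts is to count the RRD files that
gmetad has written per host.  The sketch below assumes the RRDs live
under /var/lib/ganglia/rrds in a cluster/host/metric.rrd layout and
that summary data sits in __SummaryInfo__ directories; adjust the path
if your gmetad.conf points somewhere else.

```python
import os
from collections import Counter

# Assumed RRD root; change it to match your gmetad configuration.
RRD_ROOT = "/var/lib/ganglia/rrds"


def count_rrds(rrd_root):
    """Return a Counter mapping (cluster, host) to the number of .rrd files."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(rrd_root):
        parts = os.path.relpath(dirpath, rrd_root).split(os.sep)
        # Only look at <cluster>/<host> directories, not summary directories.
        if len(parts) != 2 or "__SummaryInfo__" in parts:
            continue
        n = sum(1 for f in filenames if f.endswith(".rrd"))
        if n:
            counts[(parts[0], parts[1])] = n
    return counts


if __name__ == "__main__":
    per_host = count_rrds(RRD_ROOT)
    for (cluster, host), n in sorted(per_host.items()):
        print(f"{cluster}/{host}: {n}")
    print("total:", sum(per_host.values()))
```

Comparing the per-host numbers between the head node and the worker
nodes should also show which RRDs are failing to be created.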

Cheers,

Bernard

_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general