This isn't specific to Ganglia; I ran into the same issue with Cacti
years ago, before I started using the Cacti boost plugin.  It can be
worse with Ganglia, however, because you are typically collecting many
more metrics and updating RRD files at a much shorter interval.

In my environment I didn't want to use a RAM disk due to the risk of
losing data.  The method that worked for me was to create a filesystem
in a file and then loop-mount that filesystem under
/var/lib/ganglia/rrds.  The system then sees it as a single file I/O
operation when you first open the directory, not an individual I/O
operation every time an RRD is updated.
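
A minimal sketch of those steps, for illustration only.  The image path,
the 2048 MB size, and the choice of ext3 are assumptions, not details of
the setup described above.  The script only prints the dd, mkfs.ext3 and
mount commands; it does not run them, since they need root and gmetad
should be stopped (and any existing RRDs copied over) first.

```python
import shlex


def loop_mount_commands(image, mount_point, size_mb):
    """Build the commands for a file-backed filesystem mounted via loop.

    All arguments are hypothetical example values supplied by the caller.
    """
    if size_mb <= 0:
        raise ValueError("size_mb must be positive")
    return [
        # 1. Allocate the backing file, filled with zeros.
        ["dd", "if=/dev/zero", f"of={image}", "bs=1M", f"count={size_mb}"],
        # 2. Create a filesystem inside it (-F: target is not a block device).
        #    ext3 is an assumption; any filesystem you trust would do.
        ["mkfs.ext3", "-F", image],
        # 3. Loop-mount the file over the RRD directory.
        ["mount", "-o", "loop", image, mount_point],
    ]


def render(commands):
    """Turn argument lists into shell-quoted command lines."""
    return [" ".join(shlex.quote(arg) for arg in cmd) for cmd in commands]


if __name__ == "__main__":
    # Hypothetical image location and size; adjust to your RRD volume.
    cmds = loop_mount_commands(
        "/var/lib/ganglia/rrds.img", "/var/lib/ganglia/rrds", 2048
    )
    for line in render(cmds):
        print(line)
```

Review the printed commands before running them by hand as root.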

I have not done this with my current server yet, but I'm getting close
to the point where I am going to need to...

There was a discussion about this on the mailing list ages ago:
http://www.mail-archive.com/[email protected]/msg00553.html

Bernard Li wrote:
> Hi Weston:
>
> On Fri, Oct 15, 2010 at 9:18 AM, Stevens, Weston J
> <[email protected]> wrote:
>
>   
>> I realize I should probably be a lot clearer.
>>
>> Our environment/setup is Ganglia 3.1.7 with RRDTool version 1.2.23 on a 
>> 3-node physical cluster running CentOS release 5.3 (Final) on an x86 
>> 64-bit architecture. All 3 nodes in our cluster share the same RAID 10 disks 
>> for RAID 1 pairs cciss/c0d0 and cciss/c0d1, I believe.
>>
>> It would seem the more metrics we add, the more problems we get. Ganglia may 
>> be scalable for hundreds of nodes, but perhaps not hundreds of additional 
>> custom metrics (all written in C)?
>>
>> After adding dozens of metrics we encounter problems more and more, i.e. 
>> graphs (RRDs) missing, sporadic recording of data or data not being 
>> collected at all, and so forth. The head node never seems to have these 
>> problems, however, just the worker nodes, so this may mean it's a 
>> networking issue. The built-in default metrics are not exempt from these 
>> problems: while they always work fine by themselves, they can be "messed 
>> with" by the new metrics, if that's the appropriate term.
>>
>> I wrote a script that reboots Ganglia until ALL the RRDs get created, 
>> otherwise gmetad will not always create them all on a single boot (sometimes 
>> it will). This seems to help a lot in kickstarting things and encouraging 
>> things to work, but I still seem to encounter problems where metrics are not 
>> getting collected from the worker nodes even with all the RRDs reporting for 
>> duty.
>>
>> Perhaps the traditional gmetric cron job is the thing to use? I mean, this 
>> new way of collecting custom metrics may still be unreliable?
>>     
>
> Thanks for providing more information on your problem.
>
> I can tell you from blog posts that Facebook is supposedly tracking 5
> million metrics with Ganglia.  I don't know how many hosts they have,
> nor their polling intervals, but at least this seems to be doable:
>
> http://linuxsysadminblog.com/2010/09/a-day-in-the-life-of-facebook-operations
>
> Granted, I do not know how they are injecting metrics.  They could be
> using the traditional gmetric, Python DSO or C DSO.
>
> I can tell you that by far, the Python DSO is more popular with our
> users, at least for users running Ganglia 3.1.
>
> I'd like to investigate this further, so could you please provide us
> with the number of C modules each host has and also the total number
> of metrics your gmetad is tracking.  What would also be good for
> troubleshooting is for you to disable all the default metrics and
> simply run the C modules, and see how many you can run stably until it
> starts becoming unstable.  Once I have that data I will try to
> reproduce this.  Also if you could make your C modules available, that
> would be helpful as well.  One last thing, try running strace against
> gmond and/or run it in debug mode -- perhaps it would give us more
> insight into the problem.
>
> Cheers,
>
> Bernard
>
> ------------------------------------------------------------------------------
> Download new Adobe(R) Flash(R) Builder(TM) 4
> The new Adobe(R) Flex(R) 4 and Flash(R) Builder(TM) 4 (formerly 
> Flex(R) Builder(TM)) enable the development of rich applications that run
> across multiple browsers and platforms. Download your free trials today!
> http://p.sf.net/sfu/adobe-dev2dev
> _______________________________________________
> Ganglia-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/ganglia-general
>   


-- 
Dan Rich <[email protected]> |   http://www.employees.org/~drich/
                               |  "Step up to red alert!"  "Are you sure, sir?
                               |   It means changing the bulb in the sign..."
                               |          - Red Dwarf (BBC)
