Hey Daniel, I think my supervisor would like to know why this would improve
things; otherwise I don't think he'll bite.
Also, I'm wondering if a script like this would be the way to implement what you
just mentioned:
/etc/init.d/gmond stop
/etc/init.d/gmetad stop
rm -rf /var/lib/ganglia/rrds/*
dd if=/dev/zero of=/var/lib/ganglia/rrds.img bs=1k count=1048576   # 1 GiB backing file
mke2fs -F -j /var/lib/ganglia/rrds.img   # -F forces mkfs on a non-block file; -j makes it ext3
mount -o loop /var/lib/ganglia/rrds.img /var/lib/ganglia/rrds
chown ganglia:ganglia /var/lib/ganglia/rrds   # gmetad runs as the ganglia user
/etc/init.d/gmond start
/etc/init.d/gmetad start
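One gap in the script above worth flagging (my addition, not something from your
mail, Daniel): a loop mount set up by hand won't survive a reboot, so the
backing file would also need an /etc/fstab entry. A small helper to generate
one, as a sketch with assumed paths:

```shell
# Sketch: emit an /etc/fstab line that remounts the loopback image at boot.
# The image path and mount point below are assumptions; adjust for your layout.
rrds_fstab_entry() {
    local img="$1" mnt="$2"
    printf '%s  %s  ext3  loop,noatime  0 0\n' "$img" "$mnt"
}

# Append the generated line to /etc/fstab as root, e.g.:
#   rrds_fstab_entry /var/lib/ganglia/rrds.img /var/lib/ganglia/rrds >> /etc/fstab
rrds_fstab_entry /var/lib/ganglia/rrds.img /var/lib/ganglia/rrds
```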
Warm regards,
Wes
________________________________
From: Daniel Rich [mailto:[email protected]]
Sent: Friday, October 15, 2010 11:56 AM
To: Bernard Li
Cc: Stevens, Weston J; [email protected]
Subject: Re: [Ganglia-general] Booting Ganglia becoming a hassle
This isn't specific to Ganglia; I ran into the same issue with Cacti years ago,
before I started using the Cacti Boost plugin. It can be worse with Ganglia,
however, because you are typically collecting many more metrics and updating RRD
files at a much higher frequency.
In my environment I didn't want to use a RAM disk due to the risk of losing
data. The method that worked for me was to create a filesystem in a file and
then loop-mount that filesystem under /var/lib/ganglia/rrds. The system then
sees a single file I/O operation when you first open the directory, rather than
an individual I/O op every time an RRD is updated.
I have not done this with my current server yet, but I'm getting close to the
point where I am going to need to...
There was a discussion about this on the mailing list ages ago:
http://www.mail-archive.com/[email protected]/msg00553.html
Bernard Li wrote:
Hi Weston:
On Fri, Oct 15, 2010 at 9:18 AM, Stevens, Weston J
<[email protected]> wrote:
I realize I should probably be a lot clearer.
Our environment/setup: Ganglia 3.1.7 and RRDTool 1.2.23 on a 3-node physical
cluster running CentOS release 5.3 (Final) on an x86 64-bit architecture. All 3
nodes in the cluster share the same RAID 10 disks (RAID 1 pairs cciss/c0d0 and
cciss/c0d1), I believe.
It would seem the more metrics we add, the more problems we get. Ganglia may be
scalable for hundreds of nodes, but perhaps not hundreds of additional custom
metrics (all written in C)?
After adding dozens of metrics we encounter more and more problems, e.g. graphs
(RRDs) missing, sporadic recording of data or data not being collected at all,
and so forth. The head node never seems to have these problems, however; just
the worker nodes, which may mean it's a networking issue. The built-in default
metrics are not exempt from these problems: while they always work fine by
themselves, they can be "messed with" by the new metrics, if that's the
appropriate term.
I wrote a script that restarts Ganglia until ALL the RRDs get created, since
gmetad will not always create them all on a single start (sometimes it will).
This helps a lot in kickstarting things and encouraging things to work, but I
still encounter problems where metrics are not getting collected from the
worker nodes even with all the RRDs reporting for duty.
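Roughly, that restart loop looks like the following sketch (the RRD directory,
expected count, and sleep interval here are placeholders, not our exact values):

```shell
# Sketch: restart gmetad until the expected number of RRD files exists,
# up to a bounded number of attempts. Paths and counts are placeholders.
count_rrds() {
    find "$1" -name '*.rrd' 2>/dev/null | wc -l
}

restart_until_complete() {
    local rrd_dir="$1" expected="$2" max_tries="$3" i=0
    while [ "$(count_rrds "$rrd_dir")" -lt "$expected" ] && [ "$i" -lt "$max_tries" ]; do
        /etc/init.d/gmetad restart
        sleep 30    # give gmetad time to poll every gmond and create files
        i=$((i + 1))
    done
    # succeed only if the expected count was eventually reached
    [ "$(count_rrds "$rrd_dir")" -ge "$expected" ]
}

# Example invocation: restart_until_complete /var/lib/ganglia/rrds 450 10
```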
Perhaps the traditional gmetric cron job is the thing to use? I mean, this new
way of collecting custom metrics may still be unreliable?
Thanks for providing more information on your problem.
I can tell you from blog posts that Facebook is supposedly tracking 5
million metrics with Ganglia. I don't know how many hosts they have,
nor their polling intervals, but at least that scale seems to be doable:
http://linuxsysadminblog.com/2010/09/a-day-in-the-life-of-facebook-operations
Granted, I do not know how they are injecting metrics. They could be
using the traditional gmetric, Python DSO or C DSO.
I can tell you that by far, the Python DSO is more popular with our
users, at least for users running Ganglia 3.1.
I'd like to investigate this further, so could you please provide us
with the number of C modules each host has and also the total number
of metrics your gmetad is tracking? What would also be good for
troubleshooting is to disable all the default metrics and run only the
C modules, then see how many you can run before gmond becomes
unstable. Once I have that data I will try to reproduce this. Also, if
you could make your C modules available, that would be helpful as
well. One last thing: try running strace against gmond and/or run it
in debug mode -- perhaps that would give us more insight into the
problem.
Cheers,
Bernard
_______________________________________________
Ganglia-general mailing list
[email protected]<mailto:[email protected]>
https://lists.sourceforge.net/lists/listinfo/ganglia-general
--
Dan Rich <[email protected]>          | "Step up to red alert!" "Are you sure, sir?
http://www.employees.org/~drich/         | It means changing the bulb in the sign..."
                                         | - Red Dwarf (BBC)