On Fri, Dec 11, 2009 at 1:34 PM, Daniel Pocock <dan...@pocock.com.au> wrote:
> Thanks for sharing this - could you comment on the total number of RRDs
> per gmetad, and do you use rrdcached?
The largest colo has 140175 RRDs, and we use the tmpfs + cron hack; no
rrdcached.

> I was thinking about gmetads attached to the same SAN, not a remote FS
> over IP. In a SAN, each gmetad has a physical path to the disk (over
> fibre channel) and there are some filesystems (e.g. GFS) and locking
> systems (DLM) that would allow concurrent access to the raw devices. If
> two gmetads mount the filesystem concurrently, you could tell one gmetad
> `stop monitoring cluster A, sync the RRDs' and then tell the other
> gmetad to start monitoring cluster A.
>
> DLM is quite a heavyweight locking system (cluster manager and
> heartbeat system required), some enterprises have solutions like Apache
> Zookeeper (Google has one called Chubby) and they can potentially allow
> the gmetad servers to agree on who is polling each cluster.

I see. While I'm sure this solution works for many people and might be
popular in HPC environments, I'm not really keen on it as something we'd
want to go with ourselves. We tend to stick to a "share nothing" design,
which I realize has cons too, but as always it's a matter of tradeoffs,
and even an implementation of Paxos like Chubby is no silver bullet.

The other thing is, of course, cost. SANs aren't free, and if I'm a
small gig but for some reason actually have a clue and recognize the
importance of instrumenting everything, I wouldn't want to be forced to
add shared storage just to avoid losing data.

>> I see two possible solutions:
>> 1. client caching
>> 2. built-in sync feature
>>
>> In 1. gmond would cache data locally if it could not contact the
>> remote end. This imho is the best solution because it helps not only
>> with head failures and maintenance, but possibly addresses a whole
>> bunch of other failure modes too.
>
> The problem with that is that the XML is just a snapshot. Maybe the XML
> could contain multiple values for each metric, e.g. all values since
> the last poll?
> There would need to be some way of limiting memory usage too, so that
> an agent doesn't kill the machine if nothing is polling it.

Indeed, OS resource usage for caching should be tightly controlled. RRD
does a pretty good job at that; for example, I know people who use
collectd (which supports multiple output streams) to send data remotely
and also keep a local copy with different retention policies, solving
that problem.

> This would be addressed by the use of SAN - there would only be one RRD
> file, and the gmetad servers would need to be in some agreement so that
> they both don't try to write the same file at the same time.

Sure, but even with a SAN you'd have to add some intelligence to gmetad,
which from my pov is more than half of the work needed to achieve gmetad
reliability and redundancy while keeping its current distributed design.

--
"Behind every great man there's a great backpack" - B.

_______________________________________________
Ganglia-developers mailing list
Gangliafirstname.lastname@example.org
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
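For readers wondering what the "tmpfs + cron hack" mentioned above looks
like in practice, here is a rough sketch. The paths, sync interval, and
script name are illustrative assumptions, not an exact description of any
particular setup: gmetad writes its RRDs into a tmpfs mount (so the heavy
RRD update traffic never hits disk), and cron periodically copies the
files to persistent storage, so a crash or reboot loses at most one sync
interval of data.

```shell
#!/bin/sh
# sync_rrds RAMDIR DISKDIR
# Copy RRDs from the tmpfs mount that gmetad writes to (RAMDIR) onto
# persistent disk (DISKDIR). Meant to be run from cron every few minutes.
sync_rrds() {
    ramdir="$1"     # tmpfs mount, e.g. /mnt/ganglia-tmpfs/rrds
    diskdir="$2"    # persistent copy, e.g. /var/lib/ganglia/rrds
    mkdir -p "$diskdir"
    # cp -a preserves timestamps and permissions; rsync -a --delete is a
    # common alternative that also removes RRDs deleted from tmpfs.
    cp -a "$ramdir/." "$diskdir/"
}

# Illustrative crontab entry (every 10 minutes):
#   */10 * * * * root /usr/local/sbin/sync-rrds.sh
# At boot, the inverse copy seeds tmpfs from the last on-disk sync before
# gmetad starts.
```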