Hello,

I'm working on a project to use ganglia to monitor a cluster of application
servers. Time permitting, I'm interested in any thoughts you can share on
the following issues.

1) There are disk_total, disk_free and part_max_used metrics defined for
Linux. We're building a storage-heavy cluster, with multiple volumes on each
node. Are these three metrics sufficient? Can one report a vector metric? I
suppose we could do that as text. Are there any other considerations you can
think of for monitoring multiple disks?
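
For what it's worth, here's roughly what I meant by reporting separate scalars
rather than a vector: a small Python sketch that walks a list of mount points
and emits one disk_free_<volume> metric per volume via gmetric. The mount
list is just an assumption for the example, and the per-volume metric names
are my own invention, not metrics ganglia defines:

    import os
    import subprocess

    # Volumes to report on; this layout is an assumption for the example.
    MOUNTS = ["/", "/data1", "/data2"]

    for mount in MOUNTS:
        st = os.statvfs(mount)
        free_gb = st.f_bavail * st.f_frsize / 1e9
        suffix = mount.strip("/").replace("/", "_") or "root"
        # One scalar metric per volume, e.g. disk_free_data1.
        subprocess.call(["gmetric",
                         "--name", "disk_free_%s" % suffix,
                         "--value", "%.2f" % free_gb,
                         "--type", "double",
                         "--units", "GB"])

Run from cron, that would give each volume its own time series without any
vector support, at the cost of a metric count that grows with the number of
disks.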

2) I mentioned this in a previous post, and I appreciate Matt pointing out
that I can add any metrics I need via gmetric. What guiding principles
should one use in adding failure metrics?
Are failures rare enough that it isn't worth monitoring them directly?
Do we not care about failures per se, because the real tip-off is a loss of
application performance?
Do failures typically show up in metrics that are already present (e.g., if a
disk fails)?
What other thoughts have you had about failure metrics?
What top three failure metrics would you like to see for your clusters?
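
To make the question concrete, here's the sort of thing I'm imagining for a
failure metric: a cron-driven Python sketch that counts kernel I/O error
lines in the syslog and reports the count through gmetric. The log path and
the "I/O error" pattern are placeholders for whatever failure signal a node
actually exposes:

    import subprocess

    LOG_PATH = "/var/log/messages"  # assumed location; varies by distro

    def count_io_errors(path):
        # Crude failure signal: count kernel I/O error lines in the log.
        try:
            with open(path, errors="replace") as log:
                return sum(1 for line in log if "I/O error" in line)
        except OSError:
            return 0  # unreadable log: report zero rather than die

    subprocess.call(["gmetric",
                     "--name", "disk_io_errors",
                     "--value", str(count_io_errors(LOG_PATH)),
                     "--type", "uint32",
                     "--units", "count"])

Whether something this crude is better than just watching the existing
performance metrics is exactly what I'm asking about.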

3) Are you familiar with the NGOP project at FermiLab? If so, do you have
any quick comments about that project vs. ganglia? You can find the Users
Guide at http://www-isd.fnal.gov/ngop/. I don't know if it is open source.

Thanks for your thoughts.

Jonathan
