Bernard Li wrote:
>Instead of running gmetad on multiple ports of one server, have you
>tried running gmetad on the cluster headnodes and then have them
>aggregate the data to the current gmetad server? You can do this by
>having a data_source entry like the following:
>
># data_source "my grid" 50 grid.org:8651
>
>Where grid.org:8651 refers to gmetad running on the headnode of one of
>your clusters.
>
>(This is what I refer to as federation)
>
>Cheers,
>
>Bernard
>
>
No, I haven't tried that. Can you explain why you think that might be
better? Maybe we can fix it another way.
This suggestion is less than ideal for us because the clusters consist of
generic compute machines; aside from what ganglia wants to enforce, none
of them is any more special than the others. There are no natural
headnodes, and running gmetad on one of them would compromise the
robustness of the architecture. They get rebooted, overloaded,
reinstalled, etc. I'd prefer not to make them more special
than they are now; that would add new management issues.
I don't claim to understand the code, but in gmetad.c there's this
loop:
   for(;;)
      {
         /* Do at a random interval between 10 and 30 sec. */
         sleep_time = 10 + ((30-10)*1.0) * rand()/(RAND_MAX + 1.0);
         sleep(sleep_time);

         /* Flush the old values */
         hash_foreach(root.metric_summary, zero_out_summary, NULL);
         root.hosts_up = 0;
         root.hosts_down = 0;

         /* Sum the new values */
         hash_foreach(root.authority, do_root_summary, NULL );

         /* Save them to RRD */
         hash_foreach(root.metric_summary, write_root_summary, NULL);
      }
My understanding is that the PHP code connects to the running gmetad and
asks it for data at arbitrary times. What happens if it asks after
root.hosts_up is set to 0, and before do_root_summary returns, or
before it is called at all? I see a mutex inside do_root_summary, and
READ_LOCK in hash_foreach. Is one of these used to block readers
during this period? If so, what protects the readers outside it?
(I've seen zeroed summaries, summaries that match just one of the clusters,
and some other numbers...)
(Am I even looking in the right place???)
_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general