Bugs item #639482, was opened at 2002-11-16 17:57
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=434892&aid=639482&group_id=43021

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jeff Squyres (jsquyres)
Assigned to: Nobody/Anonymous (nobody)
Summary: ganglia reports live nodes are dead

Initial Comment:
Setup:
- Tested on RH 7.2 and 7.3
- Installations in vmware with OSCAR 2.0 (CVS, to be
released shortly...) with Steve Duchene's packaging of
ganglia 2.5.0
- One "head" node and 2 "cluster" nodes
- Ganglia web page is served off the head node

After an initial install / boot of the cluster nodes,
all three nodes appear on the ganglia web page as healthy.

A short time later, the two cluster nodes are marked as
down.  This is a repeatable problem.

Restarting the gmond on the head node ("service gmond
restart") causes the two cluster nodes to disappear
from the ganglia web page.

Restarting the gmond on the two cluster nodes makes
them appear up and healthy on the ganglia web page, and
they don't seem to disappear, even after [relatively]
long periods of uptime. Specifically, heartbeats appear
to come in at regular intervals (according to the
ganglia web page).

When a cluster node is marked as down, rebooting it
makes it show as up briefly, but it is eventually
marked dead again (i.e., no heartbeat for over 60
seconds).

All of these cases definitely exhibit a few heartbeats
in the beginning (I don't know how many, but it's at
least 1), but then the heartbeats inexplicably stop,
and therefore the nodes get marked down.
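
For reference, the "marked down" condition described above amounts to
comparing the time since a host's last heartbeat against a threshold.
A minimal sketch of that check (the 60-second threshold is the one
mentioned in this report; the function name is hypothetical, not
ganglia's actual code):

```python
import time

# Threshold from the report: no heartbeat for over 60 seconds => dead.
DEAD_THRESHOLD_SECS = 60

def host_is_dead(last_heartbeat, now=None, threshold=DEAD_THRESHOLD_SECS):
    """Return True if the host's last heartbeat is older than the threshold.

    last_heartbeat and now are UNIX timestamps (seconds).
    """
    if now is None:
        now = time.time()
    return (now - last_heartbeat) > threshold
```

So a node that sends one or two heartbeats and then goes silent will be
reported healthy for about a minute and then flip to dead, which matches
the behavior above.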

We have sporadically seen this problem on real machine
installs as well (vs. vmware installs).  I don't have
any firm data on that, although the reports came from
credible testing personnel.

I don't know if this is a vmware-ism, or an OSCAR-ism,
or something else.  This problem has plagued OSCAR for
quite a while now, and any insight that you guys could
provide would be most helpful.

One note about vmware installs: the clock on the cluster
nodes is way off (compared to the head node), and there
doesn't seem to be a way to fix it.  The head node has
the correct time (set by ntp from an external server),
but the two cluster nodes are never right, and resist all
attempts to change their time (something in vmware
appears to repeatedly and aggressively reset the clock
to a value far from reality).  I am using vmware
workstation version 3.1.

I'd be happy to offer temporary access to machines
(i.e., vmware instances) if any ganglia developers
would benefit from poking around to see what's going wrong.

----------------------------------------------------------------------

>Comment By: Jeff Squyres (jsquyres)
Date: 2002-12-01 23:18

Message:
Logged In: YES 
user_id=11722

That does not explain, however, why it *sometimes* works
and sometimes doesn't.

For example: after a reboot (where the gmonds are
automatically started), we see the one-and-only-one ping
behavior.  But if I manually restart the gmonds, all is
well (i.e., pings happen continuously and regularly).

Any other ideas?

----------------------------------------------------------------------

Comment By: Federico David Sacerdoti (sacerdoti)
Date: 2002-11-30 20:12

Message:
Logged In: YES 
user_id=581045

It sounds like a problem we have seen before. In our case, the ethernet 
switch did not support multicast and the IGMP protocol. It would stifle 
all multicast packets except the first one, and we would see the same 
behavior you describe.
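
This theory can be checked with a small multicast probe.  A sketch
(assumptions: the group/port below are gmond 2.5's stock defaults --
adjust to whatever mcast_channel/mcast_port your gmond.conf uses; the
function names are mine, not part of ganglia):

```python
import socket

# Assumed gmond 2.5 stock multicast channel; check your gmond.conf.
MCAST_GRP = "239.2.11.71"
MCAST_PORT = 8649

def make_receiver(iface="0.0.0.0"):
    """Open a socket joined to the multicast group on the given interface."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", MCAST_PORT))
    # Join the group: group address followed by local interface address.
    mreq = socket.inet_aton(MCAST_GRP) + socket.inet_aton(iface)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    s.settimeout(5.0)
    return s

def send_probe(payload=b"mcast-probe", iface="0.0.0.0", ttl=1):
    """Send a single multicast datagram to the gmond channel."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
                 socket.inet_aton(iface))
    s.sendto(payload, (MCAST_GRP, MCAST_PORT))
    s.close()
```

Run make_receiver() on one cluster node (with gmond stopped, since it
holds the port) and call send_probe() repeatedly from another node.  If
only the first probe arrives, the switch is dropping subsequent multicast
packets, consistent with the behavior described above.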

Hope this helps,
Federico

----------------------------------------------------------------------

Comment By: Jeff Squyres (jsquyres)
Date: 2002-11-29 10:21

Message:
Logged In: YES 
user_id=11722

It's been about 2 weeks -- does anyone have any insight on
this problem?

Thanks.

----------------------------------------------------------------------

Comment By: Jeff Squyres (jsquyres)
Date: 2002-11-16 18:00

Message:
Logged In: YES 
user_id=11722

Correction: this is actually ganglia 2.5.1, not 2.5.0. 
Sorry for the confusion.

----------------------------------------------------------------------
