Bugs item #639482, was opened at 2002-11-16 17:57
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=434892&aid=639482&group_id=43021

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jeff Squyres (jsquyres)
Assigned to: Nobody/Anonymous (nobody)
Summary: ganglia reports live nodes are dead

Initial Comment:
Setup:
- Tested on RH 7.2 and 7.3
- Installations in vmware with OSCAR 2.0 (CVS, to be
released shortly...) with Steve Duchene's packaging of
ganglia 2.5.0
- One "head" node and 2 "cluster" nodes
- Ganglia web page is served off the head node

After an initial install / boot of the cluster nodes,
all three nodes appear on the ganglia web page as healthy.

A short time later, the two cluster nodes are marked as
down.  This is a repeatable problem.

Restarting the gmond on the head node ("service gmond
restart") causes the two cluster nodes to disappear
from the ganglia web page.

Restarting the gmond on the two cluster nodes makes
them appear up and healthy on the ganglia web page, and
they don't seem to disappear, even after [relatively]
long periods of uptime. Specifically, heartbeats appear
to come in at regular intervals (according to the
ganglia web page).

When a cluster node is marked as down, rebooting it
makes it show up as healthy briefly, but it is
eventually marked dead again (i.e., no heartbeat for
over 60 seconds).

All of these cases definitely exhibit a few heartbeats
in the beginning (I don't know how many, but it's at
least 1), but then the heartbeats inexplicably stop,
and therefore the nodes get marked down.

We have randomly seen this problem on real-machine
installs as well (vs. vmware installs).  I don't have
any firm data on that, but the reports came from
credible testing personnel.

I don't know if this is a vmware-ism, or an OSCAR-ism,
or something else.  This problem has plagued OSCAR for
quite a while now, and any insight that you guys could
provide would be most helpful.

One note about vmware installs: the time on the cluster
nodes is way off (compared to the head node), and there
doesn't seem to be a way to fix it.  The head node has
the correct time (set by ntp to an external server),
but the two nodes are never right, and resist all
attempts to change their time (there appears to be
something in vmware that repeatedly and aggressively
sets the time to something way off from reality).  I am
using vmware workstation version 3.1.

I'd be happy to offer temporary access to machines
(i.e., vmware instances) if any ganglia developers
would benefit from poking around to see what's going wrong.

----------------------------------------------------------------------

>Comment By: Jeff Squyres (jsquyres)
Date: 2003-01-13 09:13

Message:
Logged In: YES 
user_id=11722

After iterating a little in e-mail with Matt and diving in
the gmond source code, I discovered that this is *NOT* a
multicasting problem.  It is also not specific to vmware (we
use vmware for cluster configuration testing, which is why
this problem showed up: vmware almost always triggers
this problem because of particular brain-deadness in vmware
;-) ).

What happens is the following:

- gmond launches at boot time (at time T=Tstart)
- gmond establishes time thresholds as to when to send the
next heartbeats (i.e., when T >= Tstart + small_number)
- the system time gets reset to some time far in the past
(so now T = Tstart - some_huge_value)
- gmond's time thresholds will still fire when T >= Tstart +
small_number, but now that will be a long time from now
(specifically, (some_huge_value + small_number) seconds
from now -- a value which is certainly greater than the
maximum allowed heartbeat timeout).  This effectively makes
the gmond go silent for a long time, and the head node marks
the node down.
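In code, the threshold scheme described above can be sketched roughly like this (a hypothetical simplification for illustration, not actual gmond source; the names are made up):

```c
#include <assert.h>
#include <time.h>

/* Hypothetical sketch of the heartbeat-threshold scheme: a heartbeat is
 * sent when the current time reaches an absolute deadline, and the
 * deadline is then re-armed relative to "now". */
typedef struct {
    time_t next_send;   /* absolute time the next heartbeat is due */
    int    interval;    /* seconds between heartbeats */
} heartbeat_t;

int heartbeat_due(heartbeat_t *hb, time_t now)
{
    if (now >= hb->next_send) {
        hb->next_send = now + hb->interval;
        return 1;       /* send a heartbeat */
    }
    return 0;           /* stay silent */
}
```

If the system clock jumps backwards after the deadline is armed, `now` sits far below `next_send`, so `heartbeat_due` keeps returning 0 for far longer than the 60-second timeout, which is exactly the silence described above.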

The system time typically changes for one of three reasons:

1. NTP changes it
2. A human system admin changes it
3. vmware resets it

Hence, this is not a vmware-specific problem.  For example,
on machines in time zones where time resets back an hour
every year, ganglia will mark them all down for an hour when
NTP resets the time.

The fix is pretty simple -- in the main loop for the client
gmond, simply watch for the return value of time() to go
backwards.  If it does, manually trigger sending a new
heartbeat and resetting all time thresholds.  I sent a patch
to Matt; he indicated that he would try to work it into a
next minor release (if there is one), and that Ganglia 3.x's
different design doesn't suffer from this problem.
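The backward-time check can be sketched as a small helper like this (hypothetical names; the actual patch sent to Matt may differ):

```c
#include <assert.h>
#include <time.h>

/* Returns 1 if "now" is earlier than the previously observed time,
 * i.e. the system clock has gone backwards; updates *last either way.
 * Call once per trip through the main loop with now = time(NULL). */
int clock_went_backwards(time_t *last, time_t now)
{
    int backwards = (now < *last);
    *last = now;
    return backwards;
}
```

When this returns 1, the loop would immediately send a fresh heartbeat and re-arm every time threshold relative to the new value of `now`.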

So I think the problem is solved -- just wanted to update
the bug on SF.

----------------------------------------------------------------------

Comment By: Jeff Squyres (jsquyres)
Date: 2002-12-20 10:02

Message:
Logged In: YES 
user_id=11722

Greetings.  It's been another 1.5 weeks -- does anyone have
any insight on this problem?

As requested, my tcpdump output is attached to this bug (it
doesn't come across in the e-mail -- you have to visit the
web page and scroll down to the bottom).

Thanks.

----------------------------------------------------------------------

Comment By: Jeff Squyres (jsquyres)
Date: 2002-12-09 12:50

Message:
Logged In: YES 
user_id=11722

Forgot to check the SF "upload and attach file" checkbox, so
the file didn't attach last time -- sorry.  Here it is.

----------------------------------------------------------------------

Comment By: Jeff Squyres (jsquyres)
Date: 2002-12-09 12:49

Message:
Logged In: YES 
user_id=11722

Same setup: 3 vmware nodes, RH 7.3, one "head" node, two
"client" nodes.

When all is well (i.e., nodes are marked as up, etc.),
"tcpdump ether multicast" reports lots of multicast activity
from all three nodes.

However, sometimes the multicast activity mysteriously stops
from the client nodes.  That is, there was activity for a
while, and then it just stops.  I say this to clarify
previous comments -- there wasn't just "one" multicast ping,
there seemed to be a bunch, and then it just stopped.  

I ran "tcpdump ether multicast" on the head node the whole
time and observed multicast activity from the head node
during the entire timeframe.  When the multicast activity
from the client nodes stopped, I logged in and ran "tcpdump
ether multicast" on each node.  I was working on the
assumption that the client nodes would still be sending
multicast data, but because of switching issues, somehow
would not "see" the head node.  

Unfortunately, this was not the case; the client nodes could
still see all the multicast activity from the head node --
they just weren't generating any.  I should note that the
client nodes were sending/receiving ARPs during this time --
so it looks like their multicast capability and their
interaction with the switch is still ok.  ps on the client
nodes confirmed that the gmond is still running.

I have attached the output (from the head node) of "tcpdump
ether multicast" showing activity from the head node
(queegvm.oscar.vmware), followed by the boot up activity of
oscarnode2.oscar.vmware, several multicast ganglia "pings"
from oscarnode2, and then it eventually stops.  Note that
after oscarnode2's ganglia multicast activity stops, there's
still one more ARP from oscarnode2.  So it seems as if
oscarnode2 is still successfully multicasting -- it's just
that its gmond went silent.  At least, that's a guess...

Is there something else that I should check?

----------------------------------------------------------------------

Comment By: Jeff Squyres (jsquyres)
Date: 2002-12-04 16:24

Message:
Logged In: YES 
user_id=11722

Hi Mason -- thanks for the reply.

Please re-read my initial description and my replies: after
an initial failure (i.e., one-and-only-one ping), if I
restart the gmonds, ganglia works fine (i.e., lots of pings
on a regular basis).  I personally have only tested under
vmware, but others have tested on real clusters and are
seeing the same behavior.

I'm not much of a multicast programmer (read: not at all). 
Is there some kind of canonical test program that I can use?

In the mean time, I'll check and see what tcpdump is saying
about multicast behavior and get back to you.

----------------------------------------------------------------------

Comment By: Mason Katz (masonkatz)
Date: 2002-12-04 16:18

Message:
Logged In: YES 
user_id=463741

We've seen several switches that allow only the first multicast message out of 
the switch and then drop everything else because IGMP is not configured by 
default.  Unfortunately the configuration is different on every switch, but we 
have yet to find one that can't be made to work.

If this is just a VMWare cluster problem then first build a real cluster and 
make sure the software works for you the way everyone else uses it.  Then 
figure out what the VMWare virtual network layer is actually doing.

Is this only a problem on virtual machines all on the same physical host?  Is 
this a problem with a real cluster on nodes all running a single session of 
VMWare?

I know VMWare has several different network models.  Maybe you've just selected 
the wrong one.  It's tricky since the multicast support probably needs to be in 
the VMWare loop-back layer.  If it isn't, the switch your machine is connected 
to will need to be configured correctly.  Maybe this is the problem: are you 
building virtual clusters without a physical network?

I'd suggest you write a simple multicast client/server application and debug the 
multicast layer in VMWare.  Multicast is a standard and even really bad 
switches should just revert to broadcast.
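A minimal sender along those lines might look like the following sketch (the helper names, group, and port here are illustrative placeholders, not necessarily the channel gmond is configured with; a matching receiver would bind the port and join the group with IP_ADD_MEMBERSHIP; error handling trimmed):

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill in a sockaddr_in for a multicast group:port pair, rejecting
 * addresses outside the class-D multicast range 224.0.0.0-239.255.255.255. */
int make_mcast_addr(struct sockaddr_in *addr, const char *group, int port)
{
    memset(addr, 0, sizeof(*addr));
    addr->sin_family = AF_INET;
    addr->sin_port = htons(port);
    if (inet_pton(AF_INET, group, &addr->sin_addr) != 1)
        return -1;
    unsigned long ip = ntohl(addr->sin_addr.s_addr);
    if ((ip >> 28) != 0xE)      /* high nibble 1110 => class D */
        return -1;
    return 0;
}

/* Sender sketch: open a UDP socket and send one datagram to the group. */
int mcast_ping(const char *group, int port, const char *msg)
{
    struct sockaddr_in addr;
    if (make_mcast_addr(&addr, group, port) != 0)
        return -1;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    ssize_t n = sendto(fd, msg, strlen(msg), 0,
                       (struct sockaddr *)&addr, sizeof(addr));
    close(fd);
    return n < 0 ? -1 : 0;
}
```

Running the sender on a client node while "tcpdump ether multicast" watches from the head node would show directly whether the switch (or the VMWare network layer) is passing multicast traffic independently of gmond.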

Lastly, what does tcpdump show is going on?

        -mjk (only builds physical clusters)


----------------------------------------------------------------------

Comment By: Jeff Squyres (jsquyres)
Date: 2002-12-02 18:29

Message:
Logged In: YES 
user_id=11722

Please re-read my initial description -- I am using vmware.

There have been reliable reports of others using real
machines that run into the same problems.  I'm quite sure
that they were all using multicast capable switches.

But even if the switch was the problem, wouldn't ganglia
consistently fail?  As opposed to inconsistent behavior?

----------------------------------------------------------------------

Comment By: Federico David Sacerdoti (sacerdoti)
Date: 2002-12-02 11:24

Message:
Logged In: YES 
user_id=581045

I hear what you are saying about the intermittent failure. However, if you 
could positively verify that your ethernet switch does support IP 
multicast and IGMP, and that those features are turned on, it would help 
us come up with further tests.

Run a crossover cable between a frontend and a compute node, so as 
to bypass the switch, and test again. Do you see the same problem?

I don't believe it is a ganglia software problem, as we here at SDSC have 
10+ clusters running ganglia 2.5.1 correctly. We have never seen this 
problem except during initial cluster setup, and the culprit was always 
the switch.

Good luck,
Federico

----------------------------------------------------------------------

Comment By: Jeff Squyres (jsquyres)
Date: 2002-12-01 23:18

Message:
Logged In: YES 
user_id=11722

It does not explain, however, why *sometimes* it works, and
sometimes it doesn't. 

For example: after a reboot (where the gmond's are
automatically started), we see the one-and-only-one ping
behavior.  But if I go manually restart the gmonds, all is
well (i.e., pings happen continuously and regularly).

Any other ideas?

----------------------------------------------------------------------

Comment By: Federico David Sacerdoti (sacerdoti)
Date: 2002-11-30 20:12

Message:
Logged In: YES 
user_id=581045

It sounds like a problem we have seen before. In our case, the ethernet 
switch did not support multicast and the IGMP protocol. It would stifle 
all multicast packets except the first one, and we would see the same 
behavior you describe.

Hope this helps,
Federico

----------------------------------------------------------------------

Comment By: Jeff Squyres (jsquyres)
Date: 2002-11-29 10:21

Message:
Logged In: YES 
user_id=11722

It's been about 2 weeks -- does anyone have any insight on
this problem?

Thanks.

----------------------------------------------------------------------

Comment By: Jeff Squyres (jsquyres)
Date: 2002-11-16 18:00

Message:
Logged In: YES 
user_id=11722

Correction: this is actually ganglia 2.5.1, not 2.5.0. 
Sorry for the confusion.

----------------------------------------------------------------------
