Hi,

In the Rio project there is the concept of a fault detection handler (FDH), which is used to determine the reachability of a service. The service provides the FDH; clients use it to determine whether a service is indeed at the far end of the connection. In the case of an event producer, this aligns with option (1) below.

In general, the FDH approach has served quite well: it provides a pluggable mechanism that can be built to use any technique (ping, heartbeat multicast, lease ...), specific to each service as needed.
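To illustrate the pluggable idea, here is a minimal sketch of what such a handler could look like. The interface and class names below are hypothetical placeholders, not the actual Rio API; the point is only that the reachability technique is swappable per service.

```java
// Hypothetical sketch of a pluggable fault detection handler (FDH).
// Names are illustrative only, NOT the actual Rio API.
interface FaultDetectionHandler {
    // Returns true if the service appears to be at the far end of the
    // connection; the technique (ping, heartbeat, lease check) is up to
    // the implementation the service supplies.
    boolean isReachable();
}

// A trivial ping-style implementation for demonstration: delegates to a
// supplied probe so any transport-specific check can be plugged in.
class PingFaultDetectionHandler implements FaultDetectionHandler {
    private final java.util.function.BooleanSupplier ping;

    PingFaultDetectionHandler(java.util.function.BooleanSupplier ping) {
        this.ping = ping;
    }

    public boolean isReachable() {
        return ping.getAsBoolean();
    }
}

public class FdhDemo {
    public static void main(String[] args) {
        // A probe that always succeeds, standing in for a real ping.
        FaultDetectionHandler fdh = new PingFaultDetectionHandler(() -> true);
        System.out.println(fdh.isReachable() ? "reachable" : "unreachable");
    }
}
```

The service ships the implementation; the client just calls `isReachable()`, which is what keeps the technique service-specific.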

Regards

Dennis

On May 29, 2007, at 3:49 PM, Dan Creswell wrote:

Hi all,

This started with a discussion in the "Javaspaces.notify() not reliable" thread, and I've now had a bit more time to formulate my thoughts.

Without this extra feature we do something like the following in the client:

(1)     Set up a watchdog timer with a suitable expiry.
(2)     On receiving a remote event, reset the watchdog timer.
(3)     If the timer expires, check whether the source is still alive and
whether we might have missed an event.
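The steps above can be sketched as a small client-side watchdog. This is a minimal illustration, not anything from the JavaSpaces/Jini APIs: the class name, the `onExpiry` callback, and the timeout value are all assumptions; in practice the expiry action would be the ping/missed-event check from step (3).

```java
import java.util.concurrent.*;

// Illustrative client-side watchdog (names hypothetical): every remote
// event resets the timer; if it expires, the expiry action runs, which
// would typically ping the source and check for missed events.
class EventWatchdog {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final long timeoutMs;
    private final Runnable onExpiry; // e.g. ping source, check sequence nos.
    private ScheduledFuture<?> pending;

    EventWatchdog(long timeoutMs, Runnable onExpiry) {
        this.timeoutMs = timeoutMs;
        this.onExpiry = onExpiry;
        reset(); // (1) arm the timer with a suitable expiry
    }

    // (2) Called whenever a remote event arrives: cancel the pending
    // expiry and re-arm the timer.
    synchronized void reset() {
        if (pending != null) {
            pending.cancel(false);
        }
        pending = scheduler.schedule(onExpiry, timeoutMs, TimeUnit.MILLISECONDS);
    }

    void shutdown() {
        scheduler.shutdownNow();
    }
}

public class WatchdogDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch expiredLatch = new CountDownLatch(1);
        EventWatchdog watchdog = new EventWatchdog(100, expiredLatch::countDown);

        watchdog.reset(); // simulate a remote event arriving

        // No further events arrive, so (3) the timer eventually expires.
        boolean expired = expiredLatch.await(2, TimeUnit.SECONDS);
        System.out.println(expired ? "expired" : "no expiry");
        watchdog.shutdown();
    }
}
```

Note the load profile: all the timer state lives in the client, and the server is only touched when an expiry actually triggers a ping.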

What's being proposed, if I understand correctly, is that the source, if
it's alive but hasn't generated events in a particular time period,
confirms that by posting a SourceAliveRemoteEvent to the client.

This would potentially change the above client code so that a
SourceAliveRemoteEvent (SARE) on its own also resets the timer.
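The client-side change is small, which is point (3) below. Here is a runnable sketch of that dispatch logic; the event classes are placeholders standing in for the Jini types, not the real API, and the "watchdog reset" and "delivery" are modeled as counters for clarity.

```java
// Placeholder event types standing in for the Jini classes; NOT the
// real net.jini API, just enough to show the dispatch change.
class RemoteEvent {}
class SourceAliveRemoteEvent extends RemoteEvent {}

public class SareClientDemo {
    static int resets = 0;    // stands in for watchdog.reset()
    static int delivered = 0; // stands in for delivery to the application

    static void onEvent(RemoteEvent ev) {
        // Under the proposal, EVERY event resets the watchdog, because a
        // SARE and a real event both prove the source is alive...
        resets++;
        // ...but only real events carry data for the application.
        if (!(ev instanceof SourceAliveRemoteEvent)) {
            delivered++;
        }
    }

    public static void main(String[] args) {
        onEvent(new RemoteEvent());            // a real event
        onEvent(new SourceAliveRemoteEvent()); // a keep-alive from the source
        System.out.println(resets + " resets, " + delivered + " delivered");
    }
}
```

So the only behavioral difference from the original scheme is the extra `instanceof` branch: the watchdog reset condition widens, nothing else changes.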

Things of note:

(1)     The original solution places the responsibility and load on the
client (bar the pinging of the server). This naturally scales out quite
well, as the server only has to respond to pings, and chances are a
client only maintains timers for a few services. If client timeouts are
tuned appropriately to event frequency/typical pause, pings will be rare.

(2)     The new solution places much of the responsibility with the server.
I believe there may be a scaling problem here.  In contrast to the
client-side approach, a server might have a large number of clients to
cope with.  This potentially means the server carries significant load
tracking a large number of timer events for all its clients and posting
SAREs in addition to what it already does.

(3)     The only difference between the old and new approaches from a
client coding perspective is what causes a reset of the watchdog timer.

(4)     SAREs, like any other event, can be lost. If one is lost, the
client watchdog will trigger just as it would in the old approach, given
sufficient time between RemoteEvents.

(5)     If the source has sent events but they've been lost, it won't send
an SARE and, again, the client watchdog will time out and ping.

Based on the above, it seems to me that whilst an SARE might save a few
pings, there's additional complexity and greater server load.  If I've
missed some subtleties, please shout, because right now I don't see
enough benefit in this to justify the "pain".

Thoughts?

Dan.
