Improving the loadbalancing support

Hubert, Eric Wed, 04 Jun 2008 12:27:09 -0700

Hi all,

Today I would like to discuss some options for improving the load
balancing support of Apache Synapse. Not all of my ideas have settled
yet, I may be missing some pieces of the current implementation, and I
would like to get some feedback first, so I decided not to create a
JIRA issue immediately. After our discussion I would like to summarize
the results in a JIRA issue.


1)
Where can I review the status of the endpoints of a loadbalance group?
It should be possible to query the status of each endpoint via JMX. It
should also be possible to get the number of configured as well as
active endpoints of a load balance group via JMX. This way it will be
possible to use some meaningful external monitoring. For example, the
user could define an alert if only 2 nodes are left or if the ratio of
available nodes is less than 20%, or something like this.
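
To make this a bit more concrete, here is a minimal sketch of how such a
status view could be exposed through plain JMX. The interface
GroupStatus, the class names and the ObjectName domain "example.synapse"
are all made up for illustration; none of them exist in Synapse today.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

public class LoadBalanceGroupJmxSketch {

    /** Hypothetical management interface, not an existing Synapse API. */
    public interface GroupStatus {
        int getConfiguredEndpointCount();
        int getActiveEndpointCount();
        int getActivePercentage();
        boolean isEndpointActive(String endpointName);
    }

    /** Keeps one active/inactive flag per endpoint name. */
    public static class GroupStatusImpl implements GroupStatus {
        private final Map<String, Boolean> activeByName = new ConcurrentHashMap<>();

        /** Would be called by the load balancer whenever an endpoint changes state. */
        public void setActive(String endpointName, boolean active) {
            activeByName.put(endpointName, active);
        }

        public int getConfiguredEndpointCount() {
            return activeByName.size();
        }

        public int getActiveEndpointCount() {
            int count = 0;
            for (boolean active : activeByName.values()) {
                if (active) {
                    count++;
                }
            }
            return count;
        }

        /** Share of active endpoints in percent, rounded down; 0 for an empty group. */
        public int getActivePercentage() {
            int total = getConfiguredEndpointCount();
            return total == 0 ? 0 : (100 * getActiveEndpointCount()) / total;
        }

        public boolean isEndpointActive(String endpointName) {
            return Boolean.TRUE.equals(activeByName.get(endpointName));
        }
    }

    /** Registers the view with the platform MBean server under an assumed name. */
    public static ObjectName register(GroupStatus status, String groupName) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName(
                "example.synapse:type=LoadBalanceGroup,name=" + ObjectName.quote(groupName));
        server.registerMBean(new StandardMBean(status, GroupStatus.class), name);
        return name;
    }
}
```

An external monitoring tool could then poll ActiveEndpointCount or
ActivePercentage and raise the alert described above.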

2)
Another very useful feature would be the possibility to manually
deactivate an endpoint of a load balancing group. If I understand it
correctly, right now you have to remove the endpoint from the group and
restart your server (or your cluster, gracefully). Not very nice. To
implement this, it might make sense to distinguish between three states: "active",
"deactivated" and "manually deactivated". A manually deactivated
endpoint can only be manually reactivated. Automatic retry will not be
used for endpoints in that state.
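
A rough sketch of the state handling I have in mind follows. The class
and method names are invented for this mail and do not refer to
existing Synapse code; the only point is that the automatic retry path
never touches a manually deactivated endpoint.

```java
public class EndpointStateSketch {

    /** The three states proposed above. */
    public enum State { ACTIVE, DEACTIVATED, MANUALLY_DEACTIVATED }

    private State state = State.ACTIVE;

    public synchronized State getState() {
        return state;
    }

    /** Failure detection: only an active endpoint becomes (automatically) deactivated. */
    public synchronized void onFailure() {
        if (state == State.ACTIVE) {
            state = State.DEACTIVATED;
        }
    }

    /**
     * Automatic retry after the suspend period. Returns true if the endpoint
     * was reactivated; manually deactivated endpoints are left untouched.
     */
    public synchronized boolean tryAutomaticReactivation() {
        if (state == State.DEACTIVATED) {
            state = State.ACTIVE;
            return true;
        }
        return false;
    }

    /** Administrator action, for example triggered via JMX. */
    public synchronized void deactivateManually() {
        state = State.MANUALLY_DEACTIVATED;
    }

    /** Administrator action; the only way out of MANUALLY_DEACTIVATED. */
    public synchronized void activateManually() {
        state = State.ACTIVE;
    }
}
```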

3)
Why did you choose to interpret a missing suspendDurationOnFailure as
meaning that the endpoint will never be recovered after a failure? This
does not match my intuition and expectations. Is this really a good
default value? When does a user ever want to have this effect? Do I
understand this wrong, or does the user have to restart the ESB to
change the status back to "active"?

4)
A static, configurable value for suspendDurationOnFailure is better than
having a hardcoded value, but it is still not optimal. The user always
has to balance between different side effects, depending on the cause
of the service outage. If you think about short network instabilities
and you have a small cluster (think of two nodes), you are more or less
forced to keep the check interval rather short. If a service then fails
for some other reason and for a long period of time, this has a
negative impact on performance, as the retries happen too often.
It would be much better to use a dynamic approach with a changing check
interval. Start frequently (short interval of a few seconds) and
increase this up to a maximum value based on the number of tries. Maybe
one could come up with a general purpose function, where the user can
specify the arguments. This should allow preserving the existing
behaviour while also supporting better suited algorithms.
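
One possible shape for such a function is a capped exponential backoff.
This is only a sketch under my own assumptions: the parameter names
(initialMillis, factor, maxMillis) are hypothetical and not existing
Synapse configuration attributes.

```java
public class SuspendBackoffSketch {

    /**
     * Computes how long an endpoint stays suspended before the next retry.
     *
     * @param attempt       number of retries that have already failed (0 for the first suspension)
     * @param initialMillis suspend duration used for the first retry
     * @param factor        growth factor per failed retry; 1.0 gives a constant interval
     * @param maxMillis     upper bound for the suspend duration
     */
    public static long suspendMillis(int attempt, long initialMillis, double factor, long maxMillis) {
        if (attempt < 0 || initialMillis < 0 || factor < 1.0 || maxMillis < 0) {
            throw new IllegalArgumentException("invalid backoff arguments");
        }
        double delay = initialMillis * Math.pow(factor, attempt);
        // Cap the result; this also handles overflow of the power to infinity.
        return (long) Math.min(delay, (double) maxMillis);
    }
}
```

With factor 1.0 and initialMillis equal to the currently configured
duration, the function returns a constant interval, so the existing
behaviour would be preserved. With, say, an initial value of 2 seconds,
factor 2.0 and a maximum of 60 seconds, the intervals would grow as
2, 4, 8, 16, 32, 60, 60, ... seconds.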

5)
When *all* nodes are inactive, the ESB currently creates a fault
immediately. I'm wondering whether this makes sense or not. Maybe it
would be best if the user could decide between two options:
a) current behaviour
b) first try all inactive endpoints until either one endpoint works, or
all endpoints have been tried out once and only then issue a fault
I'm not sure about this one, but the following happened during a test of
a minimal service cluster with two nodes. The suspendDurationOnFailure
had been set to 60 seconds. The first node was passive due to some
maintenance, so all requests were served by the second node. Suddenly a
short network outage happened, and the second node was marked as
deactivated. It was reachable again within the next second, but the ESB
still treated it as passive. In effect the whole system was down for one
minute. So you have to think about a shorter check interval, which in
turn would be bad for the server that is down for maintenance. If the
ESB had done one additional round of retries, it would have detected
that the endpoint was in fact already up again.
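
Option b) could look roughly like the sketch below. Everything in it is
hypothetical: the flag retrySuspendedBeforeFault stands for a possible
new configuration switch, and trySend stands for whatever actually
dispatches the message to an endpoint and reports success or failure.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

public class LastChanceRetrySketch {

    /**
     * Called when no active endpoint is left in the group.
     *
     * @param suspended                 endpoints that are currently suspended
     * @param retrySuspendedBeforeFault false = option a) (fault immediately), true = option b)
     * @param trySend                   attempts to send to one endpoint, returns true on success
     * @return the endpoint that accepted the message, or empty if a fault should be issued
     */
    public static <E> Optional<E> sendWithLastChance(
            List<E> suspended, boolean retrySuspendedBeforeFault, Predicate<E> trySend) {
        if (!retrySuspendedBeforeFault) {
            return Optional.empty();
        }
        // Try every suspended endpoint exactly once and stop at the first success.
        for (E endpoint : suspended) {
            if (trySend.test(endpoint)) {
                return Optional.of(endpoint);
            }
        }
        return Optional.empty();
    }
}
```

In the two-node scenario above, the second node would have answered
during this extra round, and the caller could have marked it active
again instead of issuing a fault.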


Now I hope to receive a lot of comments and feedback. Maybe we can work
together to make improvements in this area.

Please point me to any existing functionality I may have missed!


Regards,
   Eric

