RE: WAN replication issue in cloud native environments

Alberto Bustamante Reyes Wed, 04 Dec 2019 03:20:12 -0800

Hi Jacob,

Yes, we are using the LoadBalancer service type. But note that the problem is 
not in the transport layer but in Geode: the GW senders complain 
“sender-2-parallel : Could not connect due to: There are no active servers.” 
when one of the servers in the receiving cluster is killed.

So there is still one server alive in the receiving cluster, but the GW sender 
does not know about it and the locator is not able to report its existence. 
Looking at the code, it seems the internal data structures (maps) holding the 
profiles use objects whose equality check relies only on hostname and port. 
This makes it impossible to differentiate servers when the same 
“hostname-for-senders” and port are used. When the killed server comes back up, 
the locator profiles are updated (the internal map is back to size()=1 although 
2+ servers are there) and the GW senders happily reconnect.
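
To illustrate the suspected collision, here is a minimal sketch in Python. It 
is not Geode code: HostPortKey, MemberAwareKey, register and unregister are 
hypothetical names, and the hostname is made up. It only shows what happens to 
a map whose key compares by hostname and port alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostPortKey:
    """Hypothetical key: equality and hash use hostname and port only."""
    hostname: str
    port: int


@dataclass(frozen=True)
class MemberAwareKey:
    """Hypothetical alternative key that also carries a member id."""
    hostname: str
    port: int
    member_id: str


def register(profiles, key, member_id):
    # A second registration under an equal key overwrites the first one.
    profiles[key] = member_id


def unregister(profiles, key):
    profiles.pop(key, None)


# Two receivers behind the same hostname-for-senders and port.
shared = {}
register(shared, HostPortKey("geode.example.com", 5000), "server-0")
register(shared, HostPortKey("geode.example.com", 5000), "server-1")
print(len(shared))  # 1: the two receivers collapsed into one entry

# Killing one server removes the only entry, so the map looks empty
# even though the other server is still alive.
unregister(shared, HostPortKey("geode.example.com", 5000))
print(len(shared))  # 0

# With a key that includes the member id, both entries survive.
distinct = {}
for member in ("server-0", "server-1"):
    register(distinct, MemberAwareKey("geode.example.com", 5000, member), member)
unregister(distinct, MemberAwareKey("geode.example.com", 5000, "server-0"))
print(len(distinct))  # 1: server-1 is still known
```

Whether a member id is the right discriminator inside Geode is an open 
question; the sketch only shows why a hostname-and-port key cannot tell the 
receivers apart.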

The solution with Geode as-is would be to expose each GW receiver on a 
different port outside of the k8s cluster. This means creating N Kubernetes 
services for N GW receivers, in addition to updating the service mesh 
configuration (if one is used), firewalls, etc. The declarative nature of 
Kubernetes means we must know the ports in advance, hence start-port and 
end-port must be equal when creating each GW receiver, and we need some 
well-known algorithm for assigning ports to GW receivers across servers. For 
example: server-0 port 5000, server-1 port 5001, server-2 port 5002, etc. So 
all GW receivers must be wired individually and we must turn off Geode’s 
random port allocation.
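
The "well-known algorithm" could be as simple as base port plus server 
ordinal. Below is a rough sketch in Python, assuming servers are named 
server-0, server-1, ... and the base port is 5000 as in the example above. The 
helper names are hypothetical, and the gfsh option names should be checked 
against the gfsh reference for the Geode version in use.

```python
BASE_PORT = 5000  # assumption: taken from the example in the text


def receiver_port(ordinal, base_port=BASE_PORT):
    """Deterministic port for the GW receiver on server-<ordinal>."""
    if ordinal < 0:
        raise ValueError("ordinal must be non-negative")
    return base_port + ordinal


def receiver_plan(replicas, base_port=BASE_PORT):
    """(server name, port) pairs, one per GW receiver."""
    return [(f"server-{i}", receiver_port(i, base_port)) for i in range(replicas)]


def gfsh_command(member, port):
    # start-port == end-port pins the receiver to a single known port,
    # which effectively disables random port selection within a range.
    return (
        f"create gateway-receiver --member={member} "
        f"--start-port={port} --end-port={port}"
    )


for member, port in receiver_plan(3):
    print(gfsh_command(member, port))
```

The same plan would also drive the N Kubernetes services, one per (server, 
port) pair.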

But we are exploring the possibility of Geode handling this cloud-native 
configuration a bit better. Locators should be capable of holding GW receiver 
information even when the receivers are hidden behind the same hostname and 
port. This is a code change in Geode and we would like to have the community’s 
opinion on it.

One obvious impact on the legacy behavior: when the locator picks a server on 
behalf of the client (the GW sender in this case), it does so based on the 
server load. When the sender connects, and given that all servers use the same 
VIP:PORT, it is the load balancer that decides where the connection ends up, 
which is likely not the server selected by the locator. So here we ignore the 
locator’s instructions. Since GW senders normally do not create a huge number 
of connections, this probably will not unbalance the cluster too much, but it 
is an impact worth considering. Custom load metrics would also be ignored by 
GW senders. Opinions?
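
A toy illustration of that mismatch, in Python. The load values are made up, 
locator_pick and round_robin_lb are hypothetical names, and round-robin is 
only an assumed load balancer policy; a real LB may use a different one.

```python
from itertools import cycle


def locator_pick(loads):
    """Pick the least-loaded server, roughly what a load-aware locator does."""
    return min(loads, key=loads.get)


def round_robin_lb(backends):
    """Assumed LB policy: hand out backends in a fixed rotation."""
    return cycle(sorted(backends))


loads = {"server-0": 0.7, "server-1": 0.2, "server-2": 0.5}
lb = round_robin_lb(loads)

advised = locator_pick(loads)  # the server the locator would recommend
actual = next(lb)              # the server the connection really lands on
print(advised, actual)
```

With these numbers the locator recommends server-1 while the first connection 
lands on server-0, so the recommendation has no effect on placement.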

An additional impact that comes to mind is the GW sender load-balance command 
and how its execution would be affected.

Thanks!

Alberto B.

________________________________
From: Jacob Barrett <jbarr...@pivotal.io>
Sent: Friday, November 29, 2019 13:06
To: dev@geode.apache.org <dev@geode.apache.org>
Subject: Re: WAN replication issue in cloud native environments

> On Nov 29, 2019, at 3:14 AM, Alberto Bustamante Reyes 
> <alberto.bustamante.re...@est.tech> wrote:
>
> The reason for such a setup is deploying Geode cluster on a Kubernetes 
> cluster where all GW receivers are reachable from the outside world on the 
> same VIP and port.

Are you using LoadBalancer Service type?

> Other kinds of configuration (different hostname and/or different port for 
> each GW receiver) are not cheap from OAM and resources perspective in cloud 
> native environments and also limit some important use-cases (like scaling).

If you could somehow configure host and port for sender (code modification 
required) would exposing each port through the LoadBalancer be too expensive 
too?

> The problem experienced is that shutting down one server is stopping 
> replication to this cluster until the server is up again. We suspect this is 
> because Geode incorrectly assumes there are no more alive servers when just 
> one of them is down (since they share hostname-for-senders and port).

Seems like in the worst case, when it tries to reconnect, the LB should give 
it a live server and it thinks the single server is back up.

-Jake
