Hi, I have been implementing a possible solution for this issue, and although I have not finished yet, I would like to kindly ask for comments.
I created some Helm charts to explain and reproduce the problem; if you are interested, they are here: https://github.com/alb3rtobr/geode-cloudnative-wan-replication

The solution consists of adding to ServerLocation the id of the member hosting the server, which makes it possible to differentiate two or more gateway receivers that have the same IP but are in different locations. I verified that this change fixes the problem.

After that, I have been working on fixing issues with the existing tests. In the meantime, it would be useful to get some feedback about the solution, especially if there are impacts I have not considered yet (maybe they are the reason for the failing tests I'm currently working on).

The code can be found in this PR: https://github.com/apache/geode/pull/4489

Thanks in advance!

Alberto B.

________________________________
From: Anilkumar Gingade <aging...@pivotal.io>
Sent: Friday, December 6, 2019 18:56
To: geode <dev@geode.apache.org>
Cc: Charlie Black <cbl...@pivotal.io>
Subject: Re: WAN replication issue in cloud native environments

Alberto,

Can you please file a JIRA ticket for this? This could come up often as more and more deployments move to K8s.

-Anil.

On Fri, Dec 6, 2019 at 8:33 AM Sai Boorlagadda <sai.boorlaga...@gmail.com> wrote:

> > if one gw receiver stops, the locator will publish to any remote locator
> > that there are no receivers up.
>
> I am not sure if locators proactively update remote locators about changes
> in the receivers list; rather, I think the senders figure this out on
> connection issues.
> But I see the problem that local-site locators have only one member in the
> list of receivers that they maintain, as all receivers register with a
> single <hostname:port> address.
>
> One idea I had earlier is to statically set the receivers list on locators
> (just like the remote-locators property), which are exchanged with
> gw-senders. This way we can introduce a boolean flag to turn off WAN
> discovery and use the statically configured addresses.
> This can also be useful for remote-locators if they are behind a service.
>
> Sai
>
> On Thu, Dec 5, 2019 at 2:33 AM Alberto Bustamante Reyes
> <alberto.bustamante.re...@est.tech> wrote:
>
> > Thanks Charlie, but the issue is not about connectivity. Summarizing the
> > issue, the problem is that if you have two or more gw receivers that are
> > started with the same values for the "hostname-for-senders", "start-port"
> > and "end-port" parameters ("start-port" and "end-port" being equal), then
> > if one gw receiver stops, the locator will publish to any remote locator
> > that there are no receivers up.
> >
> > And this use case is likely to happen in cloud-native environments, as
> > described.
> >
> > BR/
> >
> > Alberto B.
> > ________________________________
> > From: Charlie Black <cbl...@pivotal.io>
> > Sent: Wednesday, December 4, 2019 18:11
> > To: dev@geode.apache.org <dev@geode.apache.org>
> > Subject: Re: WAN replication issue in cloud native environments
> >
> > Alberto,
> >
> > Something else to think about: SNI-based routing. I believe Mario might be
> > working on adding SNI to Geode - he at least had a proposal that he
> > e-mailed out.
> >
> > The basics are that the destination host is in the SNI field and the proxy
> > can inspect and route the request to the right service instance. Plus we
> > have the option to not terminate the SSL at the proxy.
> >
> > Full disclosure - I haven't tried out SNI-based routing myself and it is
> > something that I thought could work as I was reading about it. From the
> > whiteboard I have done I think this will do ingress and egress just fine.
> > Potentially easier than port mapping and `hostname for clients` playing
> > around.
> >
> > Just something to think about.
> >
> > Charlie
> >
> >
> > On Wed, Dec 4, 2019 at 3:19 AM Alberto Bustamante Reyes
> > <alberto.bustamante.re...@est.tech> wrote:
> >
> > > Hi Jacob,
> > >
> > > Yes, we are using LoadBalancer service type.
> > > But note the problem is not in the transport layer but in Geode, as GW
> > > senders are complaining “sender-2-parallel : Could not connect due to:
> > > There are no active servers.” when one of the servers in the receiving
> > > cluster is killed.
> > >
> > > So, there is still one server alive in the receiving cluster but the GW
> > > sender does not know it and the locator is not able to inform about its
> > > existence. Looking at the code, it seems the internal data structures
> > > (maps) holding the profiles use objects whose equality check relies only
> > > on hostname and port. This makes it impossible to differentiate servers
> > > when the same “hostname-for-senders” and port are used. When the killed
> > > server comes back up, the locator profiles are updated (internal map back
> > > to size()=1 although 2+ servers are there) and GW senders happily
> > > reconnect.
> > >
> > > The solution with Geode as-is would be to expose each GW receiver on a
> > > different port outside of the k8s cluster. This includes creating N
> > > Kubernetes services for N GW receivers, in addition to updating the
> > > service mesh configuration (if it is used, firewalls etc.). The
> > > declarative nature of Kubernetes means we must know the ports in advance,
> > > hence start-port and end-port must be equal when creating each GW
> > > receiver, and we should have some well-known algorithm for creating GW
> > > receivers across servers. For example: server-0 port 5000, server-1 port
> > > 5001, server-2 port 5002, etc. So, all GW receivers must be wired
> > > individually and we must turn off Geode’s random port allocation.
> > >
> > > But we are exploring the possibility for Geode to handle this
> > > cloud-native configuration a bit better. Locators should be capable of
> > > holding GW receiver information even though the receivers are hidden
> > > behind the same hostname and port.
> > > This is a code change in Geode and we would like to have the community's
> > > opinion on it.
> > >
> > > One obvious impact on the legacy behavior: when the locator picks a
> > > server on behalf of the client (the GW sender in this case), it does so
> > > based on the server load. When the sender connects, and considering all
> > > servers are using the same VIP:PORT, it is the load balancer that will
> > > decide where the connection will end up, but likely not on the one
> > > selected by the locator. So here we ignore the locator instructions.
> > > Since GW senders normally do not create a huge number of connections,
> > > this probably will not unbalance the cluster too much. But this is an
> > > impact worth considering. Custom load metrics would also be ignored by GW
> > > senders. Opinions?
> > >
> > > An additional impact that comes to mind is the GW sender load-balance
> > > command and how its execution would be affected.
> > >
> > > Thanks!
> > >
> > > Alberto B.
> > >
> > > ________________________________
> > > From: Jacob Barrett <jbarr...@pivotal.io>
> > > Sent: Friday, November 29, 2019 13:06
> > > To: dev@geode.apache.org <dev@geode.apache.org>
> > > Subject: Re: WAN replication issue in cloud native environments
> > >
> > >
> > >
> > > > On Nov 29, 2019, at 3:14 AM, Alberto Bustamante Reyes
> > > > <alberto.bustamante.re...@est.tech> wrote:
> > > >
> > > > The reason for such a setup is deploying Geode cluster on a Kubernetes
> > > > cluster where all GW receivers are reachable from the outside world on
> > > > the same VIP and port.
> > >
> > > Are you using LoadBalancer Service type?
> > >
> > > > Other kinds of configuration (different hostname and/or different port
> > > > for each GW receiver) are not cheap from OAM and resources perspective
> > > > in cloud native environments and also limit some important use-cases
> > > > (like scaling).
> > >
> > > If you could somehow configure host and port for the sender (code
> > > modification required), would exposing each port through the
> > > LoadBalancer also be too expensive?
> > >
> > > > The problem experienced is that shutting down one server is stopping
> > > > replication to this cluster until the server is up again. We suspect
> > > > this is because Geode incorrectly assumes there are no more alive
> > > > servers when just one of them is down (since they share
> > > > hostname-for-senders and port).
> > >
> > > Seems like in the worst case, when it tries to reconnect, the LB should
> > > give it a live server and it thinks the single server is back up.
> > >
> > > -Jake
> > >
> > >
> >
> > --
> > Charlie Black | cbl...@pivotal.io
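
For readers following the thread, the equality problem described above (profiles keyed only on hostname and port, and the proposal to add the hosting member's id) can be illustrated with a small Java sketch. This is not Geode code: `HostPortKey`, `MemberAwareKey`, the host `gw.example.com`, port 5000 and the member ids `server-0`/`server-1` are all hypothetical names chosen for the illustration, and a plain `HashSet` stands in for whatever structures the locator actually uses.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

/** Hypothetical stand-in for a locator's receiver bookkeeping; not Geode's classes. */
final class ReceiverKeyDemo {

  /** Key whose equality uses host and port only, as described in the thread. */
  static final class HostPortKey {
    final String host;
    final int port;

    HostPortKey(String host, int port) {
      this.host = host;
      this.port = port;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof HostPortKey)) return false;
      HostPortKey k = (HostPortKey) o;
      return port == k.port && host.equals(k.host);
    }

    @Override
    public int hashCode() {
      return Objects.hash(host, port);
    }
  }

  /** Key that additionally includes the id of the hosting member (the proposed idea). */
  static final class MemberAwareKey {
    final String host;
    final int port;
    final String memberId;

    MemberAwareKey(String host, int port, String memberId) {
      this.host = host;
      this.port = port;
      this.memberId = memberId;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof MemberAwareKey)) return false;
      MemberAwareKey k = (MemberAwareKey) o;
      return port == k.port && host.equals(k.host) && memberId.equals(k.memberId);
    }

    @Override
    public int hashCode() {
      return Objects.hash(host, port, memberId);
    }
  }

  /** Builds a key for a receiver that sits behind the shared address. */
  private static Object key(boolean useMemberId, String memberId) {
    return useMemberId
        ? new MemberAwareKey("gw.example.com", 5000, memberId)
        : new HostPortKey("gw.example.com", 5000);
  }

  /** Registers two receivers behind one address, stops one, returns how many remain known. */
  static int receiversLeftAfterOneStops(boolean useMemberId) {
    Set<Object> receivers = new HashSet<>();
    receivers.add(key(useMemberId, "server-0"));
    receivers.add(key(useMemberId, "server-1"));
    receivers.remove(key(useMemberId, "server-0")); // server-0 goes down
    return receivers.size();
  }

  public static void main(String[] args) {
    System.out.println("host:port only -> " + receiversLeftAfterOneStops(false));
    System.out.println("with member id -> " + receiversLeftAfterOneStops(true));
  }
}
```

With the host:port-only key, both receivers collapse into one entry, so removing one leaves the set empty even though a second receiver is still running; with the member id included, one entry remains. Whether the actual change in the PR is structured this way is not something this sketch claims.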