Hi Alberto,

Thank you for your contributions to the Apache Geode project.

Here is some feedback, along with a few pointers, that we came up with:
1. Looking at your solution, we can see that you are modifying the class
"ServerLocation", which is stored as a key in the connectionMap in
LocatorLoadSnapshot.
2. ServerLocation was modified to include the memberID, to differentiate
servers that share the same hostname-for-senders and the same pair of start
and end ports.
3. Because ServerLocation is transmitted, this required a lot of changes to
serialization, as well as modifications to the ops code.

Suggestion:
- Instead, could we create a new class that contains the memberID and the
ServerLocation, and use objects of that new class as keys in the
connectionMap?
- When a member leaves, only that member's entry is removed from the
connectionMap.
- When the remote locator requests the receiver information, we continue
sending the ServerLocation, which we extract from the newly created class.
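
To make the suggestion concrete, here is a minimal sketch. It uses simplified
stand-in classes rather than the real Geode code: the names Location and
MemberLocationKey, the Integer placeholder for the load value, and the
host/port values are all hypothetical, chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch: simplified stand-ins, not the real Geode classes.
class ReceiverKeySketch {

  // Stand-in for ServerLocation: equality is based on host and port only.
  static final class Location {
    final String host;
    final int port;

    Location(String host, int port) {
      this.host = host;
      this.port = port;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Location)) return false;
      Location other = (Location) o;
      return port == other.port && host.equals(other.host);
    }

    @Override
    public int hashCode() {
      return Objects.hash(host, port);
    }
  }

  // Hypothetical wrapper used only as a map key: member ID plus the
  // unchanged location. The location itself is what would go on the wire.
  static final class MemberLocationKey {
    final String memberId;
    final Location location;

    MemberLocationKey(String memberId, Location location) {
      this.memberId = memberId;
      this.location = location;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof MemberLocationKey)) return false;
      MemberLocationKey other = (MemberLocationKey) o;
      return memberId.equals(other.memberId) && location.equals(other.location);
    }

    @Override
    public int hashCode() {
      return Objects.hash(memberId, location);
    }
  }

  // Two receivers behind the same host:port collapse into one map entry
  // when the location alone is the key.
  static int entriesKeyedByLocationOnly() {
    Map<Location, Integer> connectionMap = new HashMap<>();
    connectionMap.put(new Location("wan.example.com", 5000), 0);
    connectionMap.put(new Location("wan.example.com", 5000), 0);
    return connectionMap.size();
  }

  // With the wrapper key, removing one member leaves the other entry intact.
  static int receiversLeftAfterDeparture() {
    Map<MemberLocationKey, Integer> connectionMap = new HashMap<>();
    Location shared = new Location("wan.example.com", 5000);
    connectionMap.put(new MemberLocationKey("member-1", shared), 0);
    connectionMap.put(new MemberLocationKey("member-2", shared), 0);
    connectionMap.remove(new MemberLocationKey("member-1", shared));
    return connectionMap.size();
  }

  public static void main(String[] args) {
    System.out.println(entriesKeyedByLocationOnly());  // prints 1
    System.out.println(receiversLeftAfterDeparture()); // prints 1
  }
}
```

In this sketch the first method shows the current behaviour (both receivers
share one entry, so one departure empties the map), and the second shows that
the wrapper key keeps the surviving receiver visible while the transmitted
location stays unchanged.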

Advantages:
- No changes required in terms of serialization, as we are still sending the
ServerLocation like before.
- No changes to the ops code.
- No extra bits sent over the wire.


Please let us know what you think of this solution.

Regards
Naba

On Tue, Jan 21, 2020 at 7:01 AM Alberto Bustamante Reyes
<alberto.bustamante.re...@est.tech> wrote:

> Hi,
>
> I have been implementing a possible solution for this issue, and although
> I have not finished yet, I would like to kindly ask for comments.
>
> I created some Helm charts to explain and reproduce the problem, if you
> are interested they are here:
> https://github.com/alb3rtobr/geode-cloudnative-wan-replication
>
> The solution consists of adding to ServerLocation the id of the member
> hosting the server, which makes it possible to differentiate two or more
> gateway receivers that have the same IP but are in different locations.
> I verified that this change fixes the problem.
>
> After that, I have been working on fixing issues with the existing tests.
> In the meantime, it would be useful to get some feedback on the
> solution, especially if there are impacts I have not considered yet (maybe
> they are the reason for the failing tests I'm currently working on).
>
> The code can be found on this PR:
> https://github.com/apache/geode/pull/4489
>
> Thanks in advance!
>
> Alberto B.
>
>
> ________________________________
> From: Anilkumar Gingade <aging...@pivotal.io>
> Sent: Friday, December 6, 2019 18:56
> To: geode <dev@geode.apache.org>
> Cc: Charlie Black <cbl...@pivotal.io>
> Subject: Re: WAN replication issue in cloud native environments
>
> Alberto,
>
> Can you please file a JIRA ticket for this? This could come up often as
> more and more deployments move to K8s.
>
> -Anil.
>
>
> On Fri, Dec 6, 2019 at 8:33 AM Sai Boorlagadda <sai.boorlaga...@gmail.com>
> wrote:
>
> > > if one gw receiver stops, the locator will publish to any remote
> > > locator that there are no receivers up.
> >
> > I am not sure whether locators proactively update remote locators about
> > changes in the receivers list; rather, I think the senders figure this
> > out on connection issues.
> > But I see the problem: local-site locators have only one member in the
> > list of receivers that they maintain, as all receivers register with a
> > single <hostname:port> address.
> >
> > One idea I had earlier is to statically set the receivers list on the
> > locators (just like the remote-locators property), which are then
> > exchanged with gw-senders. This way we can introduce a boolean flag to
> > turn off WAN discovery and use the statically configured addresses. This
> > can also be useful for remote-locators if they are behind a service.
> >
> > Sai
> >
> > On Thu, Dec 5, 2019 at 2:33 AM Alberto Bustamante Reyes
> > <alberto.bustamante.re...@est.tech> wrote:
> >
> > > Thanks Charlie, but the issue is not about connectivity. To summarize,
> > > the problem is that if you have two or more gw receivers that are
> > > started with the same values of "hostname-for-senders", "start-port"
> > > and "end-port" (with "start-port" and "end-port" being equal), and one
> > > gw receiver stops, the locator will publish to any remote locator that
> > > there are no receivers up.
> > >
> > > And this use case is likely to happen in cloud-native environments, as
> > > described.
> > >
> > > BR/
> > >
> > > Alberto B.
> > > ________________________________
> > > From: Charlie Black <cbl...@pivotal.io>
> > > Sent: Wednesday, December 4, 2019 18:11
> > > To: dev@geode.apache.org <dev@geode.apache.org>
> > > Subject: Re: WAN replication issue in cloud native environments
> > >
> > > Alberto,
> > >
> > > Something else to think about: SNI-based routing. I believe Mario
> > > might be working on adding SNI to Geode - he at least had a proposal
> > > that he e-mailed out.
> > >
> > > The basics are that the destination host is in the SNI field and the
> > > proxy can inspect it and route the request to the right service
> > > instance. Plus, we have the option to not terminate the SSL at the
> > > proxy.
> > >
> > > Full disclosure - I haven't tried out SNI-based routing myself; it is
> > > something that I thought could work as I was reading about it. From the
> > > whiteboarding I have done, I think this will do ingress and egress just
> > > fine. Potentially easier than playing around with port mapping and
> > > `hostname for clients`.
> > >
> > > Just something to think about.
> > >
> > > Charlie
> > >
> > >
> > > On Wed, Dec 4, 2019 at 3:19 AM Alberto Bustamante Reyes
> > > <alberto.bustamante.re...@est.tech> wrote:
> > >
> > > > Hi Jacob,
> > > >
> > > > Yes, we are using the LoadBalancer service type. But note that the
> > > > problem is not in the transport layer but in Geode, as GW senders are
> > > > complaining “sender-2-parallel : Could not connect due to: There are
> > > > no active servers.” when one of the servers in the receiving cluster
> > > > is killed.
> > > >
> > > > So, there is still one server alive in the receiving cluster, but the
> > > > GW sender does not know it and the locator is not able to inform it
> > > > about the server's existence. Looking at the code, it seems the
> > > > internal data structures (maps) holding the profiles use objects
> > > > whose equality check relies only on hostname and port. This makes it
> > > > impossible to differentiate servers when the same
> > > > “hostname-for-senders” and port are used. When the killed server
> > > > comes back up, the locator profiles are updated (internal map back to
> > > > size()=1 although 2+ servers are there) and GW senders happily
> > > > reconnect.
> > > >
> > > > The solution with Geode as-is would be to expose each GW receiver on
> > > > a different port outside of the k8s cluster. This includes creating N
> > > > Kubernetes services for N GW receivers, in addition to updating the
> > > > service mesh configuration (if one is used), firewalls, etc. The
> > > > declarative nature of Kubernetes means we must know the ports in
> > > > advance, hence start-port and end-port must be equal when creating
> > > > each GW receiver, and we should have some well-known algorithm for
> > > > assigning ports when creating GW receivers across servers. For
> > > > example: server-0 port 5000, server-1 port 5001, server-2 port 5002,
> > > > etc. So, all GW receivers must be wired individually and we must turn
> > > > off Geode’s random port allocation.
> > > >
> > > > But we are exploring the possibility for Geode to handle this
> > > > cloud-native configuration a bit better. Locators should be capable
> > > > of holding GW receiver information even when the receivers are hidden
> > > > behind the same hostname and port. This is a code change in Geode and
> > > > we would like to have the community's opinion on it.
> > > >
> > > > One obvious impact on the legacy behavior: when the locator picks a
> > > > server on behalf of the client (the GW sender in this case), it does
> > > > so based on the server load. When the sender connects, since all
> > > > servers are using the same VIP:PORT, it is the load balancer that
> > > > decides where the connection ends up, and that is likely not the
> > > > server selected by the locator. So here we ignore the locator's
> > > > instructions. Since GW senders normally do not create a huge number
> > > > of connections, this probably will not unbalance the cluster too
> > > > much. But it is an impact worth considering. Custom load metrics
> > > > would also be ignored by GW senders. Opinions?
> > > >
> > > > An additional impact that comes to mind is the GW sender load-balance
> > > > command and how its execution would be affected.
> > > >
> > > > Thanks!
> > > >
> > > > Alberto B.
> > > >
> > > > ________________________________
> > > > From: Jacob Barrett <jbarr...@pivotal.io>
> > > > Sent: Friday, November 29, 2019 13:06
> > > > To: dev@geode.apache.org <dev@geode.apache.org>
> > > > Subject: Re: WAN replication issue in cloud native environments
> > > >
> > > >
> > > >
> > > > > On Nov 29, 2019, at 3:14 AM, Alberto Bustamante Reyes
> > > > > <alberto.bustamante.re...@est.tech> wrote:
> > > > > >
> > > > > > The reason for such a setup is deploying Geode cluster on a
> > > > > > Kubernetes cluster where all GW receivers are reachable from the
> > > > > > outside world on the same VIP and port.
> > > >
> > > > Are you using LoadBalancer Service type?
> > > >
> > > > > > Other kinds of configuration (different hostname and/or different
> > > > > > port for each GW receiver) are not cheap from OAM and resources
> > > > > > perspective in cloud native environments and also limit some
> > > > > > important use-cases (like scaling).
> > > >
> > > > If you could somehow configure host and port for the sender (code
> > > > modification required), would exposing each port through the
> > > > LoadBalancer be too expensive as well?
> > > >
> > > > > > The problem experienced is that shutting down one server is
> > > > > > stopping replication to this cluster until the server is up
> > > > > > again. We suspect this is because Geode incorrectly assumes there
> > > > > > are no more alive servers when just one of them is down (since
> > > > > > they share hostname-for-senders and port).
> > > >
> > > > Seems like in the worst case, when it tries to reconnect, the LB
> > > > should give it a live server and it thinks the single server is back
> > > > up.
> > > >
> > > > -Jake
> > > >
> > > >
> > >
> > > --
> > > Charlie Black | cbl...@pivotal.io
> > >
> >
>
