RE: WAN replication issue in cloud native environments

Alberto Bustamante Reyes Tue, 03 Mar 2020 03:44:57 -0800

Hi Bruce,

Thanks for your comments, but we are not planning to use TLS, so I'm afraid the 
PR you are working on will not solve this problem.


The origin of this issue is that we would like to be able to configure all gw 
receivers with the same "hostname-for-senders" value. The reason is that we 
will run a multisite Geode cluster, with each site in a different cloud 
environment, so using just one hostname makes configuration much easier.

When we tried to configure the cluster in this way, we experienced an issue 
with replication. Using the same hostname-for-senders value causes different 
servers to have equal ServerLocation objects, so if one receiver is down, the 
others are considered down too. With the change suggested by Jacob this 
problem is solved, and replication works fine.
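
To illustrate the equality problem described above, here is a minimal sketch. 
It assumes a location key compared only by host and port; HostPortLocation, 
SharedHostnameDemo, the hostname and the port are hypothetical names for 
illustration and are not Geode's actual classes or configuration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class SharedHostnameDemo {
    // Registers one entry per receiver, all keyed by the same shared host and port.
    static Map<HostPortLocation, String> register(String[] members, String host, int port) {
        Map<HostPortLocation, String> byLocation = new HashMap<>();
        for (String member : members) {
            byLocation.put(new HostPortLocation(host, port), member);
        }
        return byLocation;
    }

    public static void main(String[] args) {
        String[] receivers = {"receiver-1", "receiver-2", "receiver-3"};
        Map<HostPortLocation, String> byLocation = register(receivers, "gw.example.com", 5000);
        // All three receivers collapse into a single key.
        System.out.println("entries: " + byLocation.size());
        // Marking "one" receiver as down removes the only key there is.
        byLocation.remove(new HostPortLocation("gw.example.com", 5000));
        System.out.println("entries after one removal: " + byLocation.size());
    }
}

// Hypothetical stand-in for a location identified only by host and port.
final class HostPortLocation {
    final String host;
    final int port;

    HostPortLocation(String host, int port) {
        this.host = host;
        this.port = port;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof HostPortLocation)) {
            return false;
        }
        HostPortLocation other = (HostPortLocation) o;
        return port == other.port && host.equals(other.host);
    }

    @Override
    public int hashCode() {
        return Objects.hash(host, port);
    }
}
```

Because every receiver produces an equal key, any structure keyed by location 
alone cannot tell the receivers apart.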

We are currently working on another issue related to this change: gw sender 
pings are not reaching the gw receivers, so ClientHealthMonitor closes the 
connections. I saw that the ping tasks are created per ServerLocation, so I 
have tried to solve the issue by creating them per Endpoint instead. This 
change is not finished yet: in its current state it causes connections from gw 
servers to gw receivers to be closed every 5 seconds.

Why don't you like the idea of using the InternalDistributedMember to 
distinguish server locations? Are you thinking of another alternative? In this 
use case, two different gw receivers will have the same ServerLocation, so we 
need to distinguish them.

BR/

Alberto B.

________________________________
From: Bruce Schuchardt <bschucha...@pivotal.io>
Sent: Monday, March 2, 2020 20:20
To: dev@geode.apache.org <dev@geode.apache.org>; Jacob Barrett 
<jbarr...@pivotal.io>
Cc: Anilkumar Gingade <aging...@pivotal.io>; Charlie Black <cbl...@pivotal.io>
Subject: Re: WAN replication issue in cloud native environments

I'm coming to this conversation late and probably am missing a lot of context.  
Is the point of this to direct senders to some common gateway that all of 
the gateway receivers are configured to advertise?  I've been working on a PR 
to support redirection of connections for client/server and gateway 
communications to a common address and put the destination host name in the 
SNIHostName TLS parameter.  Then you won't have to tell servers about the 
common host name - just tell clients what the gateway is and they'll connect to 
it & tell it what the target host name is via the SNIHostName.  However, that 
only works if SSL is enabled.

PR 4743 is a step toward this approach and changes TcpClient and SocketCreator 
to take an unresolved host address.  After this is merged another change will 
allow folks to set a gateway host/port that will be used to form connections 
and insert the destination hostname into the SNIHostName SSLParameter.

I would really like us to avoid including InternalDistributedMembers in 
equality checks for server-locations.  To-date we've only held these 
identifiers in Endpoints and other places for debugging purposes and have used 
ServerLocation to identify servers.

On 1/27/20, 8:56 AM, "Alberto Bustamante Reyes" 
<alberto.bustamante.re...@est.tech> wrote:

    Hi again,

    Status update: the simplification of the maps suggested by Jacob made the 
newly proposed class containing the ServerLocation and the member id 
unnecessary. With this refactoring, replication is working in the scenario we 
have been discussing in this conversation. That's great, and I think the code 
can be merged into develop if there are no extra comments in the PR.

    But this does not mean we can say that Geode is able to work properly when 
using gw receivers with the same ip + port. We have seen that when working with 
this configuration, there is a problem with the pings sent from gw senders 
(which act as clients) to the gw receivers (servers). The pings are reaching 
just one of the receivers, so the sender-receiver connection is eventually 
closed by the ClientHealthMonitor.

    Do you have any suggestions about how to handle this issue? My first idea 
was to identify where the connection is created, to check whether the sender 
could somehow be aware that there is more than one server to which the ping 
should be sent, but I'm not sure if that is possible. The alternative could be 
to change the ClientHealthMonitor to be "clever" enough not to close 
connections in this case. Any comment is welcome 🙂

    Thanks,

    Alberto B.

    ________________________________
    From: Jacob Barrett <jbarr...@pivotal.io>
    Sent: Wednesday, January 22, 2020 19:01
    To: Alberto Bustamante Reyes <alberto.bustamante.re...@est.tech>
    Cc: dev@geode.apache.org <dev@geode.apache.org>; Anilkumar Gingade 
<aging...@pivotal.io>; Charlie Black <cbl...@pivotal.io>
    Subject: Re: WAN replication issue in cloud native environments



    On Jan 22, 2020, at 9:51 AM, Alberto Bustamante Reyes 
<alberto.bustamante.re...@est.tech> wrote:

    Thanks Naba & Jacob for your comments!



    @Naba: I have been implementing a solution as you suggested, and I think it 
would be convenient if the client knows the memberId of the server it is 
connected to.

    (current code is here: https://github.com/apache/geode/pull/4616 )

    For example, in:

    LocatorLoadSnapshot::getReplacementServerForConnection(ServerLocation 
currentServer, String group, Set<ServerLocation> excludedServers)

    In this method, the client has sent the ServerLocation, but if that object 
does not contain the memberId, I don't see how to guarantee that the 
replacement that is returned is not the same server the client is currently 
connected to.
    Inside that method, this other method is called:


    Given that your setup is masquerading multiple members behind the same host 
and port (ServerLocation) it doesn’t matter. When the pool opens a new socket 
to the replacement server it will be to the shared hostname and port and the 
Kubernetes service at that host and port will just pick a backend host. In the 
solution we suggested we preserved that behavior since the k8s service can’t 
determine which backend member to route the connection to based on the member 
id.


    LocatorLoadSnapshot::isCurrentServerMostLoaded(currentServer, groupServers)

    where groupServers is a "Map<ServerLocationAndMemberId, LoadHolder>" 
object. If the keys of that map have the same host and port, they differ only 
in the memberId. But as you don't know it (you just have currentServer, 
which contains host and port), you cannot get the correct LoadHolder value, so 
you cannot know if your server is the most loaded.

    Again, given your use case the behavior of this method is lost when a new 
connection is established by the pool through the shared hostname anyway.
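
The lookup ambiguity discussed in this exchange can be sketched as follows. 
This is only an illustration under stated assumptions: the composite key is 
modeled as a plain "host:port|memberId" string, and AmbiguousLoadLookup, key 
and candidates are hypothetical names rather than anything in 
LocatorLoadSnapshot.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AmbiguousLoadLookup {
    // Hypothetical composite key format: "host:port|memberId".
    static String key(String hostPort, String memberId) {
        return hostPort + "|" + memberId;
    }

    // Returns every key whose host:port part matches what the client sent.
    static List<String> candidates(Map<String, Double> loadByKey, String hostPort) {
        List<String> matches = new ArrayList<>();
        for (String k : loadByKey.keySet()) {
            if (k.startsWith(hostPort + "|")) {
                matches.add(k);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, Double> loadByKey = new LinkedHashMap<>();
        loadByKey.put(key("gw.example.com:5000", "member-1"), 0.2);
        loadByKey.put(key("gw.example.com:5000", "member-2"), 0.9);
        // Host and port alone match both members, so the load holder
        // for "the current server" cannot be identified.
        System.out.println(candidates(loadByKey, "gw.example.com:5000"));
    }
}
```

With only host and port available, more than one entry matches, which is why 
the per-server load comparison loses its meaning behind a shared address.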

    @Jacob: I think the solution ultimately implies that the client has to know 
the memberId. I think we could simplify the maps.

    The client isn’t keeping these load maps, the locator is, and the locator 
knows all the member ids. The client end only needs to know the host/port 
combination. In your example, the WAN replication (a client to the remote 
cluster) connects to the shared host/port service and gets randomly routed to 
one of the backend servers in that service.

    All of this locator balancing code is unnecessary in this model, where 
something else is choosing the final destination. The goal of our proposed 
changes was to recognize that all we need is to make sure the locator keeps the 
shared ServerLocation alive in its responses to clients by tracking the 
associated members and reducing that set to the set of unique ServerLocations. 
In your case that will always reduce to 1 ServerLocation for N members, as 
long as 1 member is still up.
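
The tracking idea described above could look roughly like the sketch below. 
It is not the code from the PR: SharedLocationTracker and its methods are 
hypothetical names, and locations and member ids are modeled as plain strings 
for brevity.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class SharedLocationTracker {
    // For each shared location (host:port), the ids of members currently up.
    private final Map<String, Set<String>> membersByLocation = new HashMap<>();

    void memberJoined(String location, String memberId) {
        membersByLocation.computeIfAbsent(location, k -> new HashSet<>()).add(memberId);
    }

    void memberLeft(String location, String memberId) {
        Set<String> members = membersByLocation.get(location);
        if (members == null) {
            return;
        }
        members.remove(memberId);
        // The location disappears only when its last member is gone.
        if (members.isEmpty()) {
            membersByLocation.remove(location);
        }
    }

    // The unique locations that would be returned to clients.
    Set<String> liveLocations() {
        return new HashSet<>(membersByLocation.keySet());
    }

    public static void main(String[] args) {
        SharedLocationTracker tracker = new SharedLocationTracker();
        tracker.memberJoined("gw.example.com:5000", "member-1");
        tracker.memberJoined("gw.example.com:5000", "member-2");
        tracker.memberLeft("gw.example.com:5000", "member-1");
        // One member is still up, so the shared location is still advertised.
        System.out.println(tracker.liveLocations());
    }
}
```

N members behind one address reduce to a single advertised location, and that 
location stays advertised until the last member leaves.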

    -Jake
