Hi Alfredo,

Undoubtedly the GPII needs to be more resilient in the face of transient 
outages. However, it's not yet clear to me what the best approach is, nor 
where within our system that responsibility should be placed. 
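
To make the question concrete: one candidate placement is the preferences 
server's own data access layer, retrying transient failures before surfacing 
an error. A rough sketch of the pattern follows, in Python rather than our 
actual Node.js stack, and purely illustrative:

    import time
    import urllib.error
    import urllib.request

    def get_with_retry(url, attempts=3, backoff=0.5):
        """Fetch url, retrying transient failures with exponential backoff."""
        for attempt in range(attempts):
            try:
                with urllib.request.urlopen(url, timeout=2) as resp:
                    return resp.read()
            except (urllib.error.URLError, TimeoutError):
                if attempt == attempts - 1:
                    raise  # retry budget exhausted; surface the failure
                time.sleep(backoff * (2 ** attempt))  # 0.5s, 1s, ...

Whether a retry budget like that is compatible with our latency requirements 
is exactly the kind of question I'd like us to settle deliberately.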

To start, can you elaborate on the thinking behind your design and approach for 
this cluster? Which load balancer are you using (ELB or other?) and how does 
your use of it relate to Kubernetes' Ingress (if at all)? What kind of load 
balancing policies are you employing? To what extent can you tailor the load 
balancing policy to suit the kinds of issues that might arise in a CouchDB 
cluster? How are you monitoring the health of CouchDB, and are you using its 
status endpoint?
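
For reference, CouchDB 2.x exposes a /_up endpoint intended for exactly this 
kind of probing (the hostname below is hypothetical):

    $ curl -s -o /dev/null -w "%{http_code}\n" http://couchdb.example:5984/_up
    200

Usefully, it returns 404 when the node is in maintenance mode, which makes it 
possible to drain a node deliberately without killing the process.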

What options are available at the load balancer level to redirect a failed 
request to a healthy instance? What kinds of application failure behaviours 
were you planning and designing for when you set this cluster up? Why did you 
choose this approach?
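
As one data point: with a software balancer such as HAProxy, this is the 
classic "retries" plus "option redispatch" pair, which resends a failed 
connection attempt to another backend server (addresses below hypothetical):

    backend couchdb
        option redispatch
        retries 3
        server couch0 10.0.0.10:5984 check
        server couch1 10.0.0.11:5984 check
        server couch2 10.0.0.12:5984 check

Note that redispatch helps with failed connection attempts, not with 5XX 
responses from a node that is reachable but unhealthy. Whether ELB or 
kube-proxy offers anything equivalent is part of what I'm asking.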

I have no way to meaningfully evaluate the work you've done or the issue 
you're encountering, nor to plan an effective architectural approach, without 
greater insight into the thinking and planning that went into this 
implementation.

Thanks very much,

Colin


> On Oct 24, 2017, at 8:56 AM, Alfredo Matas <[email protected]> 
> wrote:
> 
> Hi all,
> 
> as we have exposed in previous architecture meetings, the devops team
> is working on the setup of GPII cloud using Kubernetes on AWS.
> 
> The current setup uses a fixed number of CouchDB instances that all work
> together as a single cluster. Kubernetes, using an AWS service that
> provides load balancing, exposes an endpoint through which the preferences
> server can connect to the database. The load balancer directs each request
> to one of the instances of the CouchDB cluster at random. This setup
> allows us to scale the database up and down.
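> 
> Concretely, the database is exposed through a Kubernetes Service of type
> LoadBalancer, along these lines (an illustrative sketch, not our exact
> manifest):
> 
>     apiVersion: v1
>     kind: Service
>     metadata:
>       name: couchdb
>     spec:
>       type: LoadBalancer
>       selector:
>         app: couchdb
>       ports:
>         - port: 5984
>           targetPort: 5984
> 
> On AWS this provisions an ELB in front of the instances.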
> 
> Kubernetes also provides a health-checking system that verifies that
> every CouchDB instance is working properly. The health checks are HTTP
> requests to a particular URL on each instance; if the return code is 2XX,
> Kubernetes considers the node healthy and keeps it in the pool of
> candidates for handling database requests.
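> 
> Roughly, the probe definition has this shape (the values below are
> illustrative, not our exact settings):
> 
>     readinessProbe:
>       httpGet:
>         path: /_up        # or whichever status URL is checked
>         port: 5984
>       periodSeconds: 10
>       failureThreshold: 3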
> 
> After some testing we have found corner cases in which a CouchDB instance
> can be unhealthy while still receiving requests, which then fail with
> error codes or connection timeouts. For example, if a CouchDB process
> dies, Kubernetes takes some time to notice and to remove the instance
> from the pool, because the health checks fire at defined time intervals.
> Meanwhile, all reads sent to that instance fail.
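> 
> To put numbers on it: with periodSeconds: 10 and failureThreshold: 3 (the
> illustrative values above), a dead instance can keep receiving traffic
> for up to roughly 10 x 3 = 30 seconds before Kubernetes marks it unready
> and drops it from the endpoints.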
> 
> A similar issue happens when we scale the cluster up and add new CouchDB
> instances. As soon as an instance reports ready, Kubernetes puts it in
> the pool of choices for the load balancer, even though the database is
> not yet fully synchronized. So the first reads from that instance fail
> with 5XX or 4XX errors as well.
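> 
> A stricter readiness check might narrow this window, for example an exec
> probe that only passes once a known document is actually readable on the
> node (a sketch; the database and document names are hypothetical):
> 
>     readinessProbe:
>       exec:
>         command:
>           - sh
>           - -c
>           - curl -fsS http://localhost:5984/prefsdb/ready-marker > /dev/null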
> 
> The preferences server returns 5XX errors when the database is shutting
> down ("Failing to resume callback for request xxxxxxx-xxxxxx which has
> already concluded"), returns an empty reply when it cannot reach the
> instance, and sometimes the process hangs until the connection is
> reestablished. We are using a cluster with 3 CouchDB instances, and we
> are talking about only a few seconds of disruption before the database
> service is reestablished. Is this the expected behavior? Is it reasonable
> to see these errors when there is a problem at the database?
> 
> We know we have talked before about other kinds of errors (power outages,
> network connection errors, ...) but we don't remember whether we have
> discussed anything like the above. So, the devops team would like to hear
> any concerns or thoughts about these issues in order to take them into
> account for future deployments.
> 
> 
> -- 
> Alfredo

_______________________________________________
Architecture mailing list
[email protected]
https://lists.gpii.net/mailman/listinfo/architecture
