Hi Alfredo,

Undoubtedly the GPII needs to be, in some way, more resilient in the face of transient outages. However, it's not fully clear to me what the best approach is, or where within our system this responsibility should be placed.
To start, can you elaborate on the thinking behind your design and approach for this cluster?

Which load balancer are you using (ELB or other?), and how does your use of it relate to Kubernetes' Ingress (if at all)? What kind of load balancing policies are you employing? To what extent can you tailor the load balancing policy to suit the kinds of issues that might arise in a CouchDB cluster?

How are you monitoring the health of CouchDB, and are you using its status endpoint? What options are available at the load balancer level to redirect to a healthy instance if a request fails?

What kinds of application failure behaviours were you planning and designing for when you set this cluster up? Why did you choose this approach?

I have no way to meaningfully evaluate the work you've done or the issue you're encountering, nor to plan an effective architecture approach, without greater insight into the thinking and planning that went into this implementation.

Thanks very much,

Colin

> On Oct 24, 2017, at 8:56 AM, Alfredo Matas <[email protected]> wrote:
>
> Hi all,
>
> As we have explained in previous architecture meetings, the DevOps team is working on the setup of the GPII cloud using Kubernetes on AWS.
>
> The current setup uses a defined number of CouchDB instances that all work together as a single cluster. Kubernetes, using an AWS service that provides load balancing, sets up an endpoint through which the preferences server can connect to the database. The load balancer redirects each request to one of the instances of the CouchDB cluster at random. This setup allows us to scale the database up and down.
>
> Kubernetes also provides a health system that checks that every CouchDB instance is "working properly". The health checks are based on HTTP requests to a particular URL on each instance; if the return code is 2XX, Kubernetes understands that the node is working, so it remains in the pool of candidates for handling database requests.
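As a point of reference, the probe rule described above (any 2XX response counts as healthy) amounts to something like the sketch below. The port and URL are assumptions on my part, since the message doesn't say which endpoint the checks actually hit; CouchDB 2.x exposes a `/_up` status endpoint on port 5984 for exactly this purpose, returning 200 when the node is up and 404 when it is in maintenance mode.

```python
from urllib.request import urlopen


def is_healthy(status_code):
    """Mirror the Kubernetes rule: any 2XX response counts as healthy."""
    return 200 <= status_code < 300


def probe(url="http://localhost:5984/_up", timeout=2):
    """One health check: GET the status URL and apply the 2XX rule.

    The URL is an assumption; CouchDB 2.x's /_up endpoint is one
    candidate, but the actual probe target isn't stated in the thread.
    """
    try:
        with urlopen(url, timeout=timeout) as resp:
            return is_healthy(resp.status)
    except OSError:
        # Connection refused or timed out: treat the node as unhealthy.
        return False
```

A readiness probe in the pod spec would run an equivalent check at a fixed interval, which is exactly where the detection gap comes from: between two probes, a dead instance can still receive traffic.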
>
> After some testing we have identified some corner cases where one instance of CouchDB is not working correctly but still receives requests, returning either error codes or connection timeouts. For example, if a CouchDB process dies, Kubernetes will take some time to notice that the instance has died and to take it out of the pool, because the health checks fire at defined time intervals. Meanwhile, all reads directed to that instance will fail.
>
> A similar issue happens when we scale the cluster up and add new CouchDB instances. As soon as an instance is ready to work, Kubernetes puts it in the pool of choices for the load balancer, even though the database is not yet fully synchronized. So the first reads to that instance fail with 5XX or 4XX errors as well.
>
> The preferences server returns 5XX errors when the database is shutting down ("Failing to resume callback for request xxxxxxx-xxxxxx which has already concluded"), returns an empty reply if it is not able to reach the instance, and sometimes the process hangs until the connection is re-established. We are using a cluster with 3 instances of CouchDB, and we are talking about only a few seconds of drift before the database service is re-established. Is this the expected behavior? Is it reasonable to see these errors when there is a problem at the database?
>
> We know that we have talked before about other kinds of errors (power outages, network connection errors, ...), but we don't remember whether we have discussed something similar to the above. So the DevOps team would like to hear any concerns or thoughts about these issues, in order to take them into account for future deployments.
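The failures described here are transient (a few seconds until the pool recovers), which is the classic case for retrying with backoff on the client side. Nothing in the thread indicates the preferences server does this today; the sketch below is one possible shape for such a mitigation, not a description of existing GPII code.

```python
import time


def with_retries(operation, attempts=4, base_delay=0.5,
                 retriable=(ConnectionError, TimeoutError)):
    """Call operation(); on a retriable failure, wait and try again.

    Delays grow exponentially (0.5s, 1s, 2s, ...), which comfortably
    covers a few seconds of drift while the pool recovers. The last
    failure is re-raised so genuine outages still surface to the caller.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retriable:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


# Hypothetical example: a read that fails twice (as during a pool
# hiccup), then succeeds once a healthy node receives the request.
calls = {"n": 0}

def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("empty reply from CouchDB endpoint")
    return {"status": "ok"}
```

An HTTP layer would have to map 5XX responses and empty replies onto a retriable exception type, and only idempotent reads should be retried blindly; writes need more care.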
>
> --
> Alfredo
> _______________________________________________
> Architecture mailing list
> [email protected]
> https://lists.gpii.net/mailman/listinfo/architecture
