Hi Colin,

On Tue, Oct 24, 2017 at 9:47 PM, Colin Clark <[email protected]> wrote:
> Hi Alfredo,
>
> Undoubtedly the GPII needs to be, in some way, more resilient in the face of
> transient outages. However, it's not fully clear to me what the best approach
> is and where within our system this responsibility should be placed.
>
> To start, can you elaborate on the thinking behind your design and approach
> for this cluster?
The main idea of GPII-2544 is to provide reliability and to allow the
database to handle growing read and write traffic without downtime via
horizontal scaling. The work I've done is in this PR:
https://github.com/gpii-ops/gpii-infra/pull/3

This code uses Kubernetes StatefulSets and Services to create CouchDB
containers on demand. The Service allows communication between the
containers in the StatefulSet (the CouchDB instances) and other containers
inside Kubernetes (e.g. the preferences server). With only these two
objects in Kubernetes we have all the CouchDB instances running and
accepting data, but they are not yet configured to work as a cluster. To do
that we added an extra container, "couchdiscover", which joins each new
node to the cluster as the corresponding CouchDB instance starts.

By default, Kubernetes considers a container ready as long as the process
inside it keeps running (as opposed to exiting). But in the case of
CouchDB, the process can be running while the server is not yet ready to
accept data. That's why I added checks that test whether the CouchDB server
returns "ok" from its status URL (introduced in version 2.0). Only when the
"readiness" probe gets an "ok" does Kubernetes add that container to the
load balancer pool. There is also a "liveness" probe -- the check itself is
the same as the readiness one -- which detects whether the container is
still working properly: if it gets a non-"ok" reply, Kubernetes restarts
the container, and the readiness probe keeps it out of the load balancer
pool until it reports "ok" again.

> Which load balancer are you using (ELB or other?) and how does your use of it
> relate to Kubernetes' Ingress (if at all)?

The iptables virtual IP load balancer that Kubernetes uses internally by
default for Service objects. We are not using Ingress controllers because
we are not exposing the CouchDB instances to the Internet.

> What kind of load balancing policies are you employing? To what extent can
> you tailor the load balancing policy to suit the kinds of issues that might
> arise in a CouchDB cluster? How are you monitoring the health of CouchDB, and
> are you using its status endpoint?

The current policy is round-robin load balancing. Kubernetes adds or
removes a node from the list of available endpoints based on the
"readiness" probes. I'm monitoring the health of CouchDB by checking the
"_up" URL of each instance:
http://docs.couchdb.org/en/2.1.1/whatsnew/2.0.html?highlight=_up#version-2-0-0

The problem occurs when a CouchDB instance fails. Kubernetes takes some
time to recognize the failure, but once it is detected the instance is
removed from the list of available endpoints and subsequent requests are
sent to healthy instances. If we need to shut down the CouchDB cluster and
start it again (e.g. for invasive maintenance like restoring data from a
backup/snapshot), we will have a similar situation where the preferences
server is making requests to an endpoint that is in a failed state or
doesn't exist.

The parameters that we use for the readiness/liveness probes are:

  initialDelaySeconds: 5
  timeoutSeconds: 1
  periodSeconds: 1
  successThreshold: 1
  failureThreshold: 1

As you can see, they are set to the minimum allowed -- very aggressive
health checks -- and they still don't completely solve the corner cases I
described in my first message below.
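For reference, this is roughly what the probe configuration looks like in
the Kubernetes manifests. It is a simplified sketch rather than the exact
contents of the PR: the names, labels, image tag and apiVersion are
illustrative, the couchdiscover container is omitted, and the port and
"/_up" path follow the CouchDB 2.x defaults.

apiVersion: v1
kind: Service
metadata:
  name: couchdb
spec:
  selector:
    app: couchdb
  ports:
    - port: 5984              # requests to the Service virtual IP are spread across ready pods
---
apiVersion: apps/v1beta2      # exact apiVersion depends on the Kubernetes release
kind: StatefulSet
metadata:
  name: couchdb
spec:
  serviceName: couchdb
  replicas: 3
  selector:
    matchLabels:
      app: couchdb
  template:
    metadata:
      labels:
        app: couchdb
    spec:
      containers:
        - name: couchdb
          image: couchdb:2.1          # illustrative image/tag
          ports:
            - containerPort: 5984
          readinessProbe:             # pod receives traffic only while GET /_up returns "ok"
            httpGet:
              path: /_up
              port: 5984
            initialDelaySeconds: 5
            timeoutSeconds: 1
            periodSeconds: 1
            successThreshold: 1
            failureThreshold: 1
          livenessProbe:              # same check; a failed probe restarts the container
            httpGet:
              path: /_up
              port: 5984
            initialDelaySeconds: 5
            timeoutSeconds: 1
            periodSeconds: 1
            successThreshold: 1
            failureThreshold: 1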
> What options are available at the load balancer level to redirect to a
> healthy instance if a request fails? What kinds of application failure
> behaviours were you planning and designing for when you set this cluster up?
> Why did you choose this approach?

AFAIK we don't have many additional options with the current strategy; the
only tunable parameters of the internal load balancer are the health checks
listed above. The failure behaviors I considered are the usual operations
in a cluster: adding or removing a node without downtime, and
backup/restore of the entire database, including reinitialization of the
cluster. If you are asking about the approach of running a CouchDB cluster
inside Kubernetes, I think that was a decision made by the entire group.

In addition to the above, I'm discovering some unexpected behaviors of the
CouchDB cluster and pointing out errors that we could see in production.
I'd like to discuss how the entire system should react in order to mitigate
them. The goal is to avoid surprises or panic when these kinds of
situations occur in production.

> I have no way to meaningfully evaluate the work you've done or the issue
> you're encountering, nor to plan an effective architecture approach, without
> greater insight into the thinking and planning that went into this
> implementation.

I hope the above helps. Thanks!

> Thanks very much,
>
> Colin
>
>
>> On Oct 24, 2017, at 8:56 AM, Alfredo Matas <[email protected]>
>> wrote:
>>
>> Hi all,
>>
>> As we have explained in previous architecture meetings, the devops team
>> is working on setting up the GPII cloud using Kubernetes on AWS.
>>
>> The current setup uses a fixed number of CouchDB instances that all work
>> together as a single cluster. Kubernetes, using an AWS service that
>> provides load balancing, sets up an endpoint through which the
>> preferences server can connect to the database. The load balancer
>> redirects each request to one of the instances of the CouchDB cluster at
>> random. This setup allows us to scale the database up and down.
>>
>> Kubernetes also provides a health system that checks that every CouchDB
>> instance is "working properly". The health checks are based on HTTP
>> requests to a particular URL of each instance; if the return code is
>> 2XX, Kubernetes understands that the node is working and keeps it in the
>> pool of choices for handling database requests.
>>
>> After some testing we have found some corner cases where an instance of
>> CouchDB is not working correctly but still receives requests, which fail
>> with error codes or connection timeouts. For example, if a CouchDB
>> process dies, Kubernetes will take some time to realize that the
>> instance has died and to take it out of the pool, because the health
>> checks fire at defined time intervals. Meanwhile, all the reads sent to
>> that instance will fail.
>>
>> A similar issue happens when we scale up the cluster and add new CouchDB
>> instances. As soon as an instance reports ready, Kubernetes puts it in
>> the pool of choices for the load balancer, even though the database is
>> not yet fully synchronized. So the first reads to that instance will
>> fail with 5XX or 4XX errors as well.
>>
>> The preferences server returns 5XX errors when the database is shutting
>> down ("Failing to resume callback for request xxxxxxx-xxxxxx which has
>> already concluded"), returns an empty reply when it is not able to reach
>> the instance, and sometimes the process hangs until the connection is
>> reestablished. We are using a cluster with 3 instances of CouchDB, and
>> we are talking about only a few seconds of drift before the database
>> service is reestablished. Is this the expected behavior? Is it
>> reasonable to have these errors when there is a problem at the database?
>>
>> We know that we have talked before about other kinds of errors (power
>> outages, network connection errors, ...) but we don't remember whether
>> we have discussed something similar to the above. So the devops team
>> would like to hear any concerns or thoughts about these issues in order
>> to take them into account for future deployments.
>>
>> --
>> Alfredo

--
Alfredo
_______________________________________________
Architecture mailing list
[email protected]
https://lists.gpii.net/mailman/listinfo/architecture
