Hi all, as we have explained in previous architecture meetings, the devops team is working on setting up the GPII cloud using Kubernetes on AWS.
The current setup uses a fixed number of CouchDB instances that work together as a single cluster. Kubernetes, using an AWS load-balancing service, exposes an endpoint where the preferences server connects to the database, and the load balancer forwards each request to one of the instances of the CouchDB cluster at random. This setup allows us to scale the database up and down. Kubernetes also provides a health-check system to verify that every CouchDB instance is "working properly": it sends HTTP requests to a particular URL on each instance, and if the response code is 2XX, Kubernetes considers the node healthy and keeps it in the pool of instances eligible to serve database requests.

After some testing we have found corner cases where a CouchDB instance is not working correctly but still receives requests, which then fail with error codes or connection timeouts. For example, if a CouchDB process dies, Kubernetes takes some time to notice and remove the instance from the pool, because the health checks only run at fixed intervals; in the meantime, every read sent to that instance fails. A similar issue appears when we scale the cluster up and add new CouchDB instances: as soon as an instance reports ready, Kubernetes puts it in the load balancer pool even though its database is not yet fully synchronized, so the first reads to that instance also fail with 5XX or 4XX errors.

On its side, the preferences server returns 5XX errors when the database is shutting down ("Failing to resume callback for request xxxxxxx-xxxxxx which has already concluded"), returns an empty reply when it cannot reach the instance at all, and sometimes the process hangs until the connection is re-established. We are running a cluster with 3 CouchDB instances, and we are talking about only a few seconds of drift before the database service recovers.

Is this the expected behavior? Is it reasonable to get these errors when there is a problem at the database level? We know we have talked before about other kinds of errors (power outages, network connection problems, ...), but we don't remember discussing anything similar to the above. The devops team would like to hear any concerns or thoughts about these issues so we can take them into account in future deployments.
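For reference, here is a minimal sketch of the kind of readiness probe involved, written with the Kubernetes Python client purely for illustration (the names, image tag and timing values are assumptions, and it presumes CouchDB 2.x, whose /_up endpoint returns 200 once the node itself is up):

    from kubernetes import client

    # Probe CouchDB's /_up endpoint; all values here are illustrative.
    couchdb_probe = client.V1Probe(
        http_get=client.V1HTTPGetAction(path="/_up", port=5984),
        initial_delay_seconds=10,  # give the node time to start
        period_seconds=5,          # how often Kubernetes re-checks
        timeout_seconds=2,
        failure_threshold=2,       # drop the node after 2 consecutive failures
    )

    couchdb_container = client.V1Container(
        name="couchdb",
        image="couchdb:2.3.1",
        ports=[client.V1ContainerPort(container_port=5984)],
        readiness_probe=couchdb_probe,
        # Using the same probe as a liveness check makes Kubernetes restart
        # the container when the CouchDB process dies.
        liveness_probe=couchdb_probe,
    )

Even with aggressive settings there is still a window of roughly period_seconds * failure_threshold during which a dead node keeps receiving traffic, which matches the few seconds of failed requests described above. And as far as we know /_up only reports that the node process is up, not that its shards have finished syncing after a scale-up.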
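On the client side, one way to soften these short windows would be to retry reads that fail with transient errors instead of surfacing them immediately. This is only an illustrative Python sketch of the retry-with-backoff idea, not actual preferences server code, and the status codes and timings are assumptions:

    import time
    import requests

    TRANSIENT = {502, 503, 504}  # treat these as "node briefly unavailable"

    def get_with_retry(url, attempts=4, base_delay=0.5):
        """Retry a read that may hit a CouchDB node during the brief
        window before Kubernetes drops it from the load-balancer pool."""
        last_error = None
        for attempt in range(attempts):
            try:
                resp = requests.get(url, timeout=2)
                if resp.status_code not in TRANSIENT:
                    return resp                # success or a non-transient error
                last_error = RuntimeError(
                    "HTTP %d from %s" % (resp.status_code, url))
            except (requests.ConnectionError, requests.Timeout) as err:
                last_error = err               # node dead or not reachable yet
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
        raise last_error

With 4 attempts and a 0.5 s initial delay this covers about 3.5 s of backoff, in the same ballpark as the few seconds of drift mentioned above; writes would need more care, since retrying them blindly is not safe.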
-- Alfredo