Hi all,

As we have explained in previous architecture meetings, the devops
team is working on setting up the GPII cloud using Kubernetes on AWS.

The current setup uses a fixed number of CouchDB instances that work
together as a single cluster. Kubernetes, using an AWS service that
provides load balancing, exposes an endpoint where the preferences
server can connect to the database. The load balancer forwards each
request to one of the instances of the CouchDB cluster at random. This
setup allows us to scale the database up and down.
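
To make this concrete, the endpoint is exposed roughly like the
Service sketch below; the names and port are placeholders, not our
exact manifests:

    apiVersion: v1
    kind: Service
    metadata:
      name: couchdb            # placeholder name
    spec:
      type: LoadBalancer       # provisioned as an AWS ELB by Kubernetes
      selector:
        app: couchdb           # matches the CouchDB pods in the cluster
      ports:
        - port: 5984           # CouchDB's default HTTP port
          targetPort: 5984

The preferences server only ever sees the single Service address; the
load balancer spreads requests over whichever pods Kubernetes
currently considers healthy.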

Kubernetes also provides a health checking mechanism that verifies
that every CouchDB instance is "working properly". The health checks
are HTTP requests against a particular URL of each instance; if the
return code is 2XX, Kubernetes considers the node healthy and keeps it
in the pool of instances that can serve database requests.
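
These checks are Kubernetes probes configured more or less like this
(the path and timings are illustrative, not necessarily what we have
deployed):

    readinessProbe:
      httpGet:
        path: /                # any URL that answers 2XX when the node is up
        port: 5984
      initialDelaySeconds: 5
      periodSeconds: 10        # how often the check is fired
      failureThreshold: 3      # consecutive failures before removal from the pool

With settings like these, the worst case between a node failing and
Kubernetes taking it out of the pool is roughly periodSeconds x
failureThreshold, which matches the window of failed requests
described below.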

After some testing we have noticed some corner cases where a CouchDB
instance is not working properly but still receives requests, which
then fail with error codes or connection timeouts. For example, if a
CouchDB process dies, Kubernetes takes some time to realize that the
instance is down and to take it out of the pool, because the health
checks are fired at fixed intervals. Meanwhile, all the reads sent to
that instance fail.

A similar issue happens when we scale the cluster up and add new
CouchDB instances. As soon as an instance answers the health check,
Kubernetes puts it in the load balancer's pool, even though the
database has not finished synchronizing. So the first reads against
that instance fail with 5XX or 4XX errors as well.
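
To make the gap concrete: the current probe passes as soon as
CouchDB's HTTP layer answers on the root URL, which says nothing about
whether the node's data is in sync. A stricter readiness check would
have to target the data itself, something along these lines (the
database name is only a placeholder, and it assumes the probe can read
that database without extra authentication):

    readinessProbe:
      httpGet:
        path: /preferences     # placeholder database; fails until the node can serve it
        port: 5984
      periodSeconds: 10
      failureThreshold: 3

This is only a sketch of the idea, not something we have implemented.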

The preferences server returns 5XX errors when the database is
shutting down ("Failing to resume callback for request
xxxxxxx-xxxxxx which has already concluded"), returns an empty reply
if it is not able to reach the instance, and sometimes the process
hangs until the connection is reestablished. We are using a cluster
with 3 CouchDB instances, and we are talking about only a few seconds
of disruption before the database service is reestablished. Is this
the expected behavior? Is it reasonable to get these errors when there
is a problem at the database?

We know we have talked before about other kinds of errors (power
outages, network connection errors, ...) but we don't remember
discussing anything similar to the above. So the devops team would
like to hear any concerns or thoughts about these issues, in order to
take them into account for future deployments.


-- 
Alfredo
_______________________________________________
Architecture mailing list
[email protected]
https://lists.gpii.net/mailman/listinfo/architecture
