This has also been submitted as an issue on kubernetes/examples:
https://github.com/kubernetes/examples/issues/149#issuecomment-347705608


I'm running Cassandra on AWS (Kubernetes 1.7.4), but I imagine the same applies 
elsewhere. Consider:

3 Cassandra nodes, as defined by the StatefulSet.

cassandra-0 is in AZ (zone) "a"
cassandra-1 is in AZ (zone) "b"
cassandra-2 is in AZ (zone) "c"

AZ stands for Availability Zone (the same as a Google Cloud zone, I'm told). It's 
like a Cassandra rack, although we don't have a Kubernetes-aware Cassandra snitch, 
so to Cassandra it looks like everything is in one rack.

I shut down all Kubernetes nodes in AZ "a" and prevent new nodes from being 
created in AZ "a" (by removing the zone from the AWS Auto Scaling group). 
Therefore cassandra-0 cannot start, as its EBS volume (PV) is tied to AZ "a".
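The zone constraint above can be illustrated with a minimal Python sketch, 
assuming a simplified model in which each pod's PV lives in exactly one AZ and 
the pod can only be scheduled on a node in that AZ (an EBS volume can only be 
attached to an instance in its own AZ). The PV_ZONE map and the 
schedulable_pods() helper are hypothetical, for illustration only; they are not 
a Kubernetes API.

```python
# Hypothetical model: each StatefulSet pod's EBS-backed PV is pinned to one AZ.
PV_ZONE = {
    "cassandra-0": "a",
    "cassandra-1": "b",
    "cassandra-2": "c",
}


def schedulable_pods(pv_zone, zones_with_nodes):
    """Return the pods whose PV zone still has at least one usable node.

    A pod whose PV sits in a zone with no nodes cannot start anywhere,
    because the volume cannot be attached outside its zone.
    """
    return sorted(pod for pod, zone in pv_zone.items() if zone in zones_with_nodes)


if __name__ == "__main__":
    # Zone "a" is shut down and removed from the autoscaling group.
    print(schedulable_pods(PV_ZONE, {"b", "c"}))
```

With zone "a" gone, only cassandra-1 and cassandra-2 have anywhere to run; 
cassandra-0 stays Pending until a node exists in AZ "a" again.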

This simulates 1 Availability Zone (zone) down.

During this time, cassandra-2 fails and needs to be restarted elsewhere. Or its 
host dies. Or autoscaling kills the host node.

A StatefulSet won't self-heal cassandra-2, and it also prevents creating further 
pods (cassandra-3, cassandra-4, etc.).
If we also lose cassandra-1 (maybe its host is gone because of AWS/cloud 
autoscaling), then we have no Cassandra node left. The longer the first zone is 
down, the more we risk having to re-create the other Cassandra nodes 
(infrastructure is elastic and hosts can fail), which won't work until AZ "a" 
comes back online. A 30-minute outage of AZ "a" will be fine, but after that the 
clock is ticking...

From the StatefulSet documentation:
"Before a scaling operation is applied to a Pod, all of its predecessors must 
be Running and Ready."
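The quoted rule, and why it blocks both recovery and scale-up in the scenario 
above, can be modelled with a small Python sketch. It assumes the default 
ordered pod management (OrderedReady) and reduces each pod to "Running and 
Ready" or not; the real controller tracks more states. The next_pod_to_create() 
function and its pod-state dictionary are hypothetical, for illustration only.

```python
def next_pod_to_create(replicas, ready):
    """Return the ordinal the controller would create next, or None if blocked.

    `ready` maps the ordinal of each existing pod to True if it is Running
    and Ready, False otherwise. Ordinals absent from the map have no pod.
    Models the documented rule: before a Pod is created, all of its
    predecessors (lower ordinals) must be Running and Ready.
    """
    for ordinal in range(replicas):
        if ordinal not in ready:
            # Missing pod: every predecessor was ready, so it can be created.
            return ordinal
        if not ready[ordinal]:
            # A predecessor that is not Running and Ready blocks all later ordinals.
            return None
    return None  # all replicas exist and are ready; nothing to do


if __name__ == "__main__":
    # cassandra-0 is Pending (its PV is in the downed zone "a"),
    # cassandra-1 is Ready, cassandra-2 has failed and its pod is gone.
    state = {0: False, 1: True}
    print(next_pod_to_create(3, state))  # None: cassandra-2 is not recreated
    print(next_pod_to_create(5, state))  # None: scaling up is blocked too
```

In this model a single unready pod at ordinal 0 is enough to freeze every 
higher ordinal, which matches the behaviour described above.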

Some questions:

Is that expected behaviour?
Shouldn't we allow Cassandra to scale nodes in other AZs even if the first one 
is down?
Is a StatefulSet the right choice, then? Would a Deployment with replicas work 
better? (As far as I know, we need the seeds to start first; after that we don't 
care about the order.)
Should we use 3 StatefulSets (1 per zone/rack) joining the same cluster? How 
would we do that? How do Cassandra nodes recognise that they are in the same 
cluster? Is there any issue with this approach?

-- 
You received this message because you are subscribed to the Google Groups 
"Kubernetes user discussion and Q&A" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to kubernetes-users+unsubscr...@googlegroups.com.
To post to this group, send email to kubernetes-users@googlegroups.com.
Visit this group at https://groups.google.com/group/kubernetes-users.
For more options, visit https://groups.google.com/d/optout.