Hi. Sincere apologies for the delay in following up on this. We are now
able to share the details of this incident, which are as follows.

On May 26 around 1:56 PM PT, some or all pods (and the containers running
in them) in some GKE clusters in the us-central1 region were forcibly
restarted. The root cause was packet loss from a temporary GCP networking
problem, which made the GKE masters miss some or all of their nodes'
heartbeat messages for long enough to conclude that the nodes were down.
When the affected nodes became reachable from the master again, the
nodes terminated their pods as expected. The pods that were running on
those nodes, if they were managed by a controller (e.g.
ReplicationController), were rescheduled either after the node was declared
dead (in the case where there were other nodes in the cluster with free
resources) or when the node became reachable again (if there were no other
nodes available to run the replacement pods).
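
To make the sequence above concrete, here is a rough sketch of the decision
logic as described in this report. It is not the actual Kubernetes or GKE
code: the Node type, the function names, the free_slots field, and the
40-second grace period are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical grace period in seconds; the real setting is not given in this report.
GRACE_PERIOD_S = 40.0


@dataclass
class Node:
    name: str
    last_heartbeat: float  # time (seconds) the master last received a heartbeat
    free_slots: int        # simplified stand-in for free resources on the node


def unreachable_nodes(nodes: List[Node], now: float,
                      grace: float = GRACE_PERIOD_S) -> List[Node]:
    """Return the nodes whose last heartbeat is older than the grace period."""
    return [n for n in nodes if now - n.last_heartbeat > grace]


def reschedule_plan(nodes: List[Node], now: float,
                    grace: float = GRACE_PERIOD_S) -> Dict[str, Optional[str]]:
    """Map each node considered down to a node that can take its pods.

    None means no reachable node has free resources, so the replacement
    pods have to wait until the original node becomes reachable again.
    """
    down = unreachable_nodes(nodes, now, grace)
    down_names = {n.name for n in down}
    healthy = [n for n in nodes
               if n.name not in down_names and n.free_slots > 0]
    return {n.name: (healthy[0].name if healthy else None) for n in down}
```

In this simplified model, a correlated failure (every node missing its
heartbeats at once) leaves no target for any pod, which corresponds to the
second case described above.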

Since the incident, we have taken measures to make GKE and Kubernetes more
resilient to correlated node failure (e.g. PR #25571), and are working on
additional protections that will be included in the 1.4 release (see issue
#28832).


On Thu, Jun 9, 2016 at 12:52 PM, Zaar Hai <[email protected]> wrote:

> Thanks for sharing. It's good to know that the problem is being worked on.
> On 8 Jun 2016 02:29, "'Daniel Smith' via Containers at Google" <
> [email protected]> wrote:
>
>> We're aware of this issue and are preparing an incident report.
>>
>> You'll have to wait for that for details about the particular trigger in
>> this case, but #24200
>> <https://github.com/kubernetes/kubernetes/issues/24200> is the basic
>> problem. A partial amelioration
>> <https://github.com/kubernetes/kubernetes/pull/25571> is already in 1.3.
>> At the moment we believe only a single zone in a single region was affected.
>>
>> On Sat, Jun 4, 2016 at 12:46 AM, Zaar Hai <[email protected]> wrote:
>>
>>> We opened a ticket there. I'll update this thread if something
>>> interesting pops up.
>>>
>>> Multi zone k8s will arrive only in 1.4 AFAIK.
>>> On 4 Jun 2016 01:55, "Chris Hiestand" <[email protected]> wrote:
>>>
>>>> Seems like something went down. Even if the master went down, your pods
>>>> shouldn't have, so that is disconcerting. So perhaps there was an outage
>>>> affecting your master and nodes. In my limited experience, smaller outages
>>>> or problems in GCP might not get reported on the cloud status page.
>>>>
>>>> And I imagine these were all in one zone. I wonder if ubernetes lite
>>>> (additional-zone) would have mitigated the problem.
>>>>
>>>> To find out more, you'd probably need to pay for GCP support.
>>>>
>>>> --
>>>> You received this message because you are subscribed to a topic in the
>>>> Google Groups "Containers at Google" group.
>>>> To unsubscribe from this topic, visit
>>>> https://groups.google.com/d/topic/google-containers/AB8MEDiLSik/unsubscribe
>>>> .
>>>> To unsubscribe from this group and all its topics, send an email to
>>>> [email protected].
>>>> To post to this group, send email to [email protected]
>>>> .
>>>> Visit this group at https://groups.google.com/group/google-containers.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "Containers at Google" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/google-containers.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>

