On Fri, Oct 27, 2017 at 2:17 PM, David Rosenstrauch <dar...@darose.net> wrote:
> I'm trying to make sure that, as I'm deploying new services on our
> cluster, failures/restarts get handled in a way that's optimal for
> resiliency/uptime.
>
> I'm simplifying things a bit, but if a piece of code running inside a
> container crashes, there are more or less two possibilities: 1) a bug in
> the code (and/or it's trying to process data that causes an error), or
> 2) a problem with the hardware/network (full disk, bad disk, network
> outage, etc.). If the issue is #1, then it doesn't matter whether you
> restart the container or the pod. But if the issue is #2, then
> restarting the pod (i.e., on another host) would fix the problem, while
> restarting the container probably wouldn't.
We automatically detect things like full disks and network outages and
remove or repair those nodes. Are you working around a known problem or a
hypothetical?

> So I guess this is sort of alluding to a bigger question, then: does k8s
> have any ability to detect if a host is having hardware problems and, if
> so, avoid scheduling new pods on it, move pods off of it if their
> containers are crashing, etc.?

Yes. Node Problem Detector detects a number of issues and responds. GKE's
NodeAutoRepair will automatically rebuild nodes when it detects problems.

> I've done a lot of work with big data systems previously and, IIRC,
> Hadoop (for example) used to employ procedures to detect if a disk was
> bad, if many tasks on a particular node kept crashing, etc., and it
> would start to blacklist those. My thinking was that k8s worked
> similarly - i.e., if all containers in a pod terminated unsuccessfully,
> then terminate the pod; if a particular node is having many pods
> terminated unsuccessfully, then stop launching new pods on it, etc.
> Perhaps I'm misunderstanding / assuming incorrectly, though.

We probably should have a crash-loop mode that kills the pod and lets the
scheduler re-assess. AFAIK we don't do that today, but it hasn't been a
huge problem.
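To make that idea concrete, here is a rough sketch of what such a mode
could look like if bolted on from outside today: a CronJob that deletes any
pod stuck in CrashLoopBackOff, so its ReplicaSet creates a replacement and
the scheduler re-assesses placement. This is NOT built-in behavior; the
name, image, and service account below are all made up, and the service
account would need RBAC permission to list and delete pods.

    # Hypothetical "crash-loop reaper" (not a Kubernetes feature):
    # every 5 minutes, delete any pod reporting CrashLoopBackOff so its
    # controller replaces it and the scheduler picks a node afresh.
    apiVersion: batch/v1beta1
    kind: CronJob
    metadata:
      name: crashloop-reaper
    spec:
      schedule: "*/5 * * * *"
      jobTemplate:
        spec:
          template:
            spec:
              serviceAccountName: crashloop-reaper   # hypothetical; needs list/delete pods
              restartPolicy: OnFailure
              containers:
              - name: reaper
                image: example.com/kubectl:1.0   # placeholder: any image with kubectl
                command:
                - /bin/sh
                - -c
                # Grep the human-readable STATUS column for CrashLoopBackOff,
                # then delete each matching pod by namespace and name.
                - >
                  kubectl get pods --all-namespaces --no-headers |
                  grep CrashLoopBackOff |
                  while read ns pod rest; do
                  kubectl delete pod -n "$ns" "$pod";
                  done

Whether that's wise is another question - deleting the pod throws away the
back-off state and the node-local logs, which is part of why the in-place
restart is the default.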
> Thanks,
>
> DR
>
> On 2017-10-27 4:35 pm, 'Tim Hockin' via Kubernetes user discussion and
> Q&A wrote:
>> What Rodrigo said - what problem are you trying to solve?
>>
>> The pod lifecycle is defined as restart-in-place today. Nothing you can
>> do inside your pod, short of deleting it from the apiserver, will do
>> what you're asking. It doesn't seem too far-fetched that a pod could
>> exit and "ask for a different node", but we're not going there without
>> a solid, solid, solid use case.
>>
>> On Fri, Oct 27, 2017 at 1:23 PM, Rodrigo Campos <rodrig...@gmail.com>
>> wrote:
>>> I don't think it is configurable.
>>>
>>> But I don't really see what you are trying to solve; maybe there is
>>> another way to achieve it? If you are running a pod with a single
>>> container, what is the problem with the container being restarted
>>> when appropriate, instead of the whole pod?
>>>
>>> I mean, you would need to handle the case where some container in the
>>> pod crashed or stalled, right? The liveness probe runs periodically,
>>> but until the next check, the container can be hung or something.
>>> That problem exists even if the whole pod is restarted, so restarting
>>> the whole pod won't solve it. So my guess about what you are trying
>>> to solve is probably not correct.
>>>
>>> So, sorry, but can I ask again what is the problem you want to
>>> address? :)
>>>
>>> On Friday, October 27, 2017, David Rosenstrauch <dar...@darose.net>
>>> wrote:
>>>> Was speaking to our admin here, and he offered that running a
>>>> health-check container inside the same pod might work. Anyone agree
>>>> that that would be a good (or even preferred) approach?
>>>>
>>>> Thanks,
>>>>
>>>> DR
>>>>
>>>> On 2017-10-27 11:41 am, David Rosenstrauch wrote:
>>>>> I have a pod which runs a single container. The pod is being run
>>>>> under a ReplicaSet (which starts a new pod to replace a pod that's
>>>>> terminated).
>>>>>
>>>>> What I'm seeing is that when the container within that pod
>>>>> terminates, instead of the pod terminating too, the pod stays
>>>>> alive and just restarts the container in it. However, I'm thinking
>>>>> that what would make more sense would be for the entire pod to
>>>>> terminate in this situation, and then another would automatically
>>>>> start to replace it.
>>>>>
>>>>> Does this seem sensible? If so, how would one accomplish this with
>>>>> k8s? Changing the restart policy setting doesn't seem to be an
>>>>> option. The restart policy (e.g. Restart=Always) seems to apply
>>>>> only to whether to restart a pod; the decision about whether to
>>>>> restart a container in a pod doesn't seem to be configurable. (At
>>>>> least not that I could see.)
>>>>>
>>>>> Would appreciate any guidance anyone could offer here.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> DR
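For anyone landing on this thread later: restartPolicy is a pod-level
field that tells the kubelet how to restart the pod's containers in place;
it never moves the pod to another node, and pods managed by a ReplicaSet
must use Always. A minimal sketch (the image name is a placeholder),
including the periodic liveness probe Rodrigo mentions:

    apiVersion: v1
    kind: Pod
    metadata:
      name: single-container-pod
    spec:
      # Pod-level field; applies to all containers in the pod. The kubelet
      # restarts containers in place on the same node. Valid values are
      # Always | OnFailure | Never; ReplicaSet pods must use Always.
      restartPolicy: Always
      containers:
      - name: app
        image: example.com/my-app:1.0   # placeholder image
        ports:
        - containerPort: 8080
        # Rodrigo's point: the probe only runs every periodSeconds, so a
        # hang can go unnoticed until the next check, and a failure
        # restarts only this container, not the pod.
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 10
          failureThreshold: 3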
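On the health-check-container idea DR's admin suggested: a sidecar can
probe the app over the pod's shared localhost network, but if the sidecar
merely exits, the kubelet just restarts the sidecar. To get the
"replace the whole pod" behavior, the sidecar has to delete its own pod
through the API so the ReplicaSet creates a fresh one, possibly on another
node. A rough, hypothetical sketch - every name here is made up, and the
service account needs RBAC permission to delete pods:

    # Hypothetical watchdog sidecar: probes the app and deletes its own pod
    # when the app is unhealthy, so the ReplicaSet schedules a replacement.
    apiVersion: v1
    kind: Pod
    metadata:
      name: app-with-watchdog
    spec:
      serviceAccountName: pod-self-deleter   # hypothetical; must allow "delete pods"
      containers:
      - name: app
        image: example.com/my-app:1.0        # placeholder image
        ports:
        - containerPort: 8080
      - name: watchdog
        image: example.com/shell-tools:1.0   # placeholder: needs sh, wget, kubectl
        env:
        # Downward API: let the container learn its own pod name/namespace.
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        command:
        - /bin/sh
        - -c
        # Containers in a pod share the network namespace, so the app is
        # reachable on localhost; on failure, delete the whole pod via the API.
        - >
          while true; do
          if ! wget -q -O /dev/null http://localhost:8080/healthz; then
          kubectl delete pod -n "$POD_NAMESPACE" "$POD_NAME";
          fi;
          sleep 10;
          done

Note the trade-off the thread already hints at: this reintroduces the same
polling window as a liveness probe, so it only changes where the restart
happens, not how quickly a hang is detected.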