On Tue, Feb 20, 2018 at 3:54 PM, James Peach <jor...@gmail.com> wrote:
>
> > On Feb 20, 2018, at 11:11 AM, Zhitao Li <zhitaoli...@gmail.com> wrote:
> >
> > Hi,
> >
> > In one of recent Mesos meet up, quite a couple of cluster operators had
> > expressed complaints that it is hard to model host issues with Mesos at
> > the moment.
> >
> > For example, in our environment, the only signal scheduler would know is
> > whether Mesos agent has disconnected from the cluster. However, we have a
> > family of other issues in real production which makes the hosts
> > (sometimes "partially") unusable. Examples include:
> > - traffic routing software malfunction (i.e, haproxy): Mesos agent does
> > not require this so scheduler/deployment system is not aware, but actual
> > workload on the cluster will fail;
>

Zhitao, could you elaborate on this a bit more? Do you mean that the
workloads are load-balanced by HAProxy, that a misconfiguration has made
them unreachable, and that the agent should somehow be bubbling up these
network issues? I am guessing that in your case HAProxy provides
connectivity to the workloads on a given agent and is actually running on
that agent?

> > - broken disk;
> > - other long running system agent issues.
> >
> > This email is looking at how can Mesos recommend best practice to surface
> > these issues to scheduler, and whether we need additional primitives in
> > Mesos to achieve such goal.
>
> In the K8s world the node can publish "conditions" that describe its status
>
> https://kubernetes.io/docs/concepts/architecture/nodes/#condition
>
> The condition can automatically taint the node, which could cause pods to
> automatically be evicted (ie. if they can't tolerate that specific taint).
>
> J

-- 
Avinash Sridharan, Mesosphere
+1 (323) 702 5245
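
For readers unfamiliar with the K8s behaviour James describes, here is a rough Python sketch of the idea: node conditions are mapped to taints, and pods that do not tolerate a NoExecute taint become eviction candidates. This is a simplified model, not the Kubernetes implementation or API. The class and function names are made up for illustration, and the condition-to-taint table is an assumed, partial mapping.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str  # "NoSchedule" or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: str
    effect: Optional[str] = None  # None tolerates any effect for this key


# Assumed, partial mapping from (condition type, status) to a taint.
# Illustrative only; the real system has more conditions and rules.
CONDITION_TAINTS: Dict[Tuple[str, str], Taint] = {
    ("Ready", "False"): Taint("node.kubernetes.io/not-ready", "NoExecute"),
    ("DiskPressure", "True"): Taint("node.kubernetes.io/disk-pressure", "NoSchedule"),
}


def taints_for(conditions: Dict[str, str]) -> List[Taint]:
    """Return the taints implied by a node's current conditions."""
    return [
        taint
        for (cond_type, status), taint in CONDITION_TAINTS.items()
        if conditions.get(cond_type) == status
    ]


def tolerates(tolerations: List[Toleration], taint: Taint) -> bool:
    """A pod tolerates a taint if some toleration matches its key and effect."""
    return any(
        t.key == taint.key and t.effect in (None, taint.effect)
        for t in tolerations
    )


def pods_to_evict(pods: Dict[str, List[Toleration]], taints: List[Taint]) -> List[str]:
    """Pods that fail to tolerate at least one NoExecute taint."""
    no_execute = [t for t in taints if t.effect == "NoExecute"]
    return sorted(
        name
        for name, tolerations in pods.items()
        if any(not tolerates(tolerations, taint) for taint in no_execute)
    )


if __name__ == "__main__":
    node_conditions = {"Ready": "False", "DiskPressure": "True"}
    pods = {
        "web": [],
        "monitor": [Toleration("node.kubernetes.io/not-ready")],
    }
    print(pods_to_evict(pods, taints_for(node_conditions)))
```

A Mesos analogue would need the agent (or an external health checker) to publish similar per-host signals so that schedulers can react to them. Whether that calls for a new primitive is the open question in this thread.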