Re: Surfacing additional issues on agent host to schedulers

Avinash Sridharan Wed, 21 Feb 2018 09:46:19 -0800

On Tue, Feb 20, 2018 at 3:54 PM, James Peach <[email protected]> wrote:


>
> > On Feb 20, 2018, at 11:11 AM, Zhitao Li <[email protected]> wrote:
> >
> > Hi,
> >
> > At a recent Mesos meetup, several cluster operators complained that it is
> > hard to model host issues with Mesos at the moment.
> >
> > For example, in our environment, the only signal a scheduler gets is
> > whether the Mesos agent has disconnected from the cluster. However, we see
> > a family of other issues in production that make hosts (sometimes
> > "partially") unusable. Examples include:
> > - traffic routing software malfunction (e.g., haproxy): the Mesos agent
> >   does not require it, so the scheduler/deployment system is not aware of
> >   the failure, but actual workloads on the cluster will fail;
>
Zhitao, could you elaborate on this a bit more? Do you mean that the workloads
are load-balanced by HAProxy, that a misconfiguration has made them
unreachable, and that the agent should somehow be bubbling up these network
issues? I am guessing that in your case HAProxy provides connectivity to
workloads on a given agent and is actually running on that agent?


> > - broken disk;
> > - other long-running system agent issues.
> >
> > This email asks how Mesos can recommend best practices for surfacing
> > these issues to schedulers, and whether we need additional primitives in
> > Mesos to achieve that goal.
>
> In the K8s world the node can publish "conditions" that describe its status
>
>         https://kubernetes.io/docs/concepts/architecture/nodes/#condition
>
> The condition can automatically taint the node, which could cause pods to
> be evicted automatically (i.e., if they can't tolerate that specific taint).
>
> J
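
To make the condition/taint idea above concrete, here is a rough sketch of
the model, not how Kubernetes or Mesos actually implements it. Every name in
it (Node, Workload, the "condition/..." taint keys, DiskBroken, ProxyDown) is
hypothetical and chosen only for illustration: a host publishes conditions,
each active condition becomes a taint, and any workload that does not
tolerate all of the host's taints becomes a candidate for eviction.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Hypothetical condition names mapped to whether they are currently active.
    conditions: dict = field(default_factory=dict)
    taints: set = field(default_factory=set)


@dataclass
class Workload:
    name: str
    # Taints this workload is willing to run alongside.
    tolerations: frozenset = frozenset()


def apply_conditions(node):
    """Derive the node's taints from whichever conditions are active."""
    node.taints = {
        f"condition/{cond}" for cond, active in node.conditions.items() if active
    }


def evictions(node, workloads):
    """Return workloads that do not tolerate every taint on the node."""
    return [w for w in workloads if not node.taints <= w.tolerations]


# Assumed example: one host with a broken disk but a healthy proxy.
node = Node("agent-1", conditions={"DiskBroken": True, "ProxyDown": False})
apply_conditions(node)

workloads = [
    Workload("batch-job", frozenset({"condition/DiskBroken"})),
    Workload("web-frontend"),
]
evicted = [w.name for w in evictions(node, workloads)]
print(evicted)
```

In this sketch the workload that declared a toleration for the broken-disk
taint stays put, while the one with no tolerations is selected for eviction.
A Mesos equivalent would still need some way for the agent to report such
conditions to the master and on to schedulers, which is the open question in
this thread.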




-- 
Avinash Sridharan, Mesosphere
+1 (323) 702 5245
