Hi Yingjie,
Thanks for proposing the blacklist! I agree that a blacklist is
important for job maintenance, since some jobs may not be able to fail over
automatically if some of their tasks are always scheduled to problematic hosts
or TMs. This increases the burden on operators, since they need to pay more
attention to the status of the jobs.
I have read the proposal and left some comments. I think one problem is how
we cooperate with external resource managers (like YARN or Mesos) so that they
allocate resources according to our blacklist. If they cannot fully obey
the blacklist, then we may need to deal with the inappropriate resources
ourselves.
Looking forward to further progress on this feature! Thanks again for
the exciting proposal.
Best,
Yun Gao
------------------------------------------------------------------
From:zhijiang <[email protected]>
Send Time:2018 Nov 27 (Tue) 10:40
To:dev <[email protected]>
Subject:Re: [DISCUSS]Enhancing flink scheduler by implementing blacklist
mechanism
Thanks, Yingjie, for starting this discussion.
I have encountered this issue during failover, and I have also seen other
users complain about related issues in the community before.
So it is necessary to first have this mechanism to enhance the scheduling
process, and then enrich the internal rules step by step.
I hope this feature will be available in the next major release. :)
Best,
Zhijiang
------------------------------------------------------------------
From:Till Rohrmann <[email protected]>
Send Time:2018 Nov 5 (Mon) 18:43
To:dev <[email protected]>
Subject:Re: [DISCUSS]Enhancing flink scheduler by implementing blacklist mechanism
Thanks for sharing this design document with the community, Yingjie.
I like the design of passing the job-specific blacklisted TMs as a scheduling
constraint. This makes a lot of sense to me.
Cheers,
Till
On Fri, Nov 2, 2018 at 4:51 PM yingjie <[email protected]> wrote:
> Hi everyone,
>
> This post proposes a blacklist mechanism as an enhancement of the Flink
> scheduler. The motivation is as follows.
>
> In our clusters, jobs occasionally encounter hardware and software
> environment problems, including missing software libraries, bad hardware,
> and resource shortages such as running out of disk space. These problems
> lead to task failures; the failover strategy takes care of that and
> redeploys the relevant tasks. But because of location preference and
> limited total resources, the failed task is scheduled onto the same host
> again, where it fails again and again. The primary cause of this problem
> is the mismatch between task and resource. Currently, the resource
> allocation algorithm does not take this into consideration.
>
> We introduce the blacklist mechanism to solve this problem. The basic idea
> is that when a task fails too many times on some resource, the scheduler
> will no longer assign that resource to the task. We have implemented this
> feature in our internal version of Flink, and it currently works fine.
>
> The following is the design draft. We would really appreciate it if you
> could review it and comment.
>
> https://docs.google.com/document/d/1Qfb_QPd7CLcGT-kJjWSCdO8xFeobSCHF0vNcfiO4Bkw
>
> Best,
> Yingjie
>
>
>
> --
> Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
>
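
The basic idea in the quoted proposal (a task that fails too many times on a resource should no longer be assigned that resource) can be sketched roughly as below. This is only an illustration, not code from the design document or from Flink: the class TaskBlacklist, its methods, and the per-host failure threshold are all hypothetical names and assumptions made for this sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Flink code: tracks failures per (task, host)
// and filters out hosts on which a task has failed too often.
class TaskBlacklist {
    // Assumed threshold: failures on one host before it is blacklisted for a task.
    private final int maxFailuresPerHost;
    // taskId -> (host -> number of failures of that task on that host)
    private final Map<String, Map<String, Integer>> failures = new HashMap<>();

    TaskBlacklist(int maxFailuresPerHost) {
        this.maxFailuresPerHost = maxFailuresPerHost;
    }

    // Record one failure of the given task on the given host.
    void recordFailure(String taskId, String host) {
        failures.computeIfAbsent(taskId, k -> new HashMap<>())
                .merge(host, 1, Integer::sum);
    }

    // A host is blacklisted for a task once the failure count reaches the threshold.
    boolean isBlacklisted(String taskId, String host) {
        Map<String, Integer> perHost = failures.get(taskId);
        if (perHost == null) {
            return false;
        }
        Integer count = perHost.get(host);
        return count != null && count >= maxFailuresPerHost;
    }

    // Return the candidate hosts that are still allowed for this task.
    Set<String> allowedHosts(String taskId, Set<String> candidates) {
        Set<String> allowed = new HashSet<>();
        for (String host : candidates) {
            if (!isBlacklisted(taskId, host)) {
                allowed.add(host);
            }
        }
        return allowed;
    }
}
```

In this sketch the blacklist is kept per task, so a host that is bad for one task remains usable for others. How the real scheduler scopes the blacklist (per task, per job, per TM) and how it is passed to resource managers such as YARN or Mesos is what the design document discusses.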