Thanks for driving this, Zhu and Lijie.

+1 for the overall proposal. Just sharing my two cents here:

- Why do we need to expose
`cluster.resource-blacklist.item.timeout-check-interval` to the user?
I think the semantics of `cluster.resource-blacklist.item.timeout` are
sufficient for the user. How the timeout is enforced is an internal
implementation detail of Flink. Exposing the check interval would be
confusing, and I do not think users need it.

- The ResourceManager could also report task manager exceptions to the
`BlacklistHandler`.
For example, a slot allocation might fail because the target task
manager is busy or suffers from network jitter. I don't mean we need to
cover this case in this version, but we could also expose a
`notifyException` method in `ResourceManagerBlacklistHandler`.

- Before the blocklist is synced to the ResourceManager, will the slots
of a blocked task manager continue to be released and allocated?
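
To make the first two points more concrete, here is a minimal Java sketch.
All names in it (`SketchBlocklistHandler`, the failure threshold, the
explicit `nowMs` parameter) are assumptions for illustration only and are
not taken from FLIP-224 or the Flink code base. It shows a handler that
accepts `notifyException` calls and enforces the item timeout lazily on
lookup, so that no check interval would have to be configured by the user.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; not the FLIP-224 design and not a Flink API. */
final class SketchBlocklistHandler {
    private final long itemTimeoutMs;
    private final int failureThreshold; // assumed policy, not in the proposal
    private final Map<String, Integer> failureCounts = new HashMap<>();
    private final Map<String, Long> blockedUntilMs = new HashMap<>();

    SketchBlocklistHandler(long itemTimeoutMs, int failureThreshold) {
        this.itemTimeoutMs = itemTimeoutMs;
        this.failureThreshold = failureThreshold;
    }

    /**
     * Could be called when an operation against a task manager fails,
     * e.g. a slot allocation. The cause is ignored in this sketch.
     */
    void notifyException(String taskManagerId, Throwable cause, long nowMs) {
        int count = failureCounts.merge(taskManagerId, 1, Integer::sum);
        if (count >= failureThreshold) {
            failureCounts.remove(taskManagerId);
            blockedUntilMs.put(taskManagerId, saturatingAdd(nowMs, itemTimeoutMs));
        }
    }

    /** Expired items are dropped lazily here; no periodic check is needed. */
    boolean isBlocked(String taskManagerId, long nowMs) {
        Long deadline = blockedUntilMs.get(taskManagerId);
        if (deadline == null) {
            return false;
        }
        if (nowMs >= deadline) {
            blockedUntilMs.remove(taskManagerId);
            return false;
        }
        return true;
    }

    /** Assumes non-negative inputs; a timeout of Long.MAX_VALUE never expires. */
    private static long saturatingAdd(long a, long b) {
        long sum = a + b;
        return sum < 0 ? Long.MAX_VALUE : sum;
    }
}
```

With a timeout of 1000 ms and a threshold of 2, a task manager that fails
twice would be reported as blocked until 1000 ms after the second failure
and as available again afterwards, without any background check running.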

Best,
Yangze Guo

On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <wangdachui9...@gmail.com> wrote:
>
> Hi Konstantin,
>
> Thanks for your feedback. I will respond to your 4 remarks:
>
>
> 1) Thanks for reminding me of the controversy. I think “BlockList” is good
> enough, and I will change it in the FLIP.
>
>
> 2) Your suggestion for the REST API is a good idea. Based on the above, I
> would change the REST API as follows:
>
> POST/GET <host>/blocklist/nodes
>
> POST/GET <host>/blocklist/taskmanagers
>
> DELETE <host>/blocklist/node/<identifier>
>
> DELETE <host>/blocklist/taskmanager/<identifier>
>
>
> 3) If a node is blocklisted, all task managers on this node are
> blocklisted as well, and all slots on these TMs become unavailable. This is
> a bit like losing the TMs, except that they are not really lost: they are
> in an unavailable state and are still registered in the Flink cluster. They
> will be available again once the corresponding blocklist item is removed.
> This behavior is the same in active and non-active clusters. However, in
> active clusters, these TMs may be released due to idle timeouts.
>
>
> 4) For the item timeout, I prefer to keep it. The reasons are as follows:
>
> a) The timeout does not affect users adding or removing items via the REST
> API, and users can disable it by configuring it to Long.MAX_VALUE.
>
> b) Some node problems can recover after a period of time (such as machine
> hotspots), in which case users may prefer that Flink can do this
> automatically instead of requiring the user to do it manually.
>
>
> Best,
>
> Lijie
>
> On Wed, Apr 27, 2022 at 7:23 PM Konstantin Knauf <kna...@apache.org> wrote:
>
> > Hi Lijie,
> >
> > I think this makes sense, and +1 to only supporting manual blocking of
> > taskmanagers and nodes. Maybe the different strategies can also be
> > maintained outside of Apache Flink.
> >
> > A few remarks:
> >
> > 1) Can we use another term than "blacklist" due to the controversy around
> > the term? [1] There was also a Jira ticket about this topic a while back,
> > and there was generally a consensus to avoid the terms blacklist & whitelist
> > [2]. We could use "blocklist", "denylist", or "quarantined".
> > 2) For the REST API, I'd prefer a slightly different design, as verbs like
> > add/remove are often considered an anti-pattern for REST APIs. POST on a
> > list is generally the standard way to add items. DELETE on the individual
> > resource is the standard way to remove an item.
> >
> > POST <host>/quarantine/items
> > DELETE <host>/quarantine/items/<itemidentifier>
> >
> > We could also consider separating taskmanagers and nodes in the REST API
> > (and internal data structures). Any opinion on this?
> >
> > POST/GET <host>/quarantine/nodes
> > POST/GET <host>/quarantine/taskmanager
> > DELETE <host>/quarantine/nodes/<identifier>
> > DELETE <host>/quarantine/taskmanager/<identifier>
> >
> > 3) How would blocking nodes behave with non-active resource managers, i.e.
> > standalone or reactive mode?
> >
> > 4) To keep the implementation even more minimal, do we need the timeout
> > behavior? If items are added/removed manually we could delegate this to the
> > user easily. In my opinion the timeout behavior would better fit into
> > specific strategies at a later point.
> >
> > Looking forward to your thoughts.
> >
> > Cheers and thank you,
> >
> > Konstantin
> >
> > [1]
> >
> > https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> > [2] https://issues.apache.org/jira/browse/FLINK-18209
> >
> > On Wed, Apr 27, 2022 at 4:04 AM Lijie Wang <
> > wangdachui9...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > Flink job failures may happen due to cluster node issues (insufficient disk
> > > space, bad hardware, network abnormalities). Flink will take care of the
> > > failures and redeploy the tasks. However, due to data locality and limited
> > > resources, the new tasks are very likely to be redeployed to the same
> > > nodes, which will result in continuous task abnormalities and affect job
> > > progress.
> > >
> > > Currently, Flink users need to manually identify the problematic node and
> > > take it offline to solve this problem. But this approach has the following
> > > disadvantages:
> > >
> > > 1. Taking a node offline can be a heavy process. Users may need to contact
> > > cluster administrators to do this. The operation can even be dangerous and
> > > not allowed during some important business events.
> > >
> > > 2. Identifying and solving this kind of problem manually would be slow and
> > > a waste of human resources.
> > >
> > > To solve this problem, Zhu Zhu and I propose to introduce a blacklist
> > > mechanism for Flink to filter out problematic resources.
> > >
> > >
> > > You can find more details in FLIP-224[1]. Looking forward to your feedback.
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> > >
> > >
> > > Best,
> > >
> > > Lijie
> > >
> >
