Hi Lijie,

I think this makes sense and +1 to only support manually blocking
taskmanagers and nodes. Maybe the different strategies can also be
maintained outside of Apache Flink.

A few remarks:

1) Can we use another term than "blacklist" due to the controversy around
the term? [1] There was also a Jira ticket about this topic a while back
and there was generally a consensus to avoid the terms blacklist & whitelist
[2]. We could use "blocklist", "denylist" or "quarantined".
2) For the REST API, I'd prefer a slightly different design, as verbs like
add/remove are often considered an anti-pattern for REST APIs. POST on the
list resource is generally the standard way to add an item. DELETE on the
individual resource is the standard way to remove an item.

POST <host>/quarantine/items
DELETE <host>/quarantine/items/<itemidentifier>

We could also consider separating taskmanagers and nodes in the REST API
(and internal data structures). Any opinion on this?

POST/GET <host>/quarantine/nodes
POST/GET <host>/quarantine/taskmanager
DELETE <host>/quarantine/nodes/<identifier>
DELETE <host>/quarantine/taskmanager/<identifier>

3) How would blocking nodes behave with non-active resource managers, i.e.
standalone or reactive mode?

4) To keep the implementation even more minimal, do we need the timeout
behavior? If items are added/removed manually we could delegate this to the
user easily. In my opinion the timeout behavior would better fit into
specific strategies at a later point.
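
To make the layout in 2) a bit more concrete, here is a minimal in-memory
sketch of how the split endpoints could behave. It is not Flink code: the
class name QuarantineRegistry, the handle() helper and the status codes
(201/204/400/404/405) are my own assumptions for illustration, and only the
paths are taken from the proposal above.

```python
from typing import Optional, Tuple

# Hypothetical model of the proposed routes; only the paths come from the mail.
KINDS = ("nodes", "taskmanager")


class QuarantineRegistry:
    """Keeps one set of quarantined identifiers per resource kind."""

    def __init__(self) -> None:
        self._items = {kind: set() for kind in KINDS}

    def handle(
        self, method: str, path: str, identifier: Optional[str] = None
    ) -> Tuple[int, object]:
        """Dispatch a request and return (assumed status code, body)."""
        parts = [p for p in path.split("/") if p]
        if len(parts) < 2 or parts[0] != "quarantine" or parts[1] not in KINDS:
            return 404, None
        items = self._items[parts[1]]

        if len(parts) == 2:  # collection: /quarantine/<kind>
            if method == "GET":
                return 200, sorted(items)
            if method == "POST":
                if not identifier:
                    return 400, None
                items.add(identifier)
                return 201, identifier
            return 405, None

        if len(parts) == 3 and method == "DELETE":  # item: /quarantine/<kind>/<id>
            if parts[2] in items:
                items.remove(parts[2])
                return 204, None
            return 404, None
        return 405, None


registry = QuarantineRegistry()
registry.handle("POST", "/quarantine/nodes", "node-1")
registry.handle("DELETE", "/quarantine/nodes/node-1")
```

Note that no verb appears in any path; adding and removing are expressed
purely through the HTTP method.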

Looking forward to your thoughts.

Cheers and thank you,

Konstantin

[1]
https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
[2] https://issues.apache.org/jira/browse/FLINK-18209

On Wed, 27 Apr 2022 at 04:04, Lijie Wang <
wangdachui9...@gmail.com> wrote:

> Hi all,
>
> Flink job failures may happen due to cluster node issues (insufficient disk
> space, bad hardware, network abnormalities). Flink will take care of the
> failures and redeploy the tasks. However, due to data locality and limited
> resources, the new tasks are very likely to be redeployed to the same
> nodes, which will result in continuous task abnormalities and affect job
> progress.
>
> Currently, Flink users need to manually identify the problematic node and
> take it offline to solve this problem. But this approach has the following
> disadvantages:
>
> 1. Taking a node offline can be a heavy process. Users may need to contact
> cluster administrators to do this. The operation can even be dangerous and not
> allowed during some important business events.
>
> 2. Identifying and solving this kind of problem manually would be slow and
> a waste of human resources.
>
> To solve this problem, Zhu Zhu and I propose to introduce a blacklist
> mechanism for Flink to filter out problematic resources.
>
>
> You can find more details in FLIP-224[1]. Looking forward to your feedback.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
>
>
> Best,
>
> Lijie
>
