[jira] [Commented] (FLINK-26370) Make Flink cluster communication asynchronous

Gyula Fora (Jira) Tue, 01 Mar 2022 00:30:05 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-26370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499391#comment-17499391
 ]


Gyula Fora commented on FLINK-26370:
------------------------------------

Thank you [~kelemensanyi] for the thorough assessment of the problem.

I think the original motivation for the ticket was that long synchronous calls 
in the main reconcile loop block other operations on a given resource. Since 
these user-triggered operations should not be too frequent, this is not a huge 
pain point in my view. As long as we make sure the operator has enough threads, 
we should be fine (basically option 2).

I think your suggestion for option 3 is interesting, and it represents the 
ideal-world scenario where operations execute in the background and progress 
is tracked through the status of the resource. As you outlined, this is quite 
a complex mechanism with a number of corner cases to guard against, so we have 
to decide together whether the added complexity is worth it.
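
To make the trade-off around option 3 a bit more concrete, here is a minimal 
Java sketch of the idea: the slow cluster call runs on a background executor 
and the reconcile loop only reads a status value. All names here 
(AsyncOperationTracker, OperationStatus, trigger, status) are hypothetical and 
are not taken from the operator code base. The in-memory map only stands in 
for the status of the resource, and none of the corner cases mentioned above 
(operator restarts, timeouts, retries) are handled.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical helper (not operator code): runs slow cluster calls off the reconcile thread. */
class AsyncOperationTracker {

    /** Progress value a reconciler could copy into the status of the resource. */
    enum OperationStatus { NONE, IN_PROGRESS, SUCCEEDED, FAILED }

    private final ExecutorService executor;
    private final Map<String, OperationStatus> statuses = new ConcurrentHashMap<>();

    AsyncOperationTracker(int threads) {
        this.executor = Executors.newFixedThreadPool(threads);
    }

    /**
     * Starts clusterCall in the background unless an operation is already in
     * flight for this resource. Returns true if a new operation was started.
     */
    boolean trigger(String resourceId, Runnable clusterCall) {
        boolean[] started = {false};
        // compute() is atomic per key, so two reconcile passes cannot both start a call.
        statuses.compute(resourceId, (id, current) -> {
            if (current == OperationStatus.IN_PROGRESS) {
                return current;
            }
            started[0] = true;
            return OperationStatus.IN_PROGRESS;
        });
        if (!started[0]) {
            return false;
        }
        CompletableFuture.runAsync(clusterCall, executor)
                // Record the outcome once the call finishes or throws.
                .whenComplete((ignored, error) -> statuses.put(
                        resourceId,
                        error == null ? OperationStatus.SUCCEEDED : OperationStatus.FAILED));
        return true;
    }

    /** Non-blocking lookup that a reconcile pass can call on every iteration. */
    OperationStatus status(String resourceId) {
        return statuses.getOrDefault(resourceId, OperationStatus.NONE);
    }

    void shutdown() {
        executor.shutdown();
    }
}
```

In this sketch a reconcile pass would call trigger(...) once for a savepoint 
or cancel request and then return immediately. Later passes would only call 
status(...) until the operation leaves IN_PROGRESS.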

I would love to hear the opinion of others on this now that we have a good 
description of the problem.

cc [~thw] [~wangyang0918] 

> Make Flink cluster communication asynchronous
> ---------------------------------------------
>
>                 Key: FLINK-26370
>                 URL: https://issues.apache.org/jira/browse/FLINK-26370
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Kubernetes Operator
>            Reporter: Gyula Fora
>            Assignee: Sandor Kelemen
>            Priority: Major
>
> In the current architecture, calls to the Flink clusters (through the REST 
> client) are made synchronously from the reconcile loop. 
> These calls often take a long time for various (completely normal) reasons:
>  - Cluster is not ready -> long call + TimeoutException
>  - Operation takes a long time -> cancel/savepoint operations are often 
> expected to take seconds/minutes
> Both the observer and reconciler components make these calls.
> We should come up with a way to avoid making these sync calls from the main 
> loop while still preserving the logic of the operator.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
