[jira] [Commented] (FLINK-26719) Rethink the default reschedule reconcile loop

Gyula Fora (Jira) Fri, 18 Mar 2022 03:32:08 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-26719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508697#comment-17508697
 ]


Gyula Fora commented on FLINK-26719:
------------------------------------

I agree that in the ideal case, once we reach a READY deployment state and the 
job is running, we could technically stop periodic reconciliation.

There are a few caveats here which tie into what [~wangyang0918] is suggesting.

How much do we trust that a Flink deployment, once running, will be able to 
self-heal and recover?
If it goes into a crash loop or a broken state, is there anything the operator 
can do anyway?

If we expect to be able to react to broken deployments, then we need frequent 
rechecks to guarantee SLAs. If we do not want to provide stronger resiliency 
guarantees than the Flink native integration offers on its own, then I guess 
we do not need to check, or it is enough to check at larger intervals.

With the current logic, the best we would do is trigger an ERROR event; we 
wouldn't try to "repair" broken deployments. That is still valuable if the user 
is listening to these events, though. I am not sure what alternatives we have 
other than the reconcile loop. Maybe, as [~matyas] said, listening to events or 
informers could be an alternative, but that is still far from an actual 
functional observe loop.
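
To make the trade-off concrete, here is a rough sketch of what a state-dependent 
reschedule delay could look like. It is not operator code: the class, the enum, 
and both interval values are hypothetical and chosen only for illustration. The 
idea is to recheck quickly while something is in flight and fall back to a much 
larger interval once the deployment is READY and the job is running.

```java
import java.time.Duration;

/** Hypothetical sketch; none of these names come from the operator code base. */
final class ReschedulePolicy {

    /** Simplified stand-in for the deployment status discussed in this ticket. */
    enum DeploymentStatus { DEPLOYING, READY, ERROR }

    // Both values are arbitrary assumptions, not operator defaults.
    static final Duration IN_PROGRESS_INTERVAL = Duration.ofSeconds(15);
    static final Duration HEALTH_CHECK_INTERVAL = Duration.ofMinutes(5);

    /**
     * Returns how long to wait before the next reconcile.
     * A short delay is used while a status change or savepoint result is pending;
     * a long delay is used once the deployment is READY and the job is running.
     */
    static Duration nextDelay(DeploymentStatus status,
                              boolean jobRunning,
                              boolean savepointInProgress) {
        boolean steady = status == DeploymentStatus.READY
                && jobRunning
                && !savepointInProgress;
        return steady ? HEALTH_CHECK_INTERVAL : IN_PROGRESS_INTERVAL;
    }

    public static void main(String[] args) {
        // Healthy, running job: only an infrequent health check is scheduled.
        System.out.println(nextDelay(DeploymentStatus.READY, true, false));
        // Savepoint pending: keep polling at the short interval.
        System.out.println(nextDelay(DeploymentStatus.READY, true, true));
    }
}
```

Whether the long interval should exist at all, or be replaced by events or 
informers, is exactly the open question above; the sketch only shows that the 
two cases can be separated.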

> Rethink the default reschedule reconcile loop
> ---------------------------------------------
>
>                 Key: FLINK-26719
>                 URL: https://issues.apache.org/jira/browse/FLINK-26719
>             Project: Flink
>          Issue Type: Sub-task
>            Reporter: Aitozi
>            Priority: Major
>
> When testing locally, I found that the operator reschedules and reconciles 
> every {{operator.reconciler.reschedule.interval.sec}}. Why do we need this? I 
> think we only need to reconcile when
>  # waiting for a status change
>  # receiving a new event
>  # waiting for a savepoint result
> So when JobManagerDeploymentStatus is Ready, we do not have to trigger the 
> reconcile except while waiting for a savepoint result.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
