[jira] [Commented] (FLINK-31998) Flink Operator Deadlock on run job Failure

Zhenqiu Huang (Jira) Sat, 13 May 2023 21:08:06 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722434#comment-17722434
 ]


Zhenqiu Huang commented on FLINK-31998:
---------------------------------------

[~gyfora] Technically, when a session job is created, it runs on a session 
cluster, which can run multiple jobs in parallel or sequentially. From the 
session job CRD's point of view, however, the cluster-to-job mapping is 1 to 1. 
We probably need to adjust the CRD to decouple the job status from the session 
cluster status.

> Flink Operator Deadlock on run job Failure
> ------------------------------------------
>
>                 Key: FLINK-31998
>                 URL: https://issues.apache.org/jira/browse/FLINK-31998
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.2.0, kubernetes-operator-1.3.0, 
> kubernetes-operator-1.4.0
>            Reporter: Ahmed Hamdy
>            Priority: Major
>             Fix For: kubernetes-operator-1.6.0
>
>         Attachments: gleek-m6pLe3Wy--IpCKQavAQwBQ.png
>
>
> h2. Description
> The Flink Operator reconciler goes into a deadlock situation where it never 
> updates the session job to DEPLOYED/ROLLED_BACK if {{deploy}} fails.
> The attached sequence diagram shows the issue, where the FlinkSessionJob is 
> stuck in UPGRADING indefinitely.
> h2. Proposed fix
> The reconciler should roll back the CR changes if {{deploy()}} fails while 
> {{reconciliationStatus.isBeforeFirstDeployment()}} is true.
> [diagram|https://issues.apache.org/7239bb39-60d8-48a0-9052-f3231947edbe]
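For illustration only, the proposed fix could be sketched roughly as below. This 
is not the operator's actual code: State, SessionJobStatus, SketchReconciler and 
Deployer are hypothetical, heavily simplified stand-ins, and only the 
isBeforeFirstDeployment() and deploy() names come from the issue text. The point 
it tries to show is that a failed deploy() must move the status out of UPGRADING 
rather than leave it there.

```java
/** Hypothetical, simplified reconciliation states (not the operator's real enum). */
enum State { UPGRADING, DEPLOYED, ROLLED_BACK }

/** Hypothetical stand-in for the status section of a FlinkSessionJob resource. */
final class SessionJobStatus {
    State state = State.UPGRADING;
    String lastStableSpec;      // null until a deployment has succeeded
    String lastReconciledSpec;  // spec the reconciler most recently acted on
    String error;               // last deployment error, if any

    boolean isBeforeFirstDeployment() {
        return lastStableSpec == null;
    }
}

final class SketchReconciler {
    /** Hypothetical abstraction over "submit the job to the session cluster". */
    interface Deployer {
        void deploy(String spec) throws Exception;
    }

    private final Deployer deployer;

    SketchReconciler(Deployer deployer) {
        this.deployer = deployer;
    }

    void reconcile(SessionJobStatus status, String desiredSpec) {
        // Record the upgrade attempt before deploying.
        status.state = State.UPGRADING;
        status.lastReconciledSpec = desiredSpec;
        try {
            deployer.deploy(desiredSpec);
            status.state = State.DEPLOYED;
            status.lastStableSpec = desiredSpec;
            status.error = null;
        } catch (Exception e) {
            // Without this branch the status would stay UPGRADING indefinitely.
            // Roll back to the last stable spec (null before the first deployment).
            status.state = State.ROLLED_BACK;
            status.lastReconciledSpec = status.lastStableSpec;
            status.error = e.getMessage();
        }
    }
}
```

With a Deployer that throws, a single reconcile() call leaves the status in 
ROLLED_BACK with the error recorded, so a later reconcile can retry instead of 
waiting on an upgrade that never finishes.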



--
This message was sent by Atlassian Jira
(v8.20.10#820010)