[
https://issues.apache.org/jira/browse/FLINK-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883173#comment-15883173
]
ASF GitHub Bot commented on FLINK-5501:
---------------------------------------
Github user StephanEwen commented on the issue:
https://github.com/apache/flink/pull/3385
With the problem observed above, I think we should change the approach a
bit:
- The registry should return an enum from a single method,
`getJobSchedulingStatus` or so, which can be `PENDING`, `RUNNING`, or `DONE`.
That way there is only one access to the registry, and we avoid the problem
of the internal status changing between checks.
- The file-based registry would create one file for the transition to
`RUNNING` and another for the transition to `DONE`. It is important that the
transition to `DONE` does not remove the file for `RUNNING`. The status check
works backwards: first for the `DONE` file, then for the `RUNNING` file.
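A minimal sketch of the file-based idea described above, not the actual Flink implementation. The class name `FileJobRegistrySketch`, the methods `setJobRunning` and `setJobFinished`, the use of a plain `String` job ID, and the marker file names are all assumptions made for illustration; only `getJobSchedulingStatus` and the three status names come from the comment.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch; names other than getJobSchedulingStatus and the
// three enum constants are invented for this example.
final class FileJobRegistrySketch {

    enum JobSchedulingStatus { PENDING, RUNNING, DONE }

    private final Path dir;

    FileJobRegistrySketch(Path dir) {
        this.dir = dir;
    }

    // Marker file names are an assumption of this sketch.
    private Path runningMarker(String jobId) {
        return dir.resolve("job_" + jobId + ".running");
    }

    private Path doneMarker(String jobId) {
        return dir.resolve("job_" + jobId + ".done");
    }

    /** Records the transition to RUNNING by creating a marker file. */
    void setJobRunning(String jobId) throws IOException {
        createMarker(runningMarker(jobId));
    }

    /** Records the transition to DONE. The RUNNING marker is left in place. */
    void setJobFinished(String jobId) throws IOException {
        createMarker(doneMarker(jobId));
    }

    /** Single access point: checks backwards, DONE first, then RUNNING. */
    JobSchedulingStatus getJobSchedulingStatus(String jobId) {
        if (Files.exists(doneMarker(jobId))) {
            return JobSchedulingStatus.DONE;
        }
        if (Files.exists(runningMarker(jobId))) {
            return JobSchedulingStatus.RUNNING;
        }
        return JobSchedulingStatus.PENDING;
    }

    private static void createMarker(Path marker) throws IOException {
        try {
            Files.createFile(marker);
        } catch (FileAlreadyExistsException e) {
            // Marker already present: the transition was recorded earlier.
        }
    }
}
```

Because markers are only ever added, a job that has started can never be observed as `PENDING`: if the `DONE` file appears between the two checks, the `RUNNING` file is still there and the call returns `RUNNING`, which was a valid status during the call.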
> Determine whether the job starts from last JobManager failure
> -------------------------------------------------------------
>
> Key: FLINK-5501
> URL: https://issues.apache.org/jira/browse/FLINK-5501
> Project: Flink
> Issue Type: Sub-task
> Components: JobManager
> Reporter: zhijiang
> Assignee: shuai.xu
>
> When the {{JobManagerRunner}} is granted leadership, it should check whether the
> current job is already running. If the job is running, the
> {{JobManager}} should reconcile itself (enter the RECONCILING state) and wait
> for the {{TaskManager}} to report task status. Otherwise the {{JobManager}}
> can schedule the {{ExecutionGraph}} in the usual way.
> The {{RunningJobsRegistry}} can provide a way to check the job's running
> status, but we should extend the current interface and fix the related
> process to support this function.
> 1. {{RunningJobsRegistry}} sets the RUNNING status when the {{JobManagerRunner}}
> is granted leadership for the first time.
> 2. If the job finishes, {{RunningJobsRegistry}} sets the job status to FINISHED,
> and the status is deleted before exit.
> 3. If the mini cluster starts multiple {{JobManagerRunner}}s and the leader
> {{JobManagerRunner}} has already finished the job and set its status to FINISHED,
> the other {{JobManagerRunner}}s will exit when they are granted leadership.
> 4. If the {{JobManager}} fails, the job status remains RUNNING. So
> if a {{JobManagerRunner}} (the previous or a new one) is granted leadership
> again, it will check the job status and enter the {{RECONCILING}} state.
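The four steps in the issue description amount to a small decision made on each leadership grant. The following is a rough sketch of that decision only, under stated assumptions: the class `LeadershipDecisionSketch`, the `Action` enum, and both method names are hypothetical, an in-memory map stands in for the real registry, and `DONE` corresponds to the FINISHED status named in the issue.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the decision on leadership grant; not Flink code.
final class LeadershipDecisionSketch {

    enum JobSchedulingStatus { PENDING, RUNNING, DONE }

    /** What the runner does next; invented for this example. */
    enum Action { SCHEDULE, RECONCILE, EXIT }

    // In-memory stand-in for a registry that would survive JobManager failure.
    private final Map<String, JobSchedulingStatus> registry = new ConcurrentHashMap<>();

    Action onLeadershipGranted(String jobId) {
        // One read of the registry, so the status cannot change between checks.
        JobSchedulingStatus status =
                registry.getOrDefault(jobId, JobSchedulingStatus.PENDING);
        switch (status) {
            case DONE:
                // Another runner already finished the job (step 3).
                return Action.EXIT;
            case RUNNING:
                // A previous leader failed mid-job (step 4).
                return Action.RECONCILE;
            default:
                // First grant: mark RUNNING, then schedule normally (step 1).
                // The read-then-write is not atomic here; the sketch assumes
                // leader election allows only one leader at a time.
                registry.put(jobId, JobSchedulingStatus.RUNNING);
                return Action.SCHEDULE;
        }
    }

    void onJobFinished(String jobId) {
        // Step 2: record completion.
        registry.put(jobId, JobSchedulingStatus.DONE);
    }
}
```

In this sketch a first grant returns `SCHEDULE`, a grant after a failure returns `RECONCILE`, and a grant after completion returns `EXIT`.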