tillrohrmann commented on pull request #15577: URL: https://github.com/apache/flink/pull/15577#issuecomment-819694826
The problem seems to be the following: Since we create the `JobMasterService` lazily, it can happen that the `DispatcherJob` is in the initialized state (`JobManagerRunner` created) but the `JobMasterService` is not running or has not been created yet. If the client now polls `DispatcherJob.requestJobStatus()`, the system calls `JobManagerRunner.getJobMasterGateway().requestJob()`. The future returned by `JobManagerRunner.getJobMasterGateway` might not be completed yet. If the `JobMasterService` creation then fails, the `JobManagerRunnerImpl` completes the `resultFuture`, which leads to the shutdown of the `DispatcherJob` and then also of the `JobManagerRunnerImpl`. Because of this shutdown, the system completes the `leaderGatewayFuture` exceptionally, which causes the initial `DispatcherJob.requestJobStatus` call to fail with `org.apache.flink.util.FlinkException: JobMaster has been shut down.`
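
To make the ordering concrete, here is a minimal sketch of the sequence described above, using only plain `CompletableFuture`s. The `Runner` class, its methods, and the `IllegalStateException` are hypothetical stand-ins, not Flink's actual `JobManagerRunnerImpl` or `FlinkException`; only the field names `leaderGatewayFuture` and `resultFuture` are taken from the description.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

final class LazyGatewayRace {

    // Hypothetical stand-in for the runner; NOT Flink's real JobManagerRunnerImpl.
    static final class Runner {
        // Completed only once the lazily created service is up (a String stands in for the gateway).
        final CompletableFuture<String> leaderGatewayFuture = new CompletableFuture<>();
        // Completed when the job terminates or when service creation fails.
        final CompletableFuture<Void> resultFuture = new CompletableFuture<>();

        void failServiceCreation(Throwable cause) {
            resultFuture.completeExceptionally(cause);
        }

        void shutDown() {
            // Assumption: shutdown fails any still-pending gateway future.
            leaderGatewayFuture.completeExceptionally(
                    new IllegalStateException("JobMaster has been shut down."));
        }
    }

    /** Returns the error message that the polling client ends up seeing. */
    static String pollStatusDuringFailedStartup() {
        Runner runner = new Runner();

        // 1. The client polls the job status while the gateway future is still pending.
        CompletableFuture<String> statusFuture =
                runner.leaderGatewayFuture.thenApply(gateway -> "RUNNING");

        // 2. The owner of the runner reacts to the result future by shutting the runner down.
        runner.resultFuture.whenComplete((ignored, failure) -> runner.shutDown());

        // 3. Service creation fails, which triggers the shutdown registered in step 2.
        runner.failServiceCreation(new RuntimeException("JobMasterService creation failed"));

        try {
            statusFuture.join();
            return "no failure";
        } catch (CompletionException e) {
            // The pending status request fails with the shutdown error.
            return e.getCause().getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(pollStatusDuringFailedStartup());
    }
}
```

In this sketch the pending status request fails with the shutdown error rather than with the original creation failure, which mirrors the `JobMaster has been shut down.` message seen by the client.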
