EricJoy2048 opened a new issue, #4042: URL: https://github.com/apache/incubator-seatunnel/issues/4042
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/incubator-seatunnel/issues?q=is%3Aissue+label%3A%22bug%22) and found no similar issues.

### What happened

When the master node of a Zeta cluster goes down, the JobMaster of the running job is lost. A new JobMaster is then created and initialized on the new master node. However, the init method can sometimes throw an exception, in which case the tasks running on the worker nodes are never stopped.

### SeaTunnel Version

dev

### SeaTunnel Config

```conf
ClusterFaultToleranceTwoPipelineIT.testTwoPipelineStreamJobRestoreIn2NodeMasterDown()
```

### Running Command

```shell
ClusterFaultToleranceTwoPipelineIT.testTwoPipelineStreamJobRestoreIn2NodeMasterDown()
```

### Error Exception

```log
2023-02-03T02:29:01.1209875Z 2023-02-03 02:29:01,117 INFO com.hazelcast.internal.partition.InternalPartitionService - [localhost]:5802 [runner_ClusterFaultToleranceTwoPipelineIT_testTwoPipelineStreamJobRestoreIn2NodeMasterDown] [5.1] Applying the most recent of partition state...
2023-02-03T02:29:01.1211100Z 2023-02-03 02:29:01,118 INFO com.hazelcast.internal.partition.impl.MigrationManager - [localhost]:5802 [runner_ClusterFaultToleranceTwoPipelineIT_testTwoPipelineStreamJobRestoreIn2NodeMasterDown] [5.1] Partition balance is ok, no need to repartition.
2023-02-03T02:29:01.1286216Z 2023-02-03 02:29:01,128 INFO org.apache.seatunnel.connectors.seatunnel.fake.source.FakeSourceReader - 200 rows of data have been generated in split(8). Generation time: 1675391340604
2023-02-03T02:29:01.1811639Z 2023-02-03 02:29:01,146 INFO org.apache.seatunnel.engine.server.master.JobMaster - Init JobMaster for Job testTwoPipelineStreamJobRestoreIn2NodeMasterDown (673716497142513665)
2023-02-03T02:29:01.1815926Z 2023-02-03 02:29:01,179 INFO org.apache.seatunnel.engine.server.master.JobMaster - Job testTwoPipelineStreamJobRestoreIn2NodeMasterDown (673716497142513665) needed jar urls []
2023-02-03T02:29:01.2087258Z 2023-02-03 02:29:01,207 ERROR org.apache.seatunnel.engine.server.CoordinatorService - [localhost]:5802 [runner_ClusterFaultToleranceTwoPipelineIT_testTwoPipelineStreamJobRestoreIn2NodeMasterDown] [5.1] org.apache.seatunnel.engine.common.exception.SeaTunnelEngineException: java.util.concurrent.ExecutionException: org.apache.seatunnel.engine.common.exception.SeaTunnelEngineException: Job id 673716497142513665 init JobMaster failed
2023-02-03T02:29:01.2088670Z 	at org.apache.seatunnel.engine.server.CoordinatorService.initCoordinatorService(CoordinatorService.java:195)
2023-02-03T02:29:01.2089465Z 	at org.apache.seatunnel.engine.server.CoordinatorService.checkNewActiveMaster(CoordinatorService.java:276)
2023-02-03T02:29:01.2090018Z 	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
2023-02-03T02:29:01.2090464Z 	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
2023-02-03T02:29:01.2090996Z 	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
2023-02-03T02:29:01.2091573Z 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2023-02-03T02:29:01.2092067Z 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2023-02-03T02:29:01.2092460Z 	at java.base/java.lang.Thread.run(Thread.java:829)
2023-02-03T02:29:01.2093017Z Caused by: java.util.concurrent.ExecutionException: org.apache.seatunnel.engine.common.exception.SeaTunnelEngineException: Job id 673716497142513665 init JobMaster failed
2023-02-03T02:29:01.2093649Z 	at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
2023-02-03T02:29:01.2094120Z 	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999)
2023-02-03T02:29:01.2094679Z 	at org.apache.seatunnel.engine.server.CoordinatorService.initCoordinatorService(CoordinatorService.java:193)
2023-02-03T02:29:01.2095090Z 	... 7 more
2023-02-03T02:29:01.2095501Z Caused by: org.apache.seatunnel.engine.common.exception.SeaTunnelEngineException: Job id 673716497142513665 init JobMaster failed
2023-02-03T02:29:01.2096184Z 	at org.apache.seatunnel.engine.server.CoordinatorService.restoreJobFromMasterActiveSwitch(CoordinatorService.java:220)
2023-02-03T02:29:01.2096866Z 	at org.apache.seatunnel.engine.server.CoordinatorService.lambda$initCoordinatorService$0(CoordinatorService.java:186)
2023-02-03T02:29:01.2097407Z 	at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
2023-02-03T02:29:01.2097726Z 	... 3 more
2023-02-03T02:29:01.2098227Z Caused by: org.apache.seatunnel.engine.checkpoint.storage.exception.CheckpointStorageException: No checkpoint found, job(673716497142513665), pipeline(2), checkpoint(1)
2023-02-03T02:29:01.2098947Z 	at org.apache.seatunnel.engine.checkpoint.storage.hdfs.HdfsStorage.getCheckpoint(HdfsStorage.java:225)
2023-02-03T02:29:01.2099579Z 	at org.apache.seatunnel.engine.server.checkpoint.CheckpointManager.lambda$new$0(CheckpointManager.java:110)
2023-02-03T02:29:01.2100354Z 	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
2023-02-03T02:29:01.2100801Z 	at java.base/java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1693)
2023-02-03T02:29:01.2101248Z 	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
2023-02-03T02:29:01.2101716Z 	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
2023-02-03T02:29:01.2102247Z 	at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:952)
2023-02-03T02:29:01.2102647Z 	at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:926)
2023-02-03T02:29:01.2103049Z 	at java.base/java.util.stream.AbstractTask.compute(AbstractTask.java:327)
2023-02-03T02:29:01.2103478Z 	at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:746)
2023-02-03T02:29:01.2103913Z 	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
2023-02-03T02:29:01.2104353Z 	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.helpCC(ForkJoinPool.java:1115)
2023-02-03T02:29:01.2104827Z 	at java.base/java.util.concurrent.ForkJoinPool.externalHelpComplete(ForkJoinPool.java:1957)
2023-02-03T02:29:01.2105318Z 	at java.base/java.util.concurrent.ForkJoinTask.tryExternalHelp(ForkJoinTask.java:378)
2023-02-03T02:29:01.2105796Z 	at java.base/java.util.concurrent.ForkJoinTask.externalAwaitDone(ForkJoinTask.java:323)
2023-02-03T02:29:01.2106256Z 	at java.base/java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:412)
2023-02-03T02:29:01.2106682Z 	at java.base/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:736)
2023-02-03T02:29:01.2107153Z 	at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:919)
2023-02-03T02:29:01.2107597Z 	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
2023-02-03T02:29:01.2108050Z 	at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
2023-02-03T02:29:01.2108575Z 	at org.apache.seatunnel.engine.server.checkpoint.CheckpointManager.<init>(CheckpointManager.java:124)
2023-02-03T02:29:01.2109108Z 	at org.apache.seatunnel.engine.server.master.JobMaster.init(JobMaster.java:184)
2023-02-03T02:29:01.2109729Z 	at org.apache.seatunnel.engine.server.CoordinatorService.restoreJobFromMasterActiveSwitch(CoordinatorService.java:218)
2023-02-03T02:29:01.2110178Z 	... 5 more
```

### Flink or Spark Version

_No response_

### Java or Scala Version

_No response_

### Screenshots

_No response_

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
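The failure mode above can be illustrated with a minimal sketch: if JobMaster initialization throws during a master failover, the coordinator should still stop that job's tasks on the worker nodes rather than leave them orphaned. All class and method names below are hypothetical, not the actual SeaTunnel API; this only models the cleanup the report argues is missing.

```java
import java.util.ArrayList;
import java.util.List;

public class OrphanTaskDemo {

    // Stand-in for a task running on a worker node (hypothetical, not SeaTunnel's TaskGroup).
    static class WorkerTask {
        final long jobId;
        boolean running = true;

        WorkerTask(long jobId) {
            this.jobId = jobId;
        }

        void stop() {
            running = false;
        }
    }

    // Stand-in for the coordinator on the newly elected master node.
    static class Coordinator {
        final List<WorkerTask> workerTasks = new ArrayList<>();

        // Restore a job after a master switch. The key point: init failure
        // must still release the job's worker tasks.
        void restoreJob(long jobId) {
            try {
                initJobMaster(jobId); // may throw, e.g. when no checkpoint is found
            } catch (Exception e) {
                // Without this cleanup, the worker tasks run forever (the reported bug).
                for (WorkerTask t : workerTasks) {
                    if (t.jobId == jobId) {
                        t.stop();
                    }
                }
            }
        }

        // Simulates JobMaster.init() failing, as in the CheckpointStorageException above.
        void initJobMaster(long jobId) {
            throw new IllegalStateException("No checkpoint found, job(" + jobId + ")");
        }
    }

    public static void main(String[] args) {
        Coordinator coordinator = new Coordinator();
        coordinator.workerTasks.add(new WorkerTask(673716497142513665L));
        coordinator.restoreJob(673716497142513665L);
        // With the cleanup in place the task is stopped instead of orphaned.
        System.out.println(coordinator.workerTasks.get(0).running); // prints "false"
    }
}
```

The sketch keeps the cleanup inside the catch block on the restore path, so a failed init and a successful init converge on the same invariant: no task for the job keeps running without a live JobMaster.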
