wcmolin opened a new issue, #13247: URL: https://github.com/apache/dolphinscheduler/issues/13247
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.

### What happened

When a worker node is stopped, a NullPointerException (NPE) is thrown when the master's fault-tolerance (failover) thread starts.

I think the problematic code is in `org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient`, around line 482:

```
TaskExecutionContext taskExecutionContext = TaskExecutionContextBuilder.get()
        .buildTaskInstanceRelatedInfo(taskInstance)
        .buildProcessInstanceRelatedInfo(processInstance)
        .create();
```

The `processDefineCode` and `processDefineVersion` of `taskInstance` are never assigned here.

log:

```
[INFO] 2022-12-22 09:14:13.969 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[239] - worker group node : /nodes/worker/default/10.66.76.129:1234 down.
[INFO] 2022-12-22 09:14:13.970 org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener:[80] - worker node deleted : /nodes/worker/default/10.66.76.129:1234
[INFO] 2022-12-22 09:14:13.974 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[195] - WORKER node deleted : /nodes/worker/default/10.66.76.129:1234
[INFO] 2022-12-22 09:14:13.978 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[205] - path: /nodes/worker/default/10.66.76.129:1234 not exists
[INFO] 2022-12-22 09:14:14.035 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[377] - start worker[10.66.76.129:1234] failover, task list size:3
[INFO] 2022-12-22 09:14:14.040 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[400] - failover task instance id: 416, process instance id: 231
[ERROR] 2022-12-22 09:14:15.070 org.apache.dolphinscheduler.server.utils.ProcessUtils:[211] - kill yarn job failure
java.lang.NullPointerException: null
    at org.apache.dolphinscheduler.server.utils.ProcessUtils.killYarnJob(ProcessUtils.java:197)
    at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.failoverTaskInstance(MasterRegistryClient.java:496)
    at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.failoverWorker(MasterRegistryClient.java:401)
    at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.failoverServerWhenDown(MasterRegistryClient.java:231)
    at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.removeWorkerNodePath(MasterRegistryClient.java:212)
    at org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener.handleWorkerEvent(MasterRegistryDataListener.java:81)
    at org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener.notify(MasterRegistryDataListener.java:55)
    at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.lambda$subscribe$1(ZookeeperRegistry.java:127)
    at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:760)
    at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:754)
    at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
    at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
    at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
    at org.apache.curator.framework.recipes.cache.TreeCache.callListeners(TreeCache.java:753)
    at org.apache.curator.framework.recipes.cache.TreeCache.access$1900(TreeCache.java:75)
    at org.apache.curator.framework.recipes.cache.TreeCache$4.run(TreeCache.java:865)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
    at java.util.concurrent.FutureTask.run(FutureTask.java)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
[INFO] 2022-12-22 09:14:15.147 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[504] - workflowExecuteThreadNotify is null, just return, task id:416,process id:231
```

### What you expected to happen

No NPE should be thrown.

### How to reproduce

Create a task that requires fault tolerance, then stop the worker server.

### Anything else

_No response_

### Version

2.0.x

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
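The missing assignment described under "What happened" can be illustrated with a small, self-contained sketch. It is not the DolphinScheduler code: `FailoverContextSketch`, `TaskInstance`, `ProcessInstance`, `TaskContext` and `buildContext` are hypothetical stand-ins, and the report does not show which dereference at `ProcessUtils.java:197` actually fails. The sketch assumes the fix is to copy the process-definition code and version from the owning process instance onto the task instance before the context is built, and to fail with a descriptive message instead of a bare NPE if they are still missing.

```java
import java.util.Objects;

/** Hypothetical, simplified stand-ins; NOT the real DolphinScheduler classes. */
final class FailoverContextSketch {

    /** Stand-in for a task instance loaded without its process-definition fields. */
    static final class TaskInstance {
        int id;
        Long processDefineCode;        // stays null until someone assigns it
        Integer processDefineVersion;  // stays null until someone assigns it
    }

    /** Stand-in for the process instance that owns the task. */
    static final class ProcessInstance {
        int id;
        Long processDefinitionCode;
        Integer processDefinitionVersion;
    }

    /** Stand-in for the execution context handed to the failover logic. */
    static final class TaskContext {
        final int taskInstanceId;
        final int processInstanceId;
        final long processDefineCode;
        final int processDefineVersion;

        TaskContext(int taskInstanceId, int processInstanceId, long code, int version) {
            this.taskInstanceId = taskInstanceId;
            this.processInstanceId = processInstanceId;
            this.processDefineCode = code;
            this.processDefineVersion = version;
        }
    }

    /** Fills the missing fields from the process instance, then builds the context. */
    static TaskContext buildContext(TaskInstance task, ProcessInstance process) {
        Objects.requireNonNull(task, "task instance must not be null");
        Objects.requireNonNull(process, "process instance must not be null");

        // Assumed fix: assign the values that the reported code path leaves unset.
        if (task.processDefineCode == null) {
            task.processDefineCode = process.processDefinitionCode;
        }
        if (task.processDefineVersion == null) {
            task.processDefineVersion = process.processDefinitionVersion;
        }

        // Fail early with context rather than letting a null be unboxed later.
        if (task.processDefineCode == null || task.processDefineVersion == null) {
            throw new IllegalStateException(
                    "process definition code/version missing for task instance " + task.id);
        }
        return new TaskContext(task.id, process.id,
                task.processDefineCode, task.processDefineVersion);
    }

    public static void main(String[] args) {
        TaskInstance task = new TaskInstance();
        task.id = 416;                         // id taken from the log above
        ProcessInstance process = new ProcessInstance();
        process.id = 231;                      // id taken from the log above
        process.processDefinitionCode = 7L;    // made-up value for illustration
        process.processDefinitionVersion = 1;  // made-up value for illustration

        TaskContext ctx = buildContext(task, process);
        System.out.println("task=" + ctx.taskInstanceId
                + " process=" + ctx.processInstanceId
                + " code=" + ctx.processDefineCode
                + " version=" + ctx.processDefineVersion);
    }
}
```

Whether the real fix belongs in `MasterRegistryClient` or in `TaskExecutionContextBuilder` is a design choice for the PR. The sketch only shows the shape of the guard.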
