wcmolin opened a new issue, #13247:
URL: https://github.com/apache/dolphinscheduler/issues/13247

   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   When a worker node is stopped, the master's fault-tolerant (failover) thread throws a `NullPointerException` as soon as it starts. I think the problematic code is at line 482 of
   `org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient`:
   ```
   TaskExecutionContext taskExecutionContext = TaskExecutionContextBuilder.get()
           .buildTaskInstanceRelatedInfo(taskInstance)
           .buildProcessInstanceRelatedInfo(processInstance)
           .create();
   ```
   Nothing here assigns the `processDefineCode` and `processDefineVersion` of `taskInstance`, so both are still unset when the context is used later.
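
A minimal sketch of the suspected failure mode, assuming the NPE comes from reading a field that the builder call left unset (the issue does not show what `ProcessUtils.java:197` actually dereferences). `FailoverContextSketch`, `Context`, `missingFields`, `describe`, and `describeSafely` are hypothetical stand-ins written for illustration, not DolphinScheduler classes or methods.

```java
import java.util.ArrayList;
import java.util.List;

public class FailoverContextSketch {

    /** Hypothetical stand-in for a task execution context; NOT the real DolphinScheduler class. */
    static final class Context {
        int taskInstanceId;
        Long processDefineCode;       // boxed, so it stays null until something assigns it
        Integer processDefineVersion; // boxed, same as above
    }

    /** Returns the names of the required fields that are still null. */
    static List<String> missingFields(Context ctx) {
        List<String> missing = new ArrayList<>();
        if (ctx.processDefineCode == null) {
            missing.add("processDefineCode");
        }
        if (ctx.processDefineVersion == null) {
            missing.add("processDefineVersion");
        }
        return missing;
    }

    /** Hypothetical consumer that unboxes both fields without checking them first. */
    static String describe(Context ctx) {
        long code = ctx.processDefineCode;      // throws NullPointerException when the field is null
        int version = ctx.processDefineVersion;
        return code + ":" + version;
    }

    /** Guarded variant: reports the unset fields instead of throwing. */
    static String describeSafely(Context ctx) {
        List<String> missing = missingFields(ctx);
        if (!missing.isEmpty()) {
            return "skipped, missing " + missing;
        }
        return describe(ctx);
    }

    public static void main(String[] args) {
        Context ctx = new Context();
        ctx.taskInstanceId = 416;               // id taken from the log in this issue
        System.out.println(describeSafely(ctx));

        ctx.processDefineCode = 1L;             // illustrative values only
        ctx.processDefineVersion = 2;
        System.out.println(describeSafely(ctx));
    }
}
```

A guard like this only hides the symptom. The fix proposed in this issue is to populate the two fields before the context is built.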
   
   log:
   ```
   
   [INFO] 2022-12-22 09:14:13.969 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[239] - worker group node : /nodes/worker/default/10.66.76.129:1234 down.
   [INFO] 2022-12-22 09:14:13.970 org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener:[80] - worker node deleted : /nodes/worker/default/10.66.76.129:1234
   [INFO] 2022-12-22 09:14:13.974 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[195] - WORKER node deleted : /nodes/worker/default/10.66.76.129:1234
   [INFO] 2022-12-22 09:14:13.978 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[205] - path: /nodes/worker/default/10.66.76.129:1234 not exists
   [INFO] 2022-12-22 09:14:14.035 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[377] - start worker[10.66.76.129:1234] failover, task list size:3
   [INFO] 2022-12-22 09:14:14.040 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[400] - failover task instance id: 416, process instance id: 231
   [ERROR] 2022-12-22 09:14:15.070 org.apache.dolphinscheduler.server.utils.ProcessUtils:[211] - kill yarn job failure
   java.lang.NullPointerException: null
        at org.apache.dolphinscheduler.server.utils.ProcessUtils.killYarnJob(ProcessUtils.java:197)
        at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.failoverTaskInstance(MasterRegistryClient.java:496)
        at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.failoverWorker(MasterRegistryClient.java:401)
        at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.failoverServerWhenDown(MasterRegistryClient.java:231)
        at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.removeWorkerNodePath(MasterRegistryClient.java:212)
        at org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener.handleWorkerEvent(MasterRegistryDataListener.java:81)
        at org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener.notify(MasterRegistryDataListener.java:55)
        at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.lambda$subscribe$1(ZookeeperRegistry.java:127)
        at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:760)
        at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:754)
        at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
        at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
        at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
        at org.apache.curator.framework.recipes.cache.TreeCache.callListeners(TreeCache.java:753)
        at org.apache.curator.framework.recipes.cache.TreeCache.access$1900(TreeCache.java:75)
        at org.apache.curator.framework.recipes.cache.TreeCache$4.run(TreeCache.java:865)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
        at java.util.concurrent.FutureTask.run(FutureTask.java)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
   [INFO] 2022-12-22 09:14:15.147 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[504] - workflowExecuteThreadNotify is null, just return, task id:416,process id:231
   ```
   
   ### What you expected to happen
   
   No `NullPointerException` should be thrown when the master fails over tasks from a stopped worker.
   
   ### How to reproduce
   
   Create a task that requires fault tolerance, then stop the worker server.
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   2.0.x
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

