Gallardot opened a new issue, #14723: URL: https://github.com/apache/dolphinscheduler/issues/14723
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.

### What happened

We found some NullPointerExceptions on the master server.

```
[INFO] 2023-08-09 11:04:21.021 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable:[290] - [WorkflowInstance-789][TaskInstance-789] - Begin to handle state event, TaskStateEvent(processInstanceId=789, taskInstanceId=789, taskCode=0, status=TaskExecutionStatus{code=8, desc='need fault tolerance'}, type=TASK_STATE_CHANGE, key=null, channel=null, context=null)
[INFO] 2023-08-09 11:04:21.021 +0800 org.apache.dolphinscheduler.server.master.event.TaskStateEventHandler:[54] - [WorkflowInstance-789][TaskInstance-789] - Handle task instance state event, the current task instance state NEED_FAULT_TOLERANCE will be changed to NEED_FAULT_TOLERANCE
[INFO] 2023-08-09 11:04:21.022 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable:[393] - [WorkflowInstance-789][TaskInstance-789] - TaskInstance finished task code:10503916731424 state:TaskExecutionStatus{code=8, desc='need fault tolerance'}
[INFO] 2023-08-09 11:04:21.022 +0800 org.apache.dolphinscheduler.server.master.runner.StateWheelExecuteThread:[186] - [WorkflowInstance-789][TaskInstance-789] - remove task instance from timeout check list
[INFO] 2023-08-09 11:04:21.022 +0800 org.apache.dolphinscheduler.server.master.runner.StateWheelExecuteThread:[209] - [WorkflowInstance-789][TaskInstance-789] - remove task instance from retry check list
[INFO] 2023-08-09 11:04:21.022 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable:[414] - [WorkflowInstance-789][TaskInstance-789] - Retry taskInstance taskInstance state: TaskExecutionStatus{code=8, desc='need fault tolerance'}
[WARN] 2023-08-09 11:04:21.022 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable:[1818] - [WorkflowInstance-789][TaskInstance-789] - Task already exists in ready submit queue, no need to add again, task code:10503916731424
[ERROR] 2023-08-09 11:04:21.022 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable:[441] - [WorkflowInstance-789][TaskInstance-789] - Task finish failed, get a exception, will remove this taskInstance from completeTaskSet
java.lang.NullPointerException: id is marked non-null but is null
	at org.apache.dolphinscheduler.dao.repository.BaseDao.queryById(BaseDao.java:41)
	at org.apache.dolphinscheduler.dao.repository.BaseDao$$FastClassBySpringCGLIB$$ca36ec34.invoke(<generated>)
	at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
	at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:137)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
	at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708)
	at org.apache.dolphinscheduler.dao.repository.impl.TaskInstanceDaoImpl$$EnhancerBySpringCGLIB$$4774b4ba.queryById(<generated>)
	at org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable.submitStandByTask(WorkflowExecuteRunnable.java:1906)
	at org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable.retryTaskInstance(WorkflowExecuteRunnable.java:510)
	at org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable.taskFinished(WorkflowExecuteRunnable.java:415)
	at org.apache.dolphinscheduler.server.master.event.TaskStateEventHandler.handleStateEvent(TaskStateEventHandler.java:74)
	at org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable.handleEvents(WorkflowExecuteRunnable.java:291)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
[ERROR] 2023-08-09 11:04:21.022 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable:[315] - [WorkflowInstance-789][TaskInstance-789] - State event handle error, get a unknown exception, will retry this event: TaskStateEvent(processInstanceId=789, taskInstanceId=789, taskCode=0, status=TaskExecutionStatus{code=8, desc='need fault tolerance'}, type=TASK_STATE_CHANGE, key=null, channel=null, context=null)
java.lang.NullPointerException: id is marked non-null but is null
	at org.apache.dolphinscheduler.dao.repository.BaseDao.queryById(BaseDao.java:41)
	at org.apache.dolphinscheduler.dao.repository.BaseDao$$FastClassBySpringCGLIB$$ca36ec34.invoke(<generated>)
	at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
	at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:137)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
	at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708)
	at org.apache.dolphinscheduler.dao.repository.impl.TaskInstanceDaoImpl$$EnhancerBySpringCGLIB$$4774b4ba.queryById(<generated>)
	at org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable.submitStandByTask(WorkflowExecuteRunnable.java:1906)
	at org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable.retryTaskInstance(WorkflowExecuteRunnable.java:510)
	at org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable.taskFinished(WorkflowExecuteRunnable.java:415)
	at org.apache.dolphinscheduler.server.master.event.TaskStateEventHandler.handleStateEvent(TaskStateEventHandler.java:74)
	at org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable.handleEvents(WorkflowExecuteRunnable.java:291)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
```

## note

These logs continue to be generated repeatedly.

### What you expected to happen

No error.

### How to reproduce

A new cluster was deployed on k8s from the dev branch, with MySQL as the database. Servers: 1 master, 1 worker.

The following steps were taken:

1. Created a shell task with the script content 'exit 10086;' and set the task to retry once.
2. Created a corresponding workflow with a serial wait execution policy.
3. Created a scheduled task to execute every minute.

After waiting for the task to execute several times:

1. Restarted the only worker node.
2. The master node continues to generate tasks and detects that the worker node is unavailable, triggering fault-tolerant behavior. The above exception logs are then thrown.
3. After the worker node restarts, the master node continues to generate tasks and the worker node is able to complete new tasks.
4. Both the master and worker nodes appear to be working properly, and the tasks are actually completed. However, the master node continues to throw the above-mentioned logs.

### Anything else

_No response_

### Version

dev

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
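
For readers unfamiliar with the message `id is marked non-null but is null`: its wording matches the null check that Lombok's `@NonNull` generates for a parameter named `id`, so it most likely means `queryById` was called with a null id. The sketch below only reproduces that mechanism in isolation. It assumes `BaseDao.queryById` carries such a check, and every name in it (`NullIdSketch`, `TaskInstance`, `queryIfPersisted`) is hypothetical rather than DolphinScheduler code. It does not identify why the id is null during the retry after fault tolerance.

```java
import java.util.Optional;

// Hypothetical sketch; none of these types are DolphinScheduler classes.
final class NullIdSketch {

    // Stand-in for a task instance whose id may not have been assigned yet.
    static final class TaskInstance {
        final Integer id;      // null when no database id is available (assumption)
        final long taskCode;

        TaskInstance(Integer id, long taskCode) {
            this.id = id;
            this.taskCode = taskCode;
        }
    }

    // Mimics the check Lombok's @NonNull generates for a parameter named "id".
    static String queryById(Integer id) {
        if (id == null) {
            throw new NullPointerException("id is marked non-null but is null");
        }
        return "row-" + id;
    }

    // Defensive variant: skip the lookup entirely when there is no id to query.
    static Optional<String> queryIfPersisted(TaskInstance task) {
        return task.id == null ? Optional.empty() : Optional.of(queryById(task.id));
    }

    public static void main(String[] args) {
        TaskInstance withoutId = new TaskInstance(null, 10503916731424L);
        try {
            queryById(withoutId.id);
        } catch (NullPointerException e) {
            System.out.println(e.getMessage());
        }
        System.out.println(queryIfPersisted(withoutId).isPresent());
        System.out.println(queryIfPersisted(new TaskInstance(789, 10503916731424L)).get());
    }
}
```

A guard like `queryIfPersisted` would only hide the symptom; whether the retry path should ever reach the lookup without an id is the open question in this report.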
