[GitHub] [dolphinscheduler] Gallardot opened a new issue, #14723: [Bug] [Master] Master NPE after a worker restart

via GitHub Tue, 08 Aug 2023 20:18:19 -0700


Gallardot opened a new issue, #14723:
URL: https://github.com/apache/dolphinscheduler/issues/14723


   ### Search before asking
   
   - [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.
   
   
   ### What happened
   
   We found some NullPointerExceptions on the master server.
   
   ```
   [INFO] 2023-08-09 11:04:21.021 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable:[290] - [WorkflowInstance-789][TaskInstance-789] - Begin to handle state event, TaskStateEvent(processInstanceId=789, taskInstanceId=789, taskCode=0, status=TaskExecutionStatus{code=8, desc='need fault tolerance'}, type=TASK_STATE_CHANGE, key=null, channel=null, context=null)
   [INFO] 2023-08-09 11:04:21.021 +0800 org.apache.dolphinscheduler.server.master.event.TaskStateEventHandler:[54] - [WorkflowInstance-789][TaskInstance-789] - Handle task instance state event, the current task instance state NEED_FAULT_TOLERANCE will be changed to NEED_FAULT_TOLERANCE
   [INFO] 2023-08-09 11:04:21.022 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable:[393] - [WorkflowInstance-789][TaskInstance-789] - TaskInstance finished task code:10503916731424 state:TaskExecutionStatus{code=8, desc='need fault tolerance'}
   [INFO] 2023-08-09 11:04:21.022 +0800 org.apache.dolphinscheduler.server.master.runner.StateWheelExecuteThread:[186] - [WorkflowInstance-789][TaskInstance-789] - remove task instance from timeout check list
   [INFO] 2023-08-09 11:04:21.022 +0800 org.apache.dolphinscheduler.server.master.runner.StateWheelExecuteThread:[209] - [WorkflowInstance-789][TaskInstance-789] - remove task instance from retry check list
   [INFO] 2023-08-09 11:04:21.022 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable:[414] - [WorkflowInstance-789][TaskInstance-789] - Retry taskInstance taskInstance state: TaskExecutionStatus{code=8, desc='need fault tolerance'}
   [WARN] 2023-08-09 11:04:21.022 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable:[1818] - [WorkflowInstance-789][TaskInstance-789] - Task already exists in ready submit queue, no need to add again, task code:10503916731424
   [ERROR] 2023-08-09 11:04:21.022 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable:[441] - [WorkflowInstance-789][TaskInstance-789] - Task finish failed, get a exception, will remove this taskInstance from completeTaskSet
   java.lang.NullPointerException: id is marked non-null but is null
        at org.apache.dolphinscheduler.dao.repository.BaseDao.queryById(BaseDao.java:41)
        at org.apache.dolphinscheduler.dao.repository.BaseDao$$FastClassBySpringCGLIB$$ca36ec34.invoke(<generated>)
        at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
        at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:137)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
        at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708)
        at org.apache.dolphinscheduler.dao.repository.impl.TaskInstanceDaoImpl$$EnhancerBySpringCGLIB$$4774b4ba.queryById(<generated>)
        at org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable.submitStandByTask(WorkflowExecuteRunnable.java:1906)
        at org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable.retryTaskInstance(WorkflowExecuteRunnable.java:510)
        at org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable.taskFinished(WorkflowExecuteRunnable.java:415)
        at org.apache.dolphinscheduler.server.master.event.TaskStateEventHandler.handleStateEvent(TaskStateEventHandler.java:74)
        at org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable.handleEvents(WorkflowExecuteRunnable.java:291)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   [ERROR] 2023-08-09 11:04:21.022 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable:[315] - [WorkflowInstance-789][TaskInstance-789] - State event handle error, get a unknown exception, will retry this event: TaskStateEvent(processInstanceId=789, taskInstanceId=789, taskCode=0, status=TaskExecutionStatus{code=8, desc='need fault tolerance'}, type=TASK_STATE_CHANGE, key=null, channel=null, context=null)
   java.lang.NullPointerException: id is marked non-null but is null
        at org.apache.dolphinscheduler.dao.repository.BaseDao.queryById(BaseDao.java:41)
        at org.apache.dolphinscheduler.dao.repository.BaseDao$$FastClassBySpringCGLIB$$ca36ec34.invoke(<generated>)
        at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
        at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:137)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
        at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708)
        at org.apache.dolphinscheduler.dao.repository.impl.TaskInstanceDaoImpl$$EnhancerBySpringCGLIB$$4774b4ba.queryById(<generated>)
        at org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable.submitStandByTask(WorkflowExecuteRunnable.java:1906)
        at org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable.retryTaskInstance(WorkflowExecuteRunnable.java:510)
        at org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable.taskFinished(WorkflowExecuteRunnable.java:415)
        at org.apache.dolphinscheduler.server.master.event.TaskStateEventHandler.handleStateEvent(TaskStateEventHandler.java:74)
        at org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable.handleEvents(WorkflowExecuteRunnable.java:291)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   ```
   
   ### Note
   These logs keep being generated: the master reports that it "will retry this event" and then fails with the same exception again.
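
The exception message `id is marked non-null but is null` has the shape of the null check that Lombok generates for a `@NonNull` parameter, and the stack trace shows `submitStandByTask` reaching `queryById` with a null id. The following is a minimal sketch of that situation, not the DolphinScheduler code: `NullIdSketch`, its `TaskInstance` class, and `reloadIfPersisted` are hypothetical stand-ins, and it is an assumption that `BaseDao.queryById` carries a Lombok `@NonNull` annotation and that the queued task instance had no database id yet.

```java
// Hypothetical stand-ins for illustration; these are not the DolphinScheduler classes.
class NullIdSketch {

    /** A task instance that has not been persisted yet has no database id. */
    static final class TaskInstance {
        final Integer id;      // assumed to stay null until the row is inserted
        final long taskCode;

        TaskInstance(Integer id, long taskCode) {
            this.id = id;
            this.taskCode = taskCode;
        }
    }

    /** Hand-written equivalent of the check Lombok emits for a @NonNull parameter. */
    static TaskInstance queryById(Integer id) {
        if (id == null) {
            throw new NullPointerException("id is marked non-null but is null");
        }
        return new TaskInstance(id, 0L);
    }

    /** One possible guard (an assumption, not the project's fix): skip the lookup without an id. */
    static TaskInstance reloadIfPersisted(TaskInstance queued) {
        return queued.id == null ? queued : queryById(queued.id);
    }

    public static void main(String[] args) {
        TaskInstance queued = new TaskInstance(null, 10503916731424L);
        try {
            queryById(queued.id);
        } catch (NullPointerException e) {
            System.out.println(e.getMessage());
        }
        System.out.println(reloadIfPersisted(queued) == queued);
    }
}
```

Whether a guard like `reloadIfPersisted` is appropriate depends on why the queued task instance has no id in the first place, which this report does not establish.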
   
   ### What you expected to happen
   
   The master should handle the worker restart without throwing a NullPointerException.
   
   ### How to reproduce
   
   A new cluster was deployed on Kubernetes from the dev branch, with MySQL as the database.
   
   Servers: 1 master, 1 worker.
   
   The following steps were taken:
   
   1. Created a shell task with the script content 'exit 10086;' and set the task to retry once.
   2. Created a corresponding workflow with the serial wait execution policy.
   3. Created a schedule to execute the workflow every minute.
   
   After the task had executed several times:
   
   1. Restarted the only worker node.
   2. The master node continued to generate tasks, detected that the worker node was unavailable, and triggered fault tolerance. The exception logs above were then thrown.
   3. After the worker node restarted, the master node continued to generate tasks and the worker node was able to complete new tasks.
   4. Both the master and worker nodes appeared to be working properly, and the tasks were actually completed.
   
   However, the master node continues to throw the above-mentioned logs.
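
The repetition is consistent with the "will retry this event" message in the log: if a failed event stays at the head of the queue and the handler fails deterministically, every pass produces the same error. The sketch below illustrates that pattern under stated assumptions. `RetryLoopSketch`, `Handler`, and `drain` are hypothetical names, the loop is bounded by `maxRounds` only so the example terminates, and it is not a description of how `WorkflowExecuteRunnable.handleEvents` is actually implemented.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical event loop for illustration; not the DolphinScheduler implementation.
class RetryLoopSketch {

    interface Handler {
        void handle(String event) throws Exception;
    }

    /**
     * Handles the head event on each round. The event is removed only after the
     * handler succeeds, so a handler that always throws is retried every round.
     * Returns the number of failed attempts.
     */
    static int drain(Deque<String> events, Handler handler, int maxRounds) {
        int failures = 0;
        for (int round = 0; round < maxRounds && !events.isEmpty(); round++) {
            String event = events.peek();
            try {
                handler.handle(event);
                events.poll();   // success: drop the event
            } catch (Exception e) {
                failures++;      // failure: keep the event and retry next round
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        Deque<String> events = new ArrayDeque<>();
        events.add("TASK_STATE_CHANGE");
        int failures = drain(events, event -> {
            throw new NullPointerException("id is marked non-null but is null");
        }, 5);
        System.out.println(failures + " failures, " + events.size() + " event(s) still queued");
    }
}
```

In this sketch the event never leaves the queue, which matches the observation that new tasks complete normally while the same error keeps being logged for the same task instance.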
   
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   dev
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.