czeming commented on issue #9401: URL: https://github.com/apache/dolphinscheduler/issues/9401#issuecomment-1097810831
I would like to try to fix this problem. Currently, when handling an event fails with an unknown exception, the exception is caught, `remove` is never executed, and the same event is retried indefinitely. This is most likely why the same log is printed so frequently: the same data is processed again and again and fails every time.

I suggest using `poll` so that the data is always removed from the queue, but I am not sure whether the retry behavior here is intentional. There are two options:

1. Use `poll` directly. If handling fails, run the failure handling.
2. Use `poll` and retry a limited number of times, for example in a `while` loop. If it still fails, run the failure handling.

Failure handling can also take two forms:

1. Log the failure directly.
2. Maintain a global failure queue: print the log and add the failed data to the failure queue. If a failure queue is adopted, its purpose needs to be discussed.

======

I also found a problem in the existing code:

- `org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread#handleEvents` calls `remove`
- `org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread#stateEventHandler` calls `remove`

So when result = true, `remove` is called a second time, and that call traverses the queue once. In addition, reading an event with `peek` and then running a time-consuming task may cause concurrent misreads of the same event. For these reasons, `poll` is recommended.

@JinyLeeChina To troubleshoot the specific problem you encountered, more detailed stack information is needed. I have gone through the execution logic, and apart from `process_state_change` there is little logic that could go wrong.
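To make option 2 concrete, here is a minimal sketch of `poll` with a bounded retry and a failure queue. It is not the DolphinScheduler code: `PollRetrySketch`, `drain`, `MAX_ATTEMPTS`, the `Predicate` handler, and the returned failure list are all hypothetical names used only for illustration, and a plain `ConcurrentLinkedQueue` stands in for the real state event queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Predicate;

// Hypothetical sketch, not the actual WorkflowExecuteThread logic.
class PollRetrySketch {
    // Assumed retry limit; the right value would need discussion.
    static final int MAX_ATTEMPTS = 3;

    // Drains the queue with poll(), so every event leaves the queue exactly once,
    // whether or not its handler succeeds. Returns the events that never succeeded.
    static <E> List<E> drain(Queue<E> events, Predicate<? super E> handler) {
        List<E> failed = new ArrayList<>(); // stands in for a global failure queue
        E event;
        while ((event = events.poll()) != null) {
            boolean handled = false;
            for (int attempt = 1; attempt <= MAX_ATTEMPTS && !handled; attempt++) {
                try {
                    handled = handler.test(event);
                } catch (RuntimeException e) {
                    // Log and retry; the event is already off the queue.
                    System.err.println("attempt " + attempt + " failed for " + event + ": " + e);
                }
            }
            if (!handled) {
                failed.add(event); // failure handling: record instead of retrying forever
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        Queue<String> events = new ConcurrentLinkedQueue<>(List.of("ok-1", "bad", "ok-2"));
        List<String> failed = drain(events, e -> {
            if (e.equals("bad")) {
                throw new IllegalStateException("unknown exception");
            }
            return true;
        });
        // prints "failed=[bad], remaining=0"
        System.out.println("failed=" + failed + ", remaining=" + events.size());
    }
}
```

Because `poll` removes the head atomically, a failing event cannot block the events behind it, and no separate `remove` call (with its queue traversal) is needed. Setting `MAX_ATTEMPTS` to 1 gives option 1.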
