yihao-tcf commented on issue #14923:
URL: https://github.com/apache/dolphinscheduler/issues/14923#issuecomment-1782138666

   > I have also run into this problem, and it seriously affects the normal operation of scheduled tasks.
   > 
   > **Version**: 3.1.7
   > 
   > **Reproduction steps**
   > 
   > 1. Create worker groups.
   > 
   >    ![2f5457678c7943eaecbdba3976b51cb2](https://user-images.githubusercontent.com/65161474/273366969-888f1d3a-5843-4bd5-981c-db111f0854a4.png)
   > 2. Create test tasks, one in each of three different worker groups.
   > 
   >    Task 1 uses the '大数据集群' (big data cluster) worker group:
   > 
   >    ![71125c534dbed5a09d2402022c8fd444](https://user-images.githubusercontent.com/65161474/273366981-39e4f000-1a90-44a5-9f7e-bd415574305f.png)
   > 
   >    Task 2 uses the '算法集群' (algorithm cluster) worker group:
   > 
   >    ![2a6a238c512ac1c8ef2b539d3732e01f](https://user-images.githubusercontent.com/65161474/273366987-6f38fef6-1922-43b7-bed5-4bfa8ad42b59.png)
   > 
   >    Task 3 uses the 'default' worker group:
   > 
   >    ![a44759155e6f9e76d488f780b940b25a](https://user-images.githubusercontent.com/65161474/273366991-d0d1fc4f-0f7f-47d2-8dad-1785a3b1af33.png)
   > 
   >    Set the schedule to `* * * * * ? *`, then bring the workflow and its schedule online.
   > 
   > 3. Stop the worker in the '算法集群' worker group.
   > 
   >    ![23383d0055ee3d585f29501206eae990](https://user-images.githubusercontent.com/65161474/273367007-b35bcda6-51c3-4a2f-bb0f-bf355e58628a.png)
   > 
   > **After the worker is stopped, tasks in all worker groups are affected, and errors are reported saying that the task instance is null or the host is null.**
   > 
   > Worker group information after stopping the worker:
   > 
   > ![9b4d332c44a6e62b7ca0854e98456d12](https://user-images.githubusercontent.com/65161474/273366917-aefc0176-213e-416d-b7e4-00ac23b161f9.png)
   > 
   > Task execution status after stopping the worker:
   > 
   > ![b6582c94c541f103d7d43f14b9ec55ab](https://user-images.githubusercontent.com/65161474/273366920-90e4114c-7589-48a0-9912-e0ee278e4acb.png)
   > 
   > ![5e80093fa06d528d3b5303f2b0e5fa69](https://user-images.githubusercontent.com/65161474/273366923-dac2288b-aaac-4e55-b9a0-489031da3cad.png)
   > 
   > ![91f4f3c91f706e2cc5f750a1b482c496](https://user-images.githubusercontent.com/65161474/273366929-15f3e7f5-3d35-49cc-a65b-871050a85382.png)
   > 
   > ![576601d98be9c6127e3bd64a1820a0b7](https://user-images.githubusercontent.com/65161474/273366931-14466014-bfab-46fb-9a6b-c1f9258cdc59.png)
   > 
   > Error log messages:
   > 
   > ![98e57f541dc77e1eabe63996bc65c22a](https://user-images.githubusercontent.com/65161474/273366951-78b647cc-08b0-4e63-805a-980fe828460e.png)
   > 
   > After the stopped worker is restarted, task execution resumes. The large backlog of accumulated tasks then starts running all at once, which can put the cluster under heavy load and bring down other services running on it. This is very dangerous.
   > 
   > I don't understand why a node outage in the '算法集群' worker group affects tasks that run on nodes in other worker groups.
   
   @kezhenyang163 
   The reproduction steps show that when the nodes in my "算法集群" worker group were shut down, tasks in the "大数据集群" and "default" worker groups were also affected and could not run normally. This seems very unreasonable to me: a single executor node going down ends up affecting every node in the system. Do you think this behavior is reasonable?

