Hello everyone!

I found a problem in DolphinScheduler (DS) and would like to propose a 
solution. Any advice is welcome.
Thank you all!


Describe the question

The worker load-balancing solution in the dev branch is a good feature. It is 
based on the `weight` and `start time` of the worker.

- `weight` is configured by `worker.weight`
- `start time` is set when the worker is registered to zookeeper

In the dev branch, the ZooKeeper registration path of the worker is 
`/dolphinscheduler/nodes/worker/default/<ip>:<port>:<weight>:<startTime>`, for 
example 
`/dolphinscheduler/nodes/worker/default/198.18.0.1:1234:100:1615022079945`. 
This differs from the path `/dolphinscheduler/nodes/worker/default/<ip>:<port>` 
used in the 1.3.x release.
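
To make the difference between the two path formats concrete, here is a minimal sketch that splits a worker znode name into its parts. The class `WorkerNodeName` and its methods are hypothetical names made up for this mail; they are not part of DolphinScheduler. The sketch assumes IPv4 addresses, so that `:` only appears as a separator.

```java
import java.util.OptionalInt;
import java.util.OptionalLong;

/** Hypothetical helper (not DolphinScheduler code): parses a worker znode name. */
final class WorkerNodeName {
    final String ip;
    final int port;
    final OptionalInt weight;      // present only in the dev-branch format
    final OptionalLong startTime;  // present only in the dev-branch format

    private WorkerNodeName(String ip, int port, OptionalInt weight, OptionalLong startTime) {
        this.ip = ip;
        this.port = port;
        this.weight = weight;
        this.startTime = startTime;
    }

    /** Accepts either <ip>:<port> or <ip>:<port>:<weight>:<startTime>. */
    static WorkerNodeName parse(String path) {
        // Keep only the last path segment, i.e. the znode name.
        String name = path.substring(path.lastIndexOf('/') + 1);
        String[] parts = name.split(":");
        if (parts.length == 2) {
            // 1.3.x format: address only.
            return new WorkerNodeName(parts[0], Integer.parseInt(parts[1]),
                    OptionalInt.empty(), OptionalLong.empty());
        }
        if (parts.length == 4) {
            // dev-branch format: address plus weight and start time.
            return new WorkerNodeName(parts[0], Integer.parseInt(parts[1]),
                    OptionalInt.of(Integer.parseInt(parts[2])),
                    OptionalLong.of(Long.parseLong(parts[3])));
        }
        throw new IllegalArgumentException("Unrecognized worker node name: " + name);
    }

    /** The <ip>:<port> form that code written against 1.3.x expects. */
    String address() {
        return ip + ":" + port;
    }

    public static void main(String[] args) {
        WorkerNodeName dev = parse(
                "/dolphinscheduler/nodes/worker/default/198.18.0.1:1234:100:1615022079945");
        System.out.println(dev.address() + " weight=" + dev.weight.getAsInt());
    }
}
```

Any code that builds or compares paths using only `address()` will not match a node registered in the four-part format, which is the root of the problems described in this mail.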

Both of them are used in the classes `RandomHostManager`, 
`RoundRobinHostManager` and `LowerWeightHostManager` to calculate the weight of 
the worker and select the best worker to dispatch a task to.

However, because the `weight` and `start time` are placed in the zookeeper 
registration path of the worker, some problems are introduced:

- Every place that depends on or refers to the 
`/dolphinscheduler/nodes/worker/default/<ip>:<port>` path is affected, and 
each one needs extra work to fix. For example:
  - worker fault tolerance #4757
  - worker `unRegistry` 
  - worker `handleDeadServer`
  - confusing results, as shown in the following pictures:
Picture 1:
![image](https://user-images.githubusercontent.com/4902714/110206106-d5243680-7eb6-11eb-8493-2685c1c9f9fe.png)
Picture 2:
![image](https://user-images.githubusercontent.com/4902714/110206102-d2294600-7eb6-11eb-9084-552c48e79e0b.png)
- The design of the class `Host` 
([source code](https://github.com/apache/incubator-dolphinscheduler/blob/dev/dolphinscheduler-remote/src/main/java/org/apache/dolphinscheduler/remote/utils/Host.java)) 
is unreasonable. The attributes `weight`, `startTime`, and `workGroup` should 
not be placed in this class, because doing so invites misuse and even 
potential bugs.

Improvement Solution

- Keep using the registration path 
`/dolphinscheduler/nodes/worker/default/<ip>:<port>` from the 1.3.x release, 
which avoids all of the problems mentioned above and many potential ones.
- Put `weight` into the znode data of 
`/dolphinscheduler/nodes/worker/default/<ip>:<port>`, while keeping the data 
compatible with the 1.3.x version.
- `startTime` is already included in the znode data, so it only needs to be 
read from there.
- Remove the attributes `weight`, `startTime`, and `workGroup` from the class 
`Host`, and perhaps introduce a new class to hold these attributes. This will 
avoid misuse of the class `Host`.
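
A rough sketch of how this separation could look is below. Everything in it is an assumption for discussion: the class names `Address` and `WorkerAttributes`, the `key=value` layout of the znode data, and the fallback weight of 100 are all invented here and do not reflect the actual DolphinScheduler code or its real znode data format.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the proposal; none of these types exist in DolphinScheduler. */
final class WorkerRegistrySketch {

    /** Assumed fallback when the znode data carries no weight (e.g. an older worker). */
    static final int DEFAULT_WEIGHT = 100;

    /** Address only: this is all the znode name would need to carry. */
    static final class Address {
        final String ip;
        final int port;

        Address(String ip, int port) {
            this.ip = ip;
            this.port = port;
        }

        /** The znode name, identical to the 1.3.x format <ip>:<port>. */
        String nodeName() {
            return ip + ":" + port;
        }
    }

    /** Scheduling attributes kept apart from the address and read from znode data. */
    static final class WorkerAttributes {
        final int weight;
        final long startTime;

        WorkerAttributes(int weight, long startTime) {
            this.weight = weight;
            this.startTime = startTime;
        }
    }

    /**
     * Parses znode data in an ASSUMED "key=value,key=value" layout.
     * Unknown keys are ignored and missing keys fall back to defaults,
     * so data without a weight entry can still be read.
     */
    static WorkerAttributes parseData(String data) {
        Map<String, String> fields = new HashMap<>();
        for (String pair : data.split(",")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2) {
                fields.put(kv[0].trim(), kv[1].trim());
            }
        }
        int weight = Integer.parseInt(
                fields.getOrDefault("weight", String.valueOf(DEFAULT_WEIGHT)));
        long startTime = Long.parseLong(fields.getOrDefault("startTime", "0"));
        return new WorkerAttributes(weight, startTime);
    }
}
```

With this split, the host managers would read `weight` and `startTime` from `WorkerAttributes`, while fault tolerance, `unRegistry` and `handleDeadServer` keep working with the plain `<ip>:<port>` node name.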

Which version of DolphinScheduler
 -[dev]

Related issue: https://github.com/apache/incubator-dolphinscheduler/issues/4984

Best Regards

--
DolphinScheduler(Incubator) Contributor
Shiwen Cheng 程世文
Mobile: (+86)15201523580
Email: [email protected]
