I think you can check the runtime logs on the master server and the worker server for WARN/ERROR messages from around the time you received the hung-up alarm.

Best Regards

---------------
Apache DolphinScheduler PMC Chair
LidongDai
[email protected]
Linkedin: https://www.linkedin.com/in/dailidong
Twitter: @WorkflowEasy <https://twitter.com/WorkflowEasy>
---------------

On Mon, Nov 22, 2021 at 10:54 AM 王峰 <[email protected]> wrote:

> 3 nodes, with 2 master/worker all on the same machine. There was no
> downtime, but the "server service hung up" alarm was raised. I guess that
> insufficient machine resources affected the operation of the server and
> fault tolerance was triggered. After the error status was returned, the
> actual task did not stop, and a new task instance was started on the new
> server.
>
> At 2021-11-21 18:41:49, "Lidong Dai" <[email protected]> wrote:
> >hi,
> >can you describe the question clearly? the host load means the Master
> >or the Worker server? is there any server down?
> >
> >Best Regards
> >
> >On Sun, Nov 21, 2021 at 3:59 PM 王峰 <[email protected]> wrote:
> >>
> >> DolphinScheduler 1.3.3 cluster
> >>
> >> There is such a scenario: because the host load is too high, master
> >> fault tolerance may occur partway through a run, and the same workflow
> >> instance is run twice (the two runs overlap in time), which causes the
> >> data to double.
