davidzollo commented on issue #11352:
URL: https://github.com/apache/dolphinscheduler/issues/11352#issuecomment-2078724177

   Now the questions:
   1) There is no checkpoint/savepoint management.
   2) If the DolphinScheduler service is redeployed or restarted, the current logic kills the old task first and then reschedules a new one. Is this logic reasonable for a stream task? A rescheduled task should also resume from the latest checkpoint, otherwise data will be lost (see the first sketch after this list).
   3) Submitting a Flink stream task requires the local PID to stay alive. If the PID exits abnormally, the Flink task can no longer be killed, because the current DS kill logic is based on in-memory state, and that state is cleared once the PID exits.
   4) There is no logic to pull a task back up after it fails abnormally. For example, after the YARN cluster goes down and recovers, will the tasks be rescheduled? Do we need new task-status detection logic? (The second sketch after this list shows PID-independent status polling and cancellation.)
   
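   For 3) and 4), one option is to persist the Flink job id (or YARN application id) at submit time and control the task through the cluster instead of the local PID. A minimal sketch, again with placeholder host and job id, using two standard Flink REST endpoints (GET /jobs/:jobid for the state, PATCH /jobs/:jobid to cancel):

   ```java
   import java.net.URI;
   import java.net.http.HttpClient;
   import java.net.http.HttpRequest;
   import java.net.http.HttpResponse;

   // Sketch: control a Flink stream task via its cluster-side job id,
   // independent of the local PID that submitted it.
   public class ClusterSideJobControl {
       private static final String JOB_MANAGER = "http://jobmanager:8081"; // assumed
       private static final HttpClient CLIENT = HttpClient.newHttpClient();

       /** Returns the job-detail JSON, which includes a "state" field
        *  (RUNNING, FAILED, FINISHED, ...). */
       static String jobDetail(String jobId) throws Exception {
           HttpRequest req = HttpRequest.newBuilder()
                   .uri(URI.create(JOB_MANAGER + "/jobs/" + jobId))
                   .GET()
                   .build();
           return CLIENT.send(req, HttpResponse.BodyHandlers.ofString()).body();
       }

       /** Cancels the job on the cluster; works even if the submitting
        *  PID on the worker is already gone. */
       static int cancel(String jobId) throws Exception {
           HttpRequest req = HttpRequest.newBuilder()
                   .uri(URI.create(JOB_MANAGER + "/jobs/" + jobId))
                   .method("PATCH", HttpRequest.BodyPublishers.noBody())
                   .build();
           return CLIENT.send(req, HttpResponse.BodyHandlers.discarding())
                   .statusCode();
       }

       public static void main(String[] args) throws Exception {
           String jobId = "<flink-job-id>"; // persisted at submit time, not a PID
           System.out.println(jobDetail(jobId)); // contains "state":"RUNNING" etc.
           // cancel(jobId); // kill path that does not depend on local memory
       }
   }
   ```

   With the job id persisted, a status-detection loop could notice a FAILED state after a YARN outage and reschedule the task, and kill would keep working even after the submitting PID is gone.
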
   Extensions:
   Currently, Flink task submission is done through the shell; the open question is whether follow-up work should stay shell-based or move to the API. A REST-based submission sketch follows.
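
   If the API route is taken, submission could go through the JobManager REST API instead of shelling out to `flink run`: upload the jar once with POST /jars/upload, then start it with POST /jars/:jarid/run, passing savepointPath so a rescheduled task resumes from its last checkpoint. A minimal sketch of the run step (jar id, entry class, and savepoint path are placeholder assumptions):

   ```java
   import java.net.URI;
   import java.net.http.HttpClient;
   import java.net.http.HttpRequest;
   import java.net.http.HttpResponse;

   // Sketch: start a job from an already-uploaded jar via the REST API,
   // resuming from a savepoint/externalized checkpoint.
   public class RestSubmitSketch {
       public static void main(String[] args) throws Exception {
           String jobManager = "http://jobmanager:8081";        // assumed
           String jarId = "<id-returned-by-POST-/jars/upload>"; // assumed

           // Body for POST /jars/:jarid/run; entryClass and savepointPath
           // are hypothetical values for illustration.
           String body = "{"
                   + "\"entryClass\":\"com.example.StreamJob\","
                   + "\"savepointPath\":\"hdfs:///flink/ckp/chk-42\","
                   + "\"parallelism\":2}";

           HttpRequest req = HttpRequest.newBuilder()
                   .uri(URI.create(jobManager + "/jars/" + jarId + "/run"))
                   .header("Content-Type", "application/json")
                   .POST(HttpRequest.BodyPublishers.ofString(body))
                   .build();
           HttpResponse<String> resp = HttpClient.newHttpClient()
                   .send(req, HttpResponse.BodyHandlers.ofString());
           System.out.println(resp.body()); // contains the new "jobid"
       }
   }
   ```

   The response JSON carries the new job id, which DS could persist for the PID-independent status polling and cancellation sketched above.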
   
