Hi everyone,

Thank you for all the feedback on this FLIP! I will open a vote for it since there are no more concerns.
Thanks,
Zhu

Zhu Zhu <reed...@gmail.com> wrote on Wed, May 11, 2022 at 12:29:

> Hi everyone,
>
> According to the discussion and the updates of the blocklist mechanism[1] (FLIP-224), I have updated FLIP-168 so that it decides on its own to block identified slow nodes. A new configuration option is also added to control how long a slow node should be blocked.
>
> [1] https://lists.apache.org/thread/fngkk52kjbc6b6v9nn0lkfq6hhsbgb1h
>
> Thanks,
> Zhu
>
> Zhu Zhu <reed...@gmail.com> wrote on Fri, Apr 29, 2022 at 14:36:
> >
> > Thank you for all the feedback!
> >
> > @Guowei Ma
> > Here are my thoughts on your questions:
> >
> > >> 1. How to judge whether the Execution Vertex belongs to a slow task.
> > If a slow task fails and gets restarted, it may not be a slow task anymore, especially given that the nodes of the slow task may have been blocklisted and the new task will be deployed to a new node. I think we should again go through the slow task detection process to determine whether it is a slow task. I agree that it is not ideal to take another 59 minutes to identify a slow task. To solve this problem, one idea is to introduce a slow task detection strategy that identifies slow tasks according to their throughput. This approach needs more thought and experiments, so we are targeting it for a future time.
> >
> > >> 2. The fault tolerance strategy and the slow task detection strategy are coupled
> > I don't think fault tolerance and slow task detection are coupled. If a task fails while the ExecutionVertex still has a task in progress, there is no need to start new executions for the vertex from the perspective of fault tolerance. If the remaining task is slow, a speculative execution will be created and deployed for it in the next round of slow task detection. This, however, is a normal speculative execution process rather than a failure recovery process. In this way, fault tolerance and slow task detection work without knowing about each other, and the job can still recover from failures while speculative executions are guaranteed for slow tasks.
> >
> > >> 3. Default value of `slow-task-detector.execution-time.baseline-lower-bound` is too small
> > From what I see in production and hear from users, there are many batch jobs of a relatively small scale (a few terabytes or hundreds of gigabytes). Tasks of these jobs can finish in minutes, so a `1 min` lower bound is large enough. Besides that, I think the out-of-the-box experience is more important for users running small-scale jobs.
> >
> > Thanks,
> > Zhu
> >
> > Guowei Ma <guowei....@gmail.com> wrote on Thu, Apr 28, 2022 at 17:55:
> >>
> >> Hi Zhu,
> >>
> >> Many thanks to Zhu Zhu for initiating the FLIP discussion. Overall I think it is fine; I just have 3 small questions.
> >>
> >> 1. How to judge whether the Execution Vertex belongs to a slow task.
> >> The current calculation method is: the current timestamp minus the timestamp of the execution deployment. If the execution time of this execution exceeds the baseline, it is judged to be a slow task. Normally this is no problem, but if an execution fails, the time may not be accurate. For example, if the baseline is 59 minutes and a task fails after 56 minutes of execution, then in the worst case it may take an additional 59 minutes to discover that the task is a slow task.
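(As a rough, hedged illustration of the execution-time check described in question 1 above: the running time of an execution is taken as the current timestamp minus its deployment timestamp and compared against the baseline. The sketch below captures only that comparison; all class, field, and method names are hypothetical and are not the actual `ExecutionTimeBasedSlowTaskDetector` internals.)

    import java.time.Duration;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    /** Hypothetical sketch of an execution-time-based slow task check. */
    public class ExecutionTimeCheckSketch {

        /** Deployment timestamps (epoch millis) of the running executions, keyed by a task id. */
        private final Map<String, Long> deploymentTimestamps;

        /** Effective baseline, already clipped by the configured lower bound (1 min by default). */
        private final Duration baseline;

        public ExecutionTimeCheckSketch(Map<String, Long> deploymentTimestamps, Duration baseline) {
            this.deploymentTimestamps = deploymentTimestamps;
            this.baseline = baseline;
        }

        /** Returns the ids of tasks whose running time exceeds the baseline. */
        public List<String> findSlowTasks(long nowMillis) {
            List<String> slowTasks = new ArrayList<>();
            for (Map.Entry<String, Long> entry : deploymentTimestamps.entrySet()) {
                long runningTime = nowMillis - entry.getValue(); // "now" minus deployment timestamp
                if (runningTime > baseline.toMillis()) {
                    slowTasks.add(entry.getKey());
                }
            }
            return slowTasks;
        }
    }

(In this model a restarted execution gets a fresh deployment timestamp, which is exactly why a task that fails shortly before the baseline may need almost another full baseline period before it is identified as slow again.)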
> >> 2. Speculative Scheduler's fault tolerance strategy.
> >> The strategy in the FLIP is: as long as the Execution Vertex can still be executed, the fault tolerance strategy will not kick in, even if an execution fails. Currently `ExecutionTimeBasedSlowTaskDetector` can restart an execution, but isn't this dependency a bit too strong? To some extent, the fault tolerance strategy and the slow task detection strategy are coupled together.
> >>
> >> 3. The value of the default configuration
> >> IMHO, speculative execution should only be needed for relatively large-scale, very time-consuming, long-running jobs. If `slow-task-detector.execution-time.baseline-lower-bound` is too small, isn't it possible for the system to keep starting additional tasks that have little effect? In the end, the user would need to reset this default configuration. Could a larger default be considered? Of course, it would be best to hear suggestions from other community users on this.
> >>
> >> Best,
> >> Guowei
> >>
> >> On Thu, Apr 28, 2022 at 3:54 PM Jiangang Liu <liujiangangp...@gmail.com> wrote:
> >>
> >> > +1 for the feature.
> >> >
> >> > Mang Zhang <zhangma...@163.com> wrote on Thu, Apr 28, 2022 at 11:36:
> >> >
> >> > > Hi Zhu,
> >> > >
> >> > > This sounds great! Thanks for your great work. In our company, there are already some jobs using Flink Batch, but everyone knows that an offline cluster has a much higher load than an online cluster, and its machine failure rate is also much higher. If this work is done, we'd love to use it; it would simply be awesome for our Flink users. Thanks again!
> >> > >
> >> > > --
> >> > > Best regards,
> >> > > Mang Zhang
> >> > >
> >> > > At 2022-04-27 10:46:06, "Zhu Zhu" <zh...@apache.org> wrote:
> >> > > > Hi everyone,
> >> > > >
> >> > > > More and more users are running their batch jobs on Flink nowadays. One major problem they encounter is slow tasks running on hot or bad nodes, resulting in very long and uncontrollable execution times of batch jobs. This problem is a pain point, or even unacceptable, in production. Many users have been asking for a solution for it.
> >> > > >
> >> > > > Therefore, I'd like to revive the discussion of speculative execution to solve this problem.
> >> > > >
> >> > > > Weijun Wang, Jing Zhang, Lijie Wang and I had some offline discussions to refine the design[1]. We also implemented a PoC[2] and verified it using TPC-DS benchmarks and production jobs.
> >> > > >
> >> > > > Looking forward to your feedback!
> >> > > >
> >> > > > Thanks,
> >> > > > Zhu
> >> > > >
> >> > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> >> > > > [2] https://github.com/zhuzhurk/flink/commits/1.14-speculative-execution-poc
> >> > > >
> >> > > > 刘建刚 <liujiangangp...@gmail.com> wrote on Mon, Dec 13, 2021 at 11:38:
> >> > > >
> >> > > >> Any progress on the feature? We have the same requirement in our company. Since the software and hardware environments can be complex, it is common to see a slow task that determines the execution time of the whole Flink job.
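(Returning to the baseline lower bound discussed in question 3 and its answer: as a hedged illustration only, the option named in the thread could be set programmatically through Flink's standard Configuration API as sketched below. The `5 min` value is just an example, not a recommendation, and the class name is made up for this sketch.)

    import org.apache.flink.configuration.Configuration;

    public class SpeculativeExecutionConfigSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Option named in the thread: the minimum execution-time baseline used by the
            // slow task detector. '1 min' is the proposed default; '5 min' is only an example.
            conf.setString("slow-task-detector.execution-time.baseline-lower-bound", "5 min");
            // The thread also mentions a new option controlling how long a slow node stays
            // blocked; its key is not named in this thread, so it is deliberately omitted here.
            System.out.println(conf);
        }
    }

(The same key can of course be set in flink-conf.yaml instead of in code.)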
> >> > > >> > >> > > >> <wangw...@sina.cn> 于2021年6月20日周日 22:35写道: > >> > > >> > >> > > >> > Hi everyone, > >> > > >> > > >> > > >> > I would like to kick off a discussion on speculative execution for > >> > > batch > >> > > >> > job. > >> > > >> > I have created FLIP-168 [1] that clarifies our motivation to do > >> > > >> > this > >> > > and > >> > > >> > some improvement proposals for the new design. > >> > > >> > It would be great to resolve the problem of long tail task in > >> > > >> > batch > >> > > job. > >> > > >> > Please let me know your thoughts. Thanks. > >> > > >> > Regards, > >> > > >> > wangwj > >> > > >> > [1] > >> > > >> > > >> > > >> > >> > > > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job > >> > > >> > > >> > > >> > >> > > > >> >