Hi everyone,

Thank you for all the feedback on this FLIP! I will open a vote for it since there are no more concerns.
Thanks,
Zhu

Zhu Zhu <reed...@gmail.com> wrote on Wed, May 11, 2022 at 12:29:

> Hi everyone,
>
> According to the discussion and the updates of the blocklist mechanism[1] (FLIP-224), I have updated FLIP-168 so that it decides on its own to block identified slow nodes. A new configuration option is also added to control how long a slow node should be blocked.
>
> [1] https://lists.apache.org/thread/fngkk52kjbc6b6v9nn0lkfq6hhsbgb1h
>
> Thanks,
> Zhu
>
> Zhu Zhu <reed...@gmail.com> wrote on Fri, Apr 29, 2022 at 14:36:
> >
> > Thank you for all the feedback!
> >
> > @Guowei Ma
> > Here are my thoughts on your questions:
> >
> > >> 1. How to judge whether the Execution Vertex belongs to a slow task.
> > If a slow task fails and gets restarted, it may not be a slow task anymore, especially given that the nodes of the slow task may have been blocklisted and the new task will be deployed to a new node. I think we should again go through the slow task detection process to determine whether it is a slow task. I agree that it is not ideal to take another 59 minutes to identify a slow task. To solve this problem, one idea is to introduce a slow task detection strategy that identifies slow tasks according to their throughput. This approach needs more thought and experiments, so we are targeting it for a future time.
> >
> > >> 2. The fault tolerance strategy and the slow task detection strategy are coupled
> > I don't think fault tolerance and slow task detection are coupled. If a task fails while the ExecutionVertex still has a task in progress, there is no need to start new executions for the vertex from the perspective of fault tolerance. If the remaining task is slow, a speculative execution will be created and deployed for it in the next round of slow task detection. This, however, is a normal speculative execution process rather than a failure recovery process. In this way, fault tolerance and slow task detection work without knowing about each other, and the job can still recover from failures while speculative executions are guaranteed for slow tasks.
> >
> > >> 3. Default value of `slow-task-detector.execution-time.baseline-lower-bound` is too small
> > From what I see in production and hear from users, there are many batch jobs of a relatively small scale (a few terabytes or hundreds of gigabytes). Tasks of these jobs can finish in minutes, so a `1 min` lower bound is large enough. Besides that, I think the out-of-the-box experience is more important for users running small-scale jobs.
> >
> > Thanks,
> > Zhu
> >
> > Guowei Ma <guowei....@gmail.com> wrote on Thu, Apr 28, 2022 at 17:55:
> >>
> >> Hi Zhu,
> >>
> >> Many thanks to Zhu Zhu for initiating the FLIP discussion. Overall I think it is fine; I just have 3 small questions.
> >>
> >> 1. How to judge whether the Execution Vertex belongs to a slow task.
> >> The current calculation method is: the current timestamp minus the timestamp of the execution deployment. If the execution time of this execution exceeds the baseline, it is judged to be a slow task. Normally this is no problem, but if an execution fails, the time may not be accurate. For example, if the baseline is 59 minutes and a task fails after 56 minutes of execution, then in the worst case it may take an additional 59 minutes to discover that the task is a slow task.
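(As a rough, hedged illustration of the execution-time check described in question 1 above: the running time of an execution is taken as the current timestamp minus its deployment timestamp and compared against the baseline. The sketch below captures only that comparison; all class, field, and method names are hypothetical and are not the actual `ExecutionTimeBasedSlowTaskDetector` internals.)

    import java.time.Duration;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    /** Hypothetical sketch of an execution-time-based slow task check. */
    public class ExecutionTimeCheckSketch {

        /** Deployment timestamps (epoch millis) of the running executions, keyed by a task id. */
        private final Map<String, Long> deploymentTimestamps;

        /** Effective baseline, already clipped by the configured lower bound (1 min by default). */
        private final Duration baseline;

        public ExecutionTimeCheckSketch(Map<String, Long> deploymentTimestamps, Duration baseline) {
            this.deploymentTimestamps = deploymentTimestamps;
            this.baseline = baseline;
        }

        /** Returns the ids of tasks whose running time exceeds the baseline. */
        public List<String> findSlowTasks(long nowMillis) {
            List<String> slowTasks = new ArrayList<>();
            for (Map.Entry<String, Long> entry : deploymentTimestamps.entrySet()) {
                long runningTime = nowMillis - entry.getValue(); // "now" minus deployment timestamp
                if (runningTime > baseline.toMillis()) {
                    slowTasks.add(entry.getKey());
                }
            }
            return slowTasks;
        }
    }

(In this model a restarted execution gets a fresh deployment timestamp, which is exactly why a task that fails shortly before the baseline may need almost another full baseline period before it is identified as slow again.)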
> >> 2. Speculative Scheduler's fault tolerance strategy.
> >> The strategy in the FLIP is: as long as the Execution Vertex can still be executed, the fault tolerance strategy will not kick in, even if an execution fails. Currently `ExecutionTimeBasedSlowTaskDetector` can restart an execution, but isn't this dependency a bit too strong? To some extent, the fault tolerance strategy and the slow task detection strategy are coupled together.
> >>
> >> 3. The value of the default configuration
> >> IMHO, speculative execution should only be needed for relatively large-scale, very time-consuming, long-running jobs. If `slow-task-detector.execution-time.baseline-lower-bound` is too small, isn't it possible for the system to keep starting additional tasks that have little effect? In the end, the user would need to reset this default configuration. Could a larger default be considered? Of course, it would be best to hear suggestions from other community users on this.
> >>
> >> Best,
> >> Guowei
> >>
> >> On Thu, Apr 28, 2022 at 3:54 PM Jiangang Liu <liujiangangp...@gmail.com> wrote:
> >>
> >> > +1 for the feature.
> >> >
> >> > Mang Zhang <zhangma...@163.com> wrote on Thu, Apr 28, 2022 at 11:36:
> >> >
> >> > > Hi Zhu,
> >> > >
> >> > > This sounds great! Thanks for your great work. In our company, there are already some jobs using Flink Batch, but everyone knows that an offline cluster has a much higher load than an online cluster, and its machine failure rate is also much higher. If this work is done, we'd love to use it; it would simply be awesome for our Flink users. Thanks again!
> >> > >
> >> > > --
> >> > > Best regards,
> >> > > Mang Zhang
> >> > >
> >> > > At 2022-04-27 10:46:06, "Zhu Zhu" <zh...@apache.org> wrote:
> >> > > > Hi everyone,
> >> > > >
> >> > > > More and more users are running their batch jobs on Flink nowadays. One major problem they encounter is slow tasks running on hot or bad nodes, resulting in very long and uncontrollable execution times of batch jobs. This problem is a pain point, or even unacceptable, in production. Many users have been asking for a solution for it.
> >> > > >
> >> > > > Therefore, I'd like to revive the discussion of speculative execution to solve this problem.
> >> > > >
> >> > > > Weijun Wang, Jing Zhang, Lijie Wang and I had some offline discussions to refine the design[1]. We also implemented a PoC[2] and verified it using TPC-DS benchmarks and production jobs.
> >> > > >
> >> > > > Looking forward to your feedback!
> >> > > >
> >> > > > Thanks,
> >> > > > Zhu
> >> > > >
> >> > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> >> > > > [2] https://github.com/zhuzhurk/flink/commits/1.14-speculative-execution-poc
> >> > > >
> >> > > > 刘建刚 <liujiangangp...@gmail.com> wrote on Mon, Dec 13, 2021 at 11:38:
> >> > > >
> >> > > >> Any progress on the feature? We have the same requirement in our company. Since the software and hardware environments can be complex, it is common to see a slow task that determines the execution time of the whole Flink job.
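(Returning to the baseline lower bound discussed in question 3 and its answer: as a hedged illustration only, the option named in the thread could be set programmatically through Flink's standard Configuration API as sketched below. The `5 min` value is just an example, not a recommendation, and the class name is made up for this sketch.)

    import org.apache.flink.configuration.Configuration;

    public class SpeculativeExecutionConfigSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Option named in the thread: the minimum execution-time baseline used by the
            // slow task detector. '1 min' is the proposed default; '5 min' is only an example.
            conf.setString("slow-task-detector.execution-time.baseline-lower-bound", "5 min");
            // The thread also mentions a new option controlling how long a slow node stays
            // blocked; its key is not named in this thread, so it is deliberately omitted here.
            System.out.println(conf);
        }
    }

(The same key can of course be set in flink-conf.yaml instead of in code.)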
> >> > > >> > >> > > >> <wangw...@sina.cn> 于2021年6月20日周日 22:35写道: > >> > > >> > >> > > >> > Hi everyone, > >> > > >> > > >> > > >> > I would like to kick off a discussion on speculative execution for > >> > > batch > >> > > >> > job. > >> > > >> > I have created FLIP-168 [1] that clarifies our motivation to do > >> > > >> > this > >> > > and > >> > > >> > some improvement proposals for the new design. > >> > > >> > It would be great to resolve the problem of long tail task in > >> > > >> > batch > >> > > job. > >> > > >> > Please let me know your thoughts. Thanks. > >> > > >> > Regards, > >> > > >> > wangwj > >> > > >> > [1] > >> > > >> > > >> > > >> > >> > > > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job > >> > > >> > > >> > > >> > >> > > > >> >