Re: [DISCUSS] FLIP-364: Improve the restart-strategy

Zhu Zhu Mon, 13 Nov 2023 02:54:55 -0800

Hi Rui,
Thanks for creating this FLIP and sorry for jumping in so late into the
discussion.


The improvements to the exponential-delay strategy and making it the default
strategy look good to me in general. I have some comments on them, as well
as on the failure counting.

1. default values of exponential-delay configuration
It is mentioned in the FLIP that "the default value of these options are
not changed, we just change them for default restart-strategy. The default
restart-strategy just takes effect if checkpointing is enabled and the
user doesn’t configure the restart-strategy". I'm a bit concerned about
the inconsistency, which may confuse users. So if the proposed
configuration values work better in most cases, I lean towards
changing the default values of the options themselves.
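
To make the concern concrete, here is a minimal sketch of how two different
sets of defaults produce different restart delays for the same strategy. The
parameter names loosely mirror the exponential-delay options (initial backoff,
backoff multiplier, max backoff); the numeric values are hypothetical and are
not taken from the FLIP, and jitter and backoff reset are left out.

```python
def backoff_delays(initial_s, multiplier, max_s, attempts):
    """Return the delay in seconds before each of the first `attempts` restarts."""
    delays = []
    delay = initial_s
    for _ in range(attempts):
        # Never wait longer than the configured maximum.
        delays.append(min(delay, max_s))
        delay = min(delay * multiplier, max_s)
    return delays


# Hypothetical "option defaults" vs. hypothetical "default-strategy values".
option_defaults = backoff_delays(initial_s=1.0, multiplier=2.0, max_s=300.0, attempts=5)
strategy_defaults = backoff_delays(initial_s=1.0, multiplier=1.5, max_s=60.0, attempts=5)

print(option_defaults)
print(strategy_defaults)
```

A user who explicitly selects exponential-delay would get the first sequence,
while a user who relies on the default restart-strategy would get the second,
even though both run the "same" strategy.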

2. the default tolerable failure count
Currently, the restart-strategy behaves like this by default (if not
configured):
1. job retries on failures indefinitely if checkpointing is enabled
2. job fails on any failure if checkpointing is disabled
Changing it to always restart indefinitely may result in unexpected
behavior in production. Therefore, we should either keep the same
distinction for exponential-delay, or have an open discussion (also
involving the user ML) on changing this default behavior.
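
The current default described above can be summarized in a small sketch. The
function name and the returned labels are hypothetical and only illustrate
the decision; this is not Flink code.

```python
from typing import Optional


def effective_restart_behavior(checkpointing_enabled: bool,
                               configured_strategy: Optional[str] = None) -> str:
    """Illustrative only: which restart behavior applies to a job."""
    if configured_strategy is not None:
        # An explicitly configured strategy always wins.
        return configured_strategy
    # Not configured: the behavior depends on whether checkpointing is enabled.
    return "restart-indefinitely" if checkpointing_enabled else "no-restart"


print(effective_restart_behavior(checkpointing_enabled=True))
print(effective_restart_behavior(checkpointing_enabled=False))
```

Always restarting indefinitely would change the second case, which is the
behavior change that users may not expect.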

3. failure counting
Flink currently tries to recognize concurrent failures and group them
together, which can be seen in the web UI. So how about aligning the
failure counting with the concurrent failures computation? This would make
it more consistent and easier to understand. It would require changes to
the concurrent failures computation though, i.e. taking the backoff time
into consideration. So maybe we can open a separate FLIP for this change.
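
As a rough illustration of the idea (not how Flink computes concurrent
failures today), failures that arrive within a merge window after the first
failure of a group could be counted as a single restart attempt. The function
name and the `merge_window_ms` parameter are assumptions made for this sketch;
in practice the window would be derived from the backoff time.

```python
def count_restart_attempts(failure_times_ms, merge_window_ms):
    """Count failures, treating those within `merge_window_ms` of the start
    of the current group as one concurrent failure."""
    attempts = 0
    group_start = None
    for t in sorted(failure_times_ms):
        if group_start is None or t - group_start > merge_window_ms:
            # This failure is outside the current group: start a new group.
            attempts += 1
            group_start = t
    return attempts


# Three regions fail almost simultaneously, then two more fail 5 seconds later.
print(count_restart_attempts([0, 10, 20, 5000, 5005], merge_window_ms=1000))
```

With this grouping, a multi-region failure increases the counter once rather
than once per region.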

Thanks,
Zhu

Rui Fan <1996fan...@gmail.com> wrote on Fri, Nov 10, 2023 at 18:22:

> I'll start voting next Monday if there isn't any other comment.
>
> Best,
> Rui
>
> On Thu, Oct 19, 2023 at 6:59 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Hi Konstantin and Max,
> >
> > Thanks for your feedback!
> >
> > Sorry, I forgot to mention the default value of
> > `restart-strategy.exponential-delay.max-attempts-before-reset-backoff`.
> >
> > Retrying forever sounds good to me. I have added it to the FLIP:
> >
> > The default value of
> > `restart-strategy.exponential-delay.max-attempts-before-reset-backoff` is
> > Integer.MAX_VALUE.
> >
> > Best,
> > Rui
> >
> > On Thu, Oct 19, 2023 at 6:29 PM Maximilian Michels <m...@apache.org> wrote:
> >
> >> Hey Rui,
> >>
> >> +1 for making exponential backoff the default. I agree with Konstantin
> >> that retrying forever is a good default for exponential backoff
> >> because oftentimes the issue will resolve eventually. The purpose of
> >> exponential backoff is precisely to continue to retry without causing
> >> too much load. However, I'm not against adding an optional max number
> >> of retries.
> >>
> >> -Max
> >>
> >> On Thu, Oct 19, 2023 at 11:35 AM Konstantin Knauf <kna...@apache.org> wrote:
> >> >
> >> > Hi Rui,
> >> >
> >> > Thank you for this proposal and for working on this. I also agree that
> >> > exponential backoff makes sense as a new default in general. I think
> >> > restarting indefinitely (no max attempts) makes sense by default,
> >> > though, but of course allowing users to change it is valuable.
> >> >
> >> > So, overall +1.
> >> >
> >> > Cheers,
> >> >
> >> > Konstantin
> >> >
> >> > On Tue, Oct 17, 2023 at 07:11, Rui Fan <1996fan...@gmail.com> wrote:
> >> >
> >> > > Hi all,
> >> > >
> >> > > I would like to start a discussion on FLIP-364: Improve the
> >> > > restart-strategy[1]
> >> > >
> >> > > As we know, the restart-strategy is critical for Flink jobs. It
> >> > > mainly has two functions:
> >> > > 1. When an exception occurs in the Flink job, quickly restart the
> >> > > job so that the job can return to the running state.
> >> > > 2. When a job cannot be recovered after frequent restarts within
> >> > > a certain period of time, Flink will not retry but will fail the
> >> > > job.
> >> > >
> >> > > The current restart-strategy support for function 2 has some issues:
> >> > > 1. The exponential-delay strategy doesn't have a max attempts
> >> > > mechanism, which means that Flink will restart indefinitely even
> >> > > if the job fails frequently.
> >> > > 2. For multi-region streaming jobs and all batch jobs, the failure
> >> > > of each region will increase the total number of job failures by 1,
> >> > > even if these failures occur at the same time. If the number of
> >> > > failures increases too quickly, it will be difficult to set a
> >> > > reasonable number of retries.
> >> > > If the maximum number of failures is set too low, the job can easily
> >> > > reach the retry limit, causing the job to fail. If set too high,
> >> > > some jobs will never fail.
> >> > >
> >> > > In addition, when the above two problems are solved, we can also
> >> > > discuss whether exponential-delay can replace fixed-delay as the
> >> > > default restart-strategy. In theory, exponential-delay is smarter
> >> > > and friendlier than fixed-delay.
> >> > >
> >> > > I would also like to thank Zhu Zhu in advance for his suggestions
> >> > > on the option name in FLINK-32895[2].
> >> > >
> >> > > Looking forward to everyone's feedback and suggestions. Thank
> >> > > you.
> >> > >
> >> > > [1] https://cwiki.apache.org/confluence/x/uJqzDw
> >> > > [2] https://issues.apache.org/jira/browse/FLINK-32895
> >> > >
> >> > > Best,
> >> > > Rui
> >> > >
> >> >
> >> >
> >> > --
> >> > https://twitter.com/snntrable
> >> > https://github.com/knaufk
> >>
> >
>
