Re: [Discuss] FLIP-407: Improve Flink Client performance in interactive scenarios

xiangyu feng Thu, 11 Jan 2024 03:42:25 -0800

Hi devs,

Thanks for all the feedback. If there are no more comments, I would like to
start a vote for this FLIP. Thanks again!


Best,
Xiangyu Feng

On Tue, Jan 9, 2024 at 2:45 PM Weihua Hu <[email protected]> wrote:

> Thanks for proposing this FLIP.
>
> Experiments have shown that it significantly enhances the real-time query
> experience.
> +1 for this.
>
> Best,
> Weihua
>
>
> On Mon, Jan 8, 2024 at 5:19 PM Rui Fan <[email protected]> wrote:
>
>> Thanks Xiangyu for the quick update!
>>
>> LGTM
>>
>> Best,
>> Rui
>>
>> On Mon, Jan 8, 2024 at 4:27 PM xiangyu feng <[email protected]> wrote:
>>
>> > Hi Rui and Yong,
>> >
>> > Thanks for your reply.
>> >
>> > My initial intention here was to address short-lived jobs under high QPS: a
>> > fixed-delay retry strategy wastes resources and is not flexible enough,
>> > while an exponential-backoff strategy might significantly increase query
>> > latency because the interval grows too fast. An incremental-delay strategy
>> > could strike a balance between resource consumption and short-query
>> > latency.
>> >
>> > On second thought, an exponential-delay retry strategy with a
>> > configurable multiplier option can also achieve this goal. By setting the
>> > default value of the multiplier to 1, we can stay consistent with the
>> > original behavior and reduce the number of configuration items at the
>> > same time.
>> >
>> > I've updated this FLIP accordingly and look forward to your feedback.
>> >
>> > Regards,
>> > Xiangyu Feng
>> >
>> >
>> > On Mon, Jan 8, 2024 at 3:29 PM Rui Fan <[email protected]> wrote:
>> >
>> >> Only one strategy is fine with me.
>> >>
>> >> When the multiplier is set to 1, the exponential-delay strategy becomes
>> >> fixed-delay, so a separate fixed-delay strategy may not be needed.
>> >>
>> >> Best,
>> >> Rui
>> >>
>> >> On Mon, Jan 8, 2024 at 2:17 PM Yong Fang <[email protected]> wrote:
>> >>
>> >> > I agree with @Rui that the current configuration for the Flink Client is
>> >> > a little complex. Can we just provide one strategy with fewer
>> >> > configuration items for all scenarios?
>> >> >
>> >> > Best,
>> >> > Fang Yong
>> >> >
>> >> > On Mon, Jan 8, 2024 at 11:19 AM Rui Fan <[email protected]> wrote:
>> >> >
>> >> > > Thanks xiangyu for driving this proposal! And sorry for the
>> >> > > late reply.
>> >> > >
>> >> > > Overall looks good to me, I only have some minor questions:
>> >> > >
>> >> > > 1. Do we need to introduce 3 collect strategies in the first version?
>> >> > >
>> >> > > Large and comprehensive configuration items bring additional
>> >> > > learning and usage costs to users. I tend to provide users with
>> >> > > out-of-the-box parameters, and 2 collect strategies may be enough.
>> >> > >
>> >> > > IIUC, there is no big difference between exponential-delay and
>> >> > > incremental-delay, especially with the default parameters provided.
>> >> > > I wonder whether we could provide a multiplier for the
>> >> > > exponential-delay strategy and remove the incremental-delay strategy?
>> >> > >
>> >> > > Of course, if you think the multiplier option is not needed based on
>> >> > > your production experience, that's totally fine with me. Simple is
>> >> > > better.
>> >> > >
>> >> > > 2. Which strategy do you think is best in mass production?
>> >> > >
>> >> > > I'm working on FLIP-364[1], it's related to Flink failover restart
>> >> > > strategy. IIUC, when one cluster only has a few flink jobs,
>> >> > > fixed-delay is fine. It guarantees minimal latency without too
>> >> > > much stress. But if one cluster has too many jobs, fixed-delay
>> >> > > may not be stable.
>> >> > >
>> >> > > Do you think exponential-delay is better than fixed delay in this
>> >> > > scenario? And which strategy is used in your production for now?
>> >> > > Would you mind sharing it?
>> >> > >
>> >> > > Looking forward to your opinion.
>> >> > >
>> >> > > Best,
>> >> > > Rui
>> >> > >
>> >> > > On Sat, Jan 6, 2024 at 5:54 PM xiangyu feng <[email protected]> wrote:
>> >> > >
>> >> > > > Hi all,
>> >> > > >
>> >> > > > Thanks for the comments.
>> >> > > >
>> >> > > > If there is no further comment, we will open the voting thread
>> >> > > > next week.
>> >> > > >
>> >> > > > Regards,
>> >> > > > Xiangyu
>> >> > > >
>> >> > > > On Wed, Jan 3, 2024 at 4:46 PM Zhanghao Chen <[email protected]> wrote:
>> >> > > >
>> >> > > > > Thanks for driving this effort on improving the interactive use
>> >> > > > > experience of Flink. The proposal overall looks good to me.
>> >> > > > >
>> >> > > > > Best,
>> >> > > > > Zhanghao Chen
>> >> > > > > ________________________________
>> >> > > > > From: xiangyu feng <[email protected]>
>> >> > > > > Sent: Tuesday, December 26, 2023 16:51
>> >> > > > > To: [email protected] <[email protected]>
>> >> > > > > Subject: [Discuss] FLIP-407: Improve Flink Client performance in
>> >> > > > > interactive scenarios
>> >> > > > >
>> >> > > > > Hi devs,
>> >> > > > >
>> >> > > > > I'm opening this thread to discuss FLIP-407: Improve Flink Client
>> >> > > > > performance in interactive scenarios. The POC test results and
>> >> > > > > design doc can be found at: FLIP-407
>> >> > > > > <https://cwiki.apache.org/confluence/display/FLINK/FLIP-407%3A+Improve+Flink+Client+performance+when+interacting+with+dedicated+Flink+Session+Clusters>.
>> >> > > > >
>> >> > > > > Currently, Flink Client is mainly designed for one-time interaction
>> >> > > > > with the Flink Cluster. All the resources (HTTP connections, threads,
>> >> > > > > HA services) and instances (ClusterDescriptor, ClusterClient,
>> >> > > > > RestClient) are created and recycled for each interaction. This works
>> >> > > > > well when users do not need to interact frequently with the Flink
>> >> > > > > Cluster, and it also saves resources since they are recycled
>> >> > > > > immediately after each use.
>> >> > > > >
>> >> > > > > However, in OLAP or StreamingWarehouse scenarios, users might submit
>> >> > > > > interactive jobs to a dedicated Flink Session Cluster very often. In
>> >> > > > > this case, we find that short queries that can finish in less than 1s
>> >> > > > > in the Flink Cluster still have an E2E latency greater than 2s. Hence,
>> >> > > > > we propose this FLIP to improve the Flink Client performance in this
>> >> > > > > scenario. This could also improve the user experience when using
>> >> > > > > session debug mode.
>> >> > > > >
>> >> > > > > The major change in this FLIP is a newly introduced option,
>> >> > > > > *'execution.interactive-client'*. When this option is enabled, Flink
>> >> > > > > Client will reuse all the necessary resources to improve interactive
>> >> > > > > performance, including HA services, HTTP connections, threads, and
>> >> > > > > all kinds of instances related to a long-running Flink Cluster. The
>> >> > > > > default value of this option will be false, in which case Flink
>> >> > > > > Client will behave as before.
>> >> > > > >
>> >> > > > > Also, this FLIP proposes a configurable RetryStrategy for fetching
>> >> > > > > results from the Flink Cluster on the client side. In interactive
>> >> > > > > scenarios, this can save more than 15% of TM CPU usage without
>> >> > > > > performance degradation.
>> >> > > > >
>> >> > > > > Looking forward to your feedback, thanks.
>> >> > > > >
>> >> > > > > Best regards,
>> >> > > > > Xiangyu
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> >
>>
>
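
To illustrate the retry strategies compared in this thread, here is a minimal
Java sketch of how the delay before each result-fetch retry could be computed.
It is only an illustration under stated assumptions: the class name, method
names, and parameter values (100 ms initial delay, 1 s cap, 50 ms increment)
are hypothetical and are not taken from Flink or from FLIP-407. It shows the
point made above that an exponential-delay strategy with a multiplier of 1
behaves like a fixed-delay strategy, while a multiplier between 1 and 2 grows
more gently than plain doubling.

```java
/**
 * Hypothetical sketch of the retry-delay strategies discussed in this thread.
 * None of these names come from Flink or FLIP-407.
 */
public final class RetryDelaySketch {

    private RetryDelaySketch() {}

    /**
     * Exponential-delay: initialMs * multiplier^attempt, capped at maxMs.
     * The attempt index is 0-based. With multiplier == 1.0 the delay never
     * changes, which is the fixed-delay behavior.
     */
    public static long exponentialDelayMs(
            long initialMs, double multiplier, long maxMs, int attempt) {
        double delay = initialMs * Math.pow(multiplier, attempt);
        return (long) Math.min(delay, (double) maxMs);
    }

    /**
     * Incremental-delay: initialMs + incrementMs * attempt, capped at maxMs.
     * Included only for comparison with the exponential variant.
     */
    public static long incrementalDelayMs(
            long initialMs, long incrementMs, long maxMs, int attempt) {
        return Math.min(initialMs + incrementMs * attempt, maxMs);
    }

    public static void main(String[] args) {
        // Assumed example values, not defaults from the FLIP.
        long initialMs = 100;
        long maxMs = 1000;
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.printf(
                    "attempt=%d fixed(x1.0)=%d exp(x1.5)=%d exp(x2.0)=%d incremental(+50)=%d%n",
                    attempt,
                    exponentialDelayMs(initialMs, 1.0, maxMs, attempt),
                    exponentialDelayMs(initialMs, 1.5, maxMs, attempt),
                    exponentialDelayMs(initialMs, 2.0, maxMs, attempt),
                    incrementalDelayMs(initialMs, 50, maxMs, attempt));
        }
    }
}
```

Running `main` prints the delay per attempt for each variant, which makes it
easy to see how quickly the polling interval grows for a short query under
each setting.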
