Thanks Xiangyu for the quick update! LGTM
Best,
Rui

On Mon, Jan 8, 2024 at 4:27 PM xiangyu feng <xiangyu...@gmail.com> wrote:

> Hi Rui and Yong,
>
> Thanks for your reply.
>
> My initial intention here was that, for short-lived jobs under high QPS, a
> fixed-delay retry strategy wastes extra resources and is not flexible
> enough, while an exponential-backoff strategy might significantly increase
> query latency because the interval grows too fast. An incremental-delay
> strategy could strike a balance between resource consumption and
> short-query latency.
>
> On second thought, an exponential-delay retry strategy with a
> configurable multiplier option can also achieve this goal. By setting the
> default value of the multiplier to 1, we can stay consistent with the
> original behavior and reduce the number of configuration items at the
> same time.
>
> I've updated this FLIP accordingly and look forward to your feedback.
>
> Regards,
> Xiangyu Feng
>
>
> Rui Fan <1996fan...@gmail.com> wrote on Mon, Jan 8, 2024 at 15:29:
>
>> Only one strategy is fine with me.
>>
>> When the multiplier is set to 1, the exponential-delay strategy becomes
>> fixed-delay.
>> So fixed-delay may not be needed.
>>
>> Best,
>> Rui
>>
>> On Mon, Jan 8, 2024 at 2:17 PM Yong Fang <zjur...@gmail.com> wrote:
>>
>> > I agree with @Rui that the current configuration for Flink Client is a
>> > little complex. Can we just provide one strategy with fewer
>> > configuration items for all scenarios?
>> >
>> > Best,
>> > Fang Yong
>> >
>> > On Mon, Jan 8, 2024 at 11:19 AM Rui Fan <1996fan...@gmail.com> wrote:
>> >
>> > > Thanks Xiangyu for driving this proposal! And sorry for the
>> > > late reply.
>> > >
>> > > Overall it looks good to me. I only have some minor questions:
>> > >
>> > > 1. Do we need to introduce 3 collect strategies in the first version?
>> > >
>> > > Large and comprehensive configuration items will bring
>> > > additional learning costs and usage costs to users.
>> > > I tend to provide users with out-of-the-box parameters, and 2 collect
>> > > strategies may be enough for users.
>> > >
>> > > IIUC, there is no big difference between exponential-delay and
>> > > incremental-delay, especially with the default parameters provided.
>> > > I wonder whether we could provide a multiplier for the
>> > > exponential-delay strategy and remove the incremental-delay strategy?
>> > >
>> > > Of course, if you think the multiplier option is not needed based on
>> > > your production experience, it's totally fine with me. Simple is better.
>> > >
>> > > 2. Which strategy do you think is best in mass production?
>> > >
>> > > I'm working on FLIP-364[1], which is related to the Flink failover
>> > > restart strategy. IIUC, when a cluster only has a few Flink jobs,
>> > > fixed-delay is fine. It guarantees minimal latency without too
>> > > much stress. But if a cluster has too many jobs, fixed-delay
>> > > may not be stable.
>> > >
>> > > Do you think exponential-delay is better than fixed-delay in this
>> > > scenario? And which strategy is used in your production for now?
>> > > Would you mind sharing it?
>> > >
>> > > Looking forward to your opinion~
>> > >
>> > > Best,
>> > > Rui
>> > >
>> > > On Sat, Jan 6, 2024 at 5:54 PM xiangyu feng <xiangyu...@gmail.com> wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > Thanks for the comments.
>> > > >
>> > > > If there is no further comment, we will open the voting thread
>> > > > next week.
>> > > >
>> > > > Regards,
>> > > > Xiangyu
>> > > >
>> > > > Zhanghao Chen <zhanghao.c...@outlook.com> wrote on Wed, Jan 3, 2024 at 16:46:
>> > > >
>> > > > > Thanks for driving this effort on improving the interactive use
>> > > > > experience of Flink. The proposal overall looks good to me.
>> > > > >
>> > > > > Best,
>> > > > > Zhanghao Chen
>> > > > > ________________________________
>> > > > > From: xiangyu feng <xiangyu...@gmail.com>
>> > > > > Sent: Tuesday, December 26, 2023 16:51
>> > > > > To: dev@flink.apache.org <dev@flink.apache.org>
>> > > > > Subject: [Discuss] FLIP-407: Improve Flink Client performance in
>> > > > > interactive scenarios
>> > > > >
>> > > > > Hi devs,
>> > > > >
>> > > > > I'm opening this thread to discuss FLIP-407: Improve Flink Client
>> > > > > performance in interactive scenarios. The POC test results and
>> > > > > design doc can be found at: FLIP-407
>> > > > > <https://cwiki.apache.org/confluence/display/FLINK/FLIP-407%3A+Improve+Flink+Client+performance+when+interacting+with+dedicated+Flink+Session+Clusters>.
>> > > > >
>> > > > > Currently, Flink Client is mainly designed for one-time interaction
>> > > > > with the Flink Cluster. All the resources (HTTP connections, threads,
>> > > > > HA services) and instances (ClusterDescriptor, ClusterClient,
>> > > > > RestClient) are created and recycled for each interaction. This works
>> > > > > well when users do not need to interact frequently with the Flink
>> > > > > Cluster, and it also saves resources since they are recycled
>> > > > > immediately after each use.
>> > > > >
>> > > > > However, in OLAP or StreamingWarehouse scenarios, users might submit
>> > > > > interactive jobs to a dedicated Flink Session Cluster very often. In
>> > > > > this case, we find that short queries that can finish in less than 1s
>> > > > > in the Flink Cluster still have an E2E latency greater than 2s. Hence,
>> > > > > we propose this FLIP to improve the Flink Client performance in this
>> > > > > scenario. This could also improve the user experience when using
>> > > > > session debug mode.
>> > > > >
>> > > > > The major change in this FLIP is a newly introduced option,
>> > > > > *'execution.interactive-client'*. When this option is enabled, Flink
>> > > > > Client will reuse all the necessary resources to improve interactive
>> > > > > performance, including HA services, HTTP connections, threads, and
>> > > > > all kinds of instances related to a long-running Flink Cluster. The
>> > > > > default value of this option will be false, in which case Flink
>> > > > > Client behaves as before.
>> > > > >
>> > > > > Also, this FLIP proposes a configurable RetryStrategy for when the
>> > > > > client fetches results from the Flink Cluster. In interactive
>> > > > > scenarios, this can save more than 15% of TM CPU usage without
>> > > > > performance degradation.
>> > > > >
>> > > > > Looking forward to your feedback, thanks.
>> > > > >
>> > > > > Best regards,
>> > > > > Xiangyu
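
For readers following the retry-strategy discussion in this thread, here is a minimal sketch of why an exponential-delay strategy with a multiplier of 1 degenerates to fixed-delay. This is not Flink's RetryStrategy API or the FLIP-407 implementation; the class `ExponentialDelaySketch`, the method `delayMillis`, and all parameter names and values are hypothetical and chosen only for illustration.

```java
// Hypothetical illustration only; not Flink's actual RetryStrategy API.
final class ExponentialDelaySketch {

    /**
     * Delay before retry number `attempt` (0-based), computed as
     * min(initialDelayMs * multiplier^attempt, maxDelayMs).
     */
    static long delayMillis(long initialDelayMs, double multiplier, long maxDelayMs, int attempt) {
        if (initialDelayMs < 0 || multiplier < 1.0 || maxDelayMs < initialDelayMs || attempt < 0) {
            throw new IllegalArgumentException("invalid retry parameters");
        }
        double delay = initialDelayMs;
        // Multiply once per previous attempt; stop early once the cap is reached.
        for (int i = 0; i < attempt && delay < maxDelayMs; i++) {
            delay *= multiplier;
        }
        return (long) Math.min(delay, (double) maxDelayMs);
    }

    public static void main(String[] args) {
        // Assumed values: 100 ms initial delay, 1000 ms cap.
        for (int attempt = 0; attempt < 5; attempt++) {
            System.out.println("attempt " + attempt
                    + ": multiplier 1.0 -> " + delayMillis(100, 1.0, 1000, attempt) + " ms"
                    + ", multiplier 2.0 -> " + delayMillis(100, 2.0, 1000, attempt) + " ms");
        }
    }
}
```

With a multiplier of 1.0 every attempt waits the initial delay, which matches the fixed-delay behavior mentioned above, while a larger multiplier grows the interval until it hits the cap. A multiplier between the two (for example 1.5) would give the slower growth that the incremental-delay idea was aiming for.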