Hi devs,

Thanks for all the feedback. If there are no more comments, I would like to start a vote for this FLIP. Thanks again!

Best,
Xiangyu Feng

On Tue, Jan 9, 2024 at 14:45, Weihua Hu <huweihua....@gmail.com> wrote:

> Thanks for proposing this FLIP.
>
> Experiments have shown that it significantly enhances the real-time query
> experience.
> +1 for this.
>
> Best,
> Weihua
>
> On Mon, Jan 8, 2024 at 5:19 PM Rui Fan <1996fan...@gmail.com> wrote:
>
>> Thanks Xiangyu for the quick update!
>>
>> LGTM
>>
>> Best,
>> Rui
>>
>> On Mon, Jan 8, 2024 at 4:27 PM xiangyu feng <xiangyu...@gmail.com> wrote:
>>
>> > Hi Rui and Yong,
>> >
>> > Thanks for your reply.
>> >
>> > My initial intention here was that, for short-lived jobs under high QPS,
>> > a fixed-delay retry strategy wastes resources and is not flexible
>> > enough, while an exponential-backoff strategy might significantly
>> > increase the query latency because the interval grows too fast. An
>> > incremental-delay strategy could strike a balance between resource
>> > consumption and short-query latency.
>> >
>> > On second thought, an exponential-delay retry strategy with a
>> > configurable multiplier option can also achieve this goal. By setting
>> > the default value of the multiplier to 1, we can stay consistent with
>> > the original behavior and reduce the number of configuration items at
>> > the same time.
>> >
>> > I've updated this FLIP accordingly and look forward to your feedback.
>> >
>> > Regards,
>> > Xiangyu Feng
>> >
>> > On Mon, Jan 8, 2024 at 15:29, Rui Fan <1996fan...@gmail.com> wrote:
>> >
>> >> Only one strategy is fine with me.
>> >>
>> >> When the multiplier is set to 1, the exponential-delay strategy becomes
>> >> fixed-delay, so a separate fixed-delay strategy may not be needed.
>> >>
>> >> Best,
>> >> Rui
>> >>
>> >> On Mon, Jan 8, 2024 at 2:17 PM Yong Fang <zjur...@gmail.com> wrote:
>> >>
>> >> > I agree with @Rui that the current configuration for Flink Client is
>> >> > a little complex. Can we just provide one strategy with fewer
>> >> > configuration items for all scenarios?
>> >> >
>> >> > Best,
>> >> > Fang Yong
>> >> >
>> >> > On Mon, Jan 8, 2024 at 11:19 AM Rui Fan <1996fan...@gmail.com> wrote:
>> >> >
>> >> > > Thanks xiangyu for driving this proposal! And sorry for the
>> >> > > late reply.
>> >> > >
>> >> > > Overall it looks good to me. I only have some minor questions:
>> >> > >
>> >> > > 1. Do we need to introduce 3 collect strategies in the first
>> >> > > version?
>> >> > >
>> >> > > Large and comprehensive configuration items bring additional
>> >> > > learning and usage costs to users. I tend to provide users with
>> >> > > out-of-the-box parameters, and 2 collect strategies may be enough
>> >> > > for users.
>> >> > >
>> >> > > IIUC, there is no big difference between exponential-delay and
>> >> > > incremental-delay, especially with the default parameters provided.
>> >> > > I wonder whether we could provide a multiplier for the
>> >> > > exponential-delay strategy and remove the incremental-delay
>> >> > > strategy?
>> >> > >
>> >> > > Of course, if you think the multiplier option is not needed based
>> >> > > on your production experience, that's totally fine with me. Simple
>> >> > > is better.
>> >> > >
>> >> > > 2. Which strategy do you think is best in mass production?
>> >> > >
>> >> > > I'm working on FLIP-364[1], which is related to the Flink failover
>> >> > > restart strategy. IIUC, when one cluster only has a few Flink jobs,
>> >> > > fixed-delay is fine. It guarantees minimal latency without too
>> >> > > much stress. But if one cluster has too many jobs, fixed-delay
>> >> > > may not be stable.
>> >> > >
>> >> > > Do you think exponential-delay is better than fixed-delay in this
>> >> > > scenario? And which strategy is used in your production for now?
>> >> > > Would you mind sharing it?
>> >> > >
>> >> > > Looking forward to your opinion~
>> >> > >
>> >> > > Best,
>> >> > > Rui
>> >> > >
>> >> > > On Sat, Jan 6, 2024 at 5:54 PM xiangyu feng <xiangyu...@gmail.com> wrote:
>> >> > >
>> >> > > > Hi all,
>> >> > > >
>> >> > > > Thanks for the comments.
>> >> > > >
>> >> > > > If there is no further comment, we will open the voting thread
>> >> > > > next week.
>> >> > > >
>> >> > > > Regards,
>> >> > > > Xiangyu
>> >> > > >
>> >> > > > On Wed, Jan 3, 2024 at 16:46, Zhanghao Chen <zhanghao.c...@outlook.com> wrote:
>> >> > > >
>> >> > > > > Thanks for driving this effort on improving the interactive use
>> >> > > > > experience of Flink. The proposal overall looks good to me.
>> >> > > > >
>> >> > > > > Best,
>> >> > > > > Zhanghao Chen
>> >> > > > > ________________________________
>> >> > > > > From: xiangyu feng <xiangyu...@gmail.com>
>> >> > > > > Sent: Tuesday, December 26, 2023 16:51
>> >> > > > > To: dev@flink.apache.org <dev@flink.apache.org>
>> >> > > > > Subject: [Discuss] FLIP-407: Improve Flink Client performance in
>> >> > > > > interactive scenarios
>> >> > > > >
>> >> > > > > Hi devs,
>> >> > > > >
>> >> > > > > I'm opening this thread to discuss FLIP-407: Improve Flink
>> >> > > > > Client performance in interactive scenarios. The POC test
>> >> > > > > results and design doc can be found at: FLIP-407
>> >> > > > > <https://cwiki.apache.org/confluence/display/FLINK/FLIP-407%3A+Improve+Flink+Client+performance+when+interacting+with+dedicated+Flink+Session+Clusters>.
>> >> > > > >
>> >> > > > > Currently, Flink Client is mainly designed for one-time
>> >> > > > > interaction with the Flink Cluster. All the resources (HTTP
>> >> > > > > connections, threads, HA services) and instances
>> >> > > > > (ClusterDescriptor, ClusterClient, RestClient) are created and
>> >> > > > > recycled for each interaction.
>> >> > > > > This works well when users do not need to interact frequently
>> >> > > > > with the Flink Cluster, and it also saves resources since they
>> >> > > > > are recycled immediately after each use.
>> >> > > > >
>> >> > > > > However, in OLAP or StreamingWarehouse scenarios, users might
>> >> > > > > submit interactive jobs to a dedicated Flink Session Cluster
>> >> > > > > very often. In this case, we find that short queries that can
>> >> > > > > finish in less than 1s in the Flink Cluster still have an E2E
>> >> > > > > latency greater than 2s. Hence, we propose this FLIP to improve
>> >> > > > > the Flink Client performance in this scenario. This could also
>> >> > > > > improve the user experience when using session debug mode.
>> >> > > > >
>> >> > > > > The major change in this FLIP is a newly introduced option,
>> >> > > > > *'execution.interactive-client'*. When this option is enabled,
>> >> > > > > Flink Client will reuse all the necessary resources to improve
>> >> > > > > interactive performance, including HA services, HTTP
>> >> > > > > connections, threads and all kinds of instances related to a
>> >> > > > > long-running Flink Cluster. The default value of this option
>> >> > > > > will be false, in which case Flink Client behaves as before.
>> >> > > > >
>> >> > > > > Also, this FLIP proposes a configurable RetryStrategy for
>> >> > > > > fetching results from the Flink Cluster on the client side. In
>> >> > > > > interactive scenarios, this can save more than 15% of TM CPU
>> >> > > > > usage without performance degradation.
>> >> > > > >
>> >> > > > > Looking forward to your feedback, thanks.
>> >> > > > >
>> >> > > > > Best regards,
>> >> > > > > Xiangyu
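
For readers following the retry-strategy discussion in this thread, here is a minimal sketch of how an exponential-delay schedule with a configurable multiplier could behave, and why a multiplier of 1 reduces it to fixed-delay. This is an illustration only, not the FLIP's implementation: the class name, field names, default values and the max-delay cap are all assumptions made for the example, and the code is plain Python rather than Flink code.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class ExponentialDelayRetry:
    """Hypothetical retry schedule; names and defaults are illustrative only."""

    initial_delay_ms: int = 100
    max_delay_ms: int = 1000   # assumed cap so the interval cannot grow unbounded
    multiplier: float = 2.0    # 1.0 reproduces a fixed-delay schedule
    attempts: int = 5

    def __post_init__(self) -> None:
        # A multiplier below 1 would shrink the delay, which is not a backoff.
        if self.multiplier < 1.0:
            raise ValueError("multiplier must be >= 1.0")

    def delays(self) -> Iterator[int]:
        """Yield the wait time in milliseconds before each retry attempt."""
        delay = float(self.initial_delay_ms)
        for _ in range(self.attempts):
            yield int(min(delay, self.max_delay_ms))
            delay *= self.multiplier


# multiplier = 1: every interval equals the initial delay (fixed-delay behavior).
fixed = list(ExponentialDelayRetry(multiplier=1.0).delays())

# multiplier = 2: classic exponential backoff, capped at max_delay_ms.
doubling = list(ExponentialDelayRetry(multiplier=2.0).delays())

# multiplier = 1.5: slower growth, closer to the incremental-delay idea.
gentle = list(ExponentialDelayRetry(multiplier=1.5).delays())
```

With these assumed defaults, a multiplier between 1 and 2 gives a middle ground between the two extremes discussed above: intervals grow, but slowly enough that short queries are not penalized by long waits.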