Hi devs,

I'm opening this thread to discuss FLIP-407: Improve Flink Client performance in interactive scenarios. The POC test results and design doc can be found at FLIP-407 <https://cwiki.apache.org/confluence/display/FLINK/FLIP-407%3A+Improve+Flink+Client+performance+when+interacting+with+dedicated+Flink+Session+Clusters>.

Currently, the Flink Client is mainly designed for one-time interaction with the Flink Cluster. All resources (HTTP connections, threads, HA services) and instances (ClusterDescriptor, ClusterClient, RestClient) are created and recycled for each interaction. This works well when users do not need to interact with the Flink Cluster frequently, and it also saves resources, since they are released immediately after each use.

However, in OLAP or StreamingWarehouse scenarios, users might submit interactive jobs to a dedicated Flink Session Cluster very often. In this case, we find that short queries that finish in less than 1s in the Flink Cluster still have an E2E latency greater than 2s. Hence, we propose this FLIP to improve Flink Client performance in this scenario. This could also improve the user experience when using session debug mode.

The major change in this FLIP is a newly introduced option, *'execution.interactive-client'*. When this option is enabled, the Flink Client will reuse all the necessary resources to improve interactive performance, including HA services, HTTP connections, threads, and all kinds of instances related to a long-running Flink Cluster. The default value of this option is false, in which case the Flink Client behaves as before.

This FLIP also proposes a configurable RetryStrategy for fetching results from the Flink Cluster on the client side. In interactive scenarios, this can save more than 15% of TM CPU usage without performance degradation.

Looking forward to your feedback, thanks.

Best regards,
Xiangyu
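
As an illustration of the configurable RetryStrategy idea mentioned above, here is a minimal sketch of how the wait between result-fetch attempts could be computed. It assumes an exponential backoff with an upper bound, which is only one possible strategy. The class and method names (`ExponentialBackoffRetry`, `delayForAttempt`, `schedule`) are hypothetical and are not taken from the FLIP or from Flink's API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a capped exponential backoff; not a Flink class. */
final class ExponentialBackoffRetry {
    private final long initialDelayMillis;
    private final long maxDelayMillis;
    private final double multiplier;

    ExponentialBackoffRetry(long initialDelayMillis, long maxDelayMillis, double multiplier) {
        if (initialDelayMillis <= 0 || maxDelayMillis < initialDelayMillis || multiplier < 1.0) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        this.initialDelayMillis = initialDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
        this.multiplier = multiplier;
    }

    /** Delay before the given 0-based attempt, capped at maxDelayMillis. */
    long delayForAttempt(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        double delay = initialDelayMillis * Math.pow(multiplier, attempt);
        // Math.min also handles the case where the power overflows to infinity.
        return (long) Math.min(delay, (double) maxDelayMillis);
    }

    /** Delays for the first n attempts, in order. */
    List<Long> schedule(int attempts) {
        List<Long> delays = new ArrayList<>();
        for (int i = 0; i < attempts; i++) {
            delays.add(delayForAttempt(i));
        }
        return delays;
    }
}
```

With an initial delay of 10 ms, a multiplier of 2.0 and a cap of 100 ms, the delay grows while no result is available and then stays at the cap, so an idle client polls the cluster less often than it would with a short fixed interval.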
