[ https://issues.apache.org/jira/browse/FLINK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
xiangyu feng updated FLINK-32756: --------------------------------- Description: In OLAP scenario, we submit queries to flink session cluster through the flink-sql-gateway service. When receiving queries, the gateway service will create sessions to handle the query, each session will create a new RestClusterClient to submit queries and a new ClientHAServices to discover the latest address of the JobManager. In our production usage, we have enabled JobManager ZK HA and use ZooKeeperClientHAServices to do service discovery. Each ZKClientHAServices will establish a network connection with ZK and create four ZK related threads. When QPS reaches 200, more than 1000 sessions are created in a single flink-sql-gateway instance, which means more than 1000 ZK connections and more than 4000 ZK related threads are created simultaneously. This will raise a significant stability risk in production. To address this problem, we have created SharedZKClientHAService for different sessions to share a ZK connection and ZKClient. was: In OLAP scenario, we submit queries to flink session cluster through the flink-sql-gateway service. When receiving queries, the gateway service will create sessions to handle the query, each session will create a new RestClusterClient to submit queries and a new ClientHAServices to discover the latest address of the JobManager. In our production usage, we have enabled JobManager HA and use ZKClientHAServices to do service discovery. Each ZKClientHAServices will establish a network connection with ZK and create four ZK related threads. When QPS reaches 200, more than 1000 sessions are created in a single flink-sql-gateway instance, which means more than 1000 ZK connections and more than 4000 ZK related threads are created simultaneously. This will raise a significant stability risk in production. > Reuse ClientHighAvailabilityServices in RestClusterClient when submitting > OLAP jobs > ----------------------------------------------------------------------------------- > > Key: FLINK-32756 > URL: https://issues.apache.org/jira/browse/FLINK-32756 > Project: Flink > Issue Type: Sub-task > Components: Client / Job Submission > Reporter: xiangyu feng > Priority: Major > > In OLAP scenario, we submit queries to flink session cluster through the > flink-sql-gateway service. When receiving queries, the gateway service will > create sessions to handle the query, each session will create a new > RestClusterClient to submit queries and a new ClientHAServices to discover > the latest address of the JobManager. > > In our production usage, we have enabled JobManager ZK HA and use > ZooKeeperClientHAServices to do service discovery. Each ZKClientHAServices > will establish a network connection with ZK and create four ZK related > threads. > When QPS reaches 200, more than 1000 sessions are created in a single > flink-sql-gateway instance, which means more than 1000 ZK connections and > more than 4000 ZK related threads are created simultaneously. This will raise > a significant stability risk in production. > > To address this problem, we have created SharedZKClientHAService for > different sessions to share a ZK connection and ZKClient. -- This message was sent by Atlassian Jira (v8.20.10#820010)