[ https://issues.apache.org/jira/browse/FLINK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
xiangyu feng updated FLINK-32756: --------------------------------- Description: In OLAP scenario, we submit queries to flink session cluster through the flink-sql-gateway service. When receiving queries, the gateway service will create sessions to handle the query, each session will create a new RestClusterClient and a new ClientHAServices. In our production usage, we have enabled JobManager ZK HA and use ZooKeeperClientHAServices to do service discovery. Each ZKClientHAServices will establish a network connection with ZK and create four ZK related threads. When QPS reaches 200, more than 1000 sessions are created in a single flink-sql-gateway instance, which means more than 1000 ZK connections and 4000 ZK related threads are created simultaneously. This will raise a significant stability risk in production. To address this problem, we have created SharedZKClientHAService for different sessions to share a ZK connection and ZKClient. was: In OLAP scenario, we submit queries to flink session cluster through the flink-sql-gateway service. When receiving queries, the gateway service will create sessions to handle the query, each session will create a new RestClusterClient and a new ClientHAServices. In our production usage, we have enabled JobManager ZK HA and use ZooKeeperClientHAServices to do service discovery. Each ZKClientHAServices will establish a network connection with ZK and create four ZK related threads. When QPS reaches 200, more than 1000 sessions are created in a single flink-sql-gateway instance, which means more than 1000 ZK connections and more than 4000 ZK related threads are created simultaneously. This will raise a significant stability risk in production. To address this problem, we have created SharedZKClientHAService for different sessions to share a ZK connection and ZKClient. > Reuse ClientHighAvailabilityServices in RestClusterClient when submitting > OLAP jobs > ----------------------------------------------------------------------------------- > > Key: FLINK-32756 > URL: https://issues.apache.org/jira/browse/FLINK-32756 > Project: Flink > Issue Type: Sub-task > Components: Client / Job Submission > Reporter: xiangyu feng > Priority: Major > > In OLAP scenario, we submit queries to flink session cluster through the > flink-sql-gateway service. When receiving queries, the gateway service will > create sessions to handle the query, each session will create a new > RestClusterClient and a new ClientHAServices. > > In our production usage, we have enabled JobManager ZK HA and use > ZooKeeperClientHAServices to do service discovery. Each ZKClientHAServices > will establish a network connection with ZK and create four ZK related > threads. > > When QPS reaches 200, more than 1000 sessions are created in a single > flink-sql-gateway instance, which means more than 1000 ZK connections and > 4000 ZK related threads are created simultaneously. This will raise a > significant stability risk in production. > > To address this problem, we have created SharedZKClientHAService for > different sessions to share a ZK connection and ZKClient. -- This message was sent by Atlassian Jira (v8.20.10#820010)