[ 
https://issues.apache.org/jira/browse/FLINK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangyu feng updated FLINK-32756:
---------------------------------
    Description: 
In OLAP scenario, we submit queries to flink session cluster through the 
flink-sql-gateway service. When receiving queries, the gateway service will 
create sessions to handle the query, each session will create a new 
RestClusterClient to submit queries and a new ClientHAServices to discover the 
latest address of the JobManager.

 

In our production usage, we have enabled JobManager ZK HA and use 
ZooKeeperClientHAServices to do service discovery. Each ZKClientHAServices will 
establish a network connection with ZK and create four ZK related threads.

When QPS reaches 200, more than 1000 sessions are created in a single 
flink-sql-gateway instance, which means more than 1000 ZK connections and more 
than 4000 ZK related threads are created simultaneously. This will raise a 
significant stability risk in production.

 

To address this problem, we have created SharedZKClientHAService for different 
sessions to share a ZK connection and ZKClient.

  was:
In OLAP scenario, we submit queries to flink session cluster through the 
flink-sql-gateway service. When receiving queries, the gateway service will 
create sessions to handle the query, each session will create a new 
RestClusterClient to submit queries and a new ClientHAServices to discover the 
latest address of the JobManager.

In our production usage, we have enabled JobManager HA and use 
ZKClientHAServices to do service discovery. Each ZKClientHAServices will 
establish a network connection with ZK and create four ZK related threads. 

When QPS reaches 200, more than 1000 sessions are created in a single 
flink-sql-gateway instance, which means more than 1000 ZK connections and more 
than 4000 ZK related threads are created simultaneously. This will raise a 
significant stability risk in production.


> Reuse ClientHighAvailabilityServices in RestClusterClient when submitting 
> OLAP jobs
> -----------------------------------------------------------------------------------
>
>                 Key: FLINK-32756
>                 URL: https://issues.apache.org/jira/browse/FLINK-32756
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Client / Job Submission
>            Reporter: xiangyu feng
>            Priority: Major
>
> In OLAP scenario, we submit queries to flink session cluster through the 
> flink-sql-gateway service. When receiving queries, the gateway service will 
> create sessions to handle the query, each session will create a new 
> RestClusterClient to submit queries and a new ClientHAServices to discover 
> the latest address of the JobManager.
>  
> In our production usage, we have enabled JobManager ZK HA and use 
> ZooKeeperClientHAServices to do service discovery. Each ZKClientHAServices 
> will establish a network connection with ZK and create four ZK related 
> threads.
> When QPS reaches 200, more than 1000 sessions are created in a single 
> flink-sql-gateway instance, which means more than 1000 ZK connections and 
> more than 4000 ZK related threads are created simultaneously. This will raise 
> a significant stability risk in production.
>  
> To address this problem, we have created SharedZKClientHAService for 
> different sessions to share a ZK connection and ZKClient.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to