[
https://issues.apache.org/jira/browse/FLINK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
xiangyu feng updated FLINK-32756:
---------------------------------
Description:
In OLAP scenario, we submit queries to flink session cluster through the
flink-sql-gateway service. When receiving queries, the gateway service will
create sessions to handle the query, each session will create a new
RestClusterClient to submit queries and a new ClientHAServices to discover the
latest address of the JobManager.
In our production usage, we have enabled JobManager ZK HA and use
ZooKeeperClientHAServices to do service discovery. Each ZKClientHAServices will
establish a network connection with ZK and create four ZK related threads.
When QPS reaches 200, more than 1000 sessions are created in a single
flink-sql-gateway instance, which means more than 1000 ZK connections and more
than 4000 ZK related threads are created simultaneously. This will raise a
significant stability risk in production.
To address this problem, we have created SharedZKClientHAService for different
sessions to share a ZK connection and ZKClient.
was:
In OLAP scenario, we submit queries to flink session cluster through the
flink-sql-gateway service. When receiving queries, the gateway service will
create sessions to handle the query, each session will create a new
RestClusterClient to submit queries and a new ClientHAServices to discover the
latest address of the JobManager.
In our production usage, we have enabled JobManager HA and use
ZKClientHAServices to do service discovery. Each ZKClientHAServices will
establish a network connection with ZK and create four ZK related threads.
When QPS reaches 200, more than 1000 sessions are created in a single
flink-sql-gateway instance, which means more than 1000 ZK connections and more
than 4000 ZK related threads are created simultaneously. This will raise a
significant stability risk in production.
> Reuse ClientHighAvailabilityServices in RestClusterClient when submitting
> OLAP jobs
> -----------------------------------------------------------------------------------
>
> Key: FLINK-32756
> URL: https://issues.apache.org/jira/browse/FLINK-32756
> Project: Flink
> Issue Type: Sub-task
> Components: Client / Job Submission
> Reporter: xiangyu feng
> Priority: Major
>
> In OLAP scenario, we submit queries to flink session cluster through the
> flink-sql-gateway service. When receiving queries, the gateway service will
> create sessions to handle the query, each session will create a new
> RestClusterClient to submit queries and a new ClientHAServices to discover
> the latest address of the JobManager.
>
> In our production usage, we have enabled JobManager ZK HA and use
> ZooKeeperClientHAServices to do service discovery. Each ZKClientHAServices
> will establish a network connection with ZK and create four ZK related
> threads.
> When QPS reaches 200, more than 1000 sessions are created in a single
> flink-sql-gateway instance, which means more than 1000 ZK connections and
> more than 4000 ZK related threads are created simultaneously. This will raise
> a significant stability risk in production.
>
> To address this problem, we have created SharedZKClientHAService for
> different sessions to share a ZK connection and ZKClient.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)