[
https://issues.apache.org/jira/browse/FLINK-33683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
xiangyu feng updated FLINK-33683:
---------------------------------
Description:
Submitting jobs to a long-running Flink cluster and fetching their results
currently involves a lot of unnecessary overhead. This is acceptable for
streaming and batch jobs, because in these scenarios users do not frequently
submit jobs to, or fetch results from, a running cluster.
But in OLAP scenarios, users continuously submit many short-lived jobs
to the running cluster. In this situation, the overhead has a large
impact on the E2E performance. Here are some examples of unnecessary overhead:
* Each `RemoteExecutor` creates a new `StandaloneClusterDescriptor` for
every job it executes, even on the same remote cluster
* `StandaloneClusterDescriptor` always creates a new `RestClusterClient`
when retrieving an existing Flink cluster
* Each `RestClusterClient` creates a new `ClientHighAvailabilityServices`,
which might contain a resource-consuming HA client (ZKClient or KubeClient) and
perform a time-consuming leader retrieval operation
* `RestClient` creates a new connection for every request, which costs
extra connection establishment time
Therefore, I suggest using this ticket and the following subtasks to improve
this performance. This ticket is also related to FLINK-25318.
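The common pattern behind the examples above is that an expensive object is rebuilt for every job even though the target cluster has not changed. A minimal sketch of the general remedy, caching one client per cluster address, might look like the following. It is not Flink code and not the change proposed in the subtasks: `ClientCache`, `FakeClusterClient`, and the address string are hypothetical names used only for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch: reuse one expensive client per cluster address. */
public class ClientReuseSketch {

    /** Caches one client per key; the factory runs at most once per key. */
    static final class ClientCache<K, C> {
        private final Map<K, C> clients = new ConcurrentHashMap<>();
        private final Function<K, C> factory;

        ClientCache(Function<K, C> factory) {
            this.factory = factory;
        }

        /** Returns the cached client for the key, creating it on first use. */
        C get(K key) {
            return clients.computeIfAbsent(key, factory);
        }

        int size() {
            return clients.size();
        }
    }

    /** Stand-in for a client that is costly to construct (assumption). */
    static final class FakeClusterClient {
        final String address;

        FakeClusterClient(String address) {
            this.address = address;
        }
    }

    public static void main(String[] args) {
        AtomicInteger created = new AtomicInteger();
        ClientCache<String, FakeClusterClient> cache = new ClientCache<>(address -> {
            created.incrementAndGet(); // count how often the factory really runs
            return new FakeClusterClient(address);
        });

        // Simulate many short-lived jobs submitted to the same cluster.
        for (int job = 0; job < 1000; job++) {
            cache.get("localhost:8081");
        }
        System.out.println("clients created: " + created.get()); // prints "clients created: 1"
    }
}
```

A real implementation would also need to close cached clients and invalidate entries when a cluster goes away or its leader changes, which this sketch leaves out.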
> Improve the performance of submitting jobs and fetching results to a running
> flink cluster
> ------------------------------------------------------------------------------------------
>
> Key: FLINK-33683
> URL: https://issues.apache.org/jira/browse/FLINK-33683
> Project: Flink
> Issue Type: Improvement
> Components: Client / Job Submission, Table SQL / Client
> Reporter: xiangyu feng
> Priority: Major