[
https://issues.apache.org/jira/browse/FLINK-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636970#comment-17636970
]
Xintong Song commented on FLINK-30101:
--------------------------------------
I'm not sure about the proposed changes. {{StandaloneClientHAServices}} and
{{StandaloneLeaderRetrievalService}} assume there's only one contender, which
should always be the leader. There's no such guarantee when running a YARN
deployment. It is possible that the leadership changes after the client has
fetched the application report, and ZK HA makes sure the REST client always
connects to the latest leader address in such cases.
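To illustrate the concern, here is a minimal toy model in plain Java. It is not Flink code: {{LeaderRegistry}}, {{snapshotAddress}}, and the sample addresses are hypothetical stand-ins. It only shows that an address read once can go stale when the leader changes afterwards, while a live lookup does not.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Toy model, not Flink code: one-time address snapshot vs. live leader lookup. */
public class LeaderLookupSketch {

    /** Hypothetical stand-in for the leader address published through an HA store. */
    static final class LeaderRegistry {
        private final AtomicReference<String> leader = new AtomicReference<>();

        void publish(String address) {
            leader.set(address);
        }

        String currentLeader() {
            return leader.get();
        }
    }

    /** Reads the address once, like taking it from an application report. */
    static String snapshotAddress(LeaderRegistry registry) {
        return registry.currentLeader();
    }

    public static void main(String[] args) {
        LeaderRegistry registry = new LeaderRegistry();
        registry.publish("jm-1:8081"); // hypothetical first leader

        String fromReport = snapshotAddress(registry); // client reads the report

        registry.publish("jm-2:8081"); // leadership changes afterwards

        System.out.println("snapshot:    " + fromReport);
        System.out.println("live lookup: " + registry.currentLeader());
    }
}
```

In this model the snapshot still points at the first address after the change, which is the situation a leader retrieval service is meant to avoid.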
For short SQL jobs, you may want to consider sql-gateway, which does not fetch
the leader address for every submitted job. Unfortunately, there's no such thing
for DataStream / Table API jobs. Alternatively, you may consider a non-HA
cluster if end-to-end latency is what matters most.
> Always use StandaloneClientHAServices to create RestClusterClient when
> retrieving a Flink on YARN cluster client
> ----------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-30101
> URL: https://issues.apache.org/jira/browse/FLINK-30101
> Project: Flink
> Issue Type: Improvement
> Components: Client / Job Submission
> Affects Versions: 1.16.0
> Reporter: Zhanghao Chen
> Priority: Major
> Fix For: 1.17.0
>
>
> *Problem*
> Currently, the procedure of retrieving a Flink on YARN cluster client is as
> follows (in YarnClusterDescriptor#retrieve method):
> # Get application report from YARN
> # Set rest.address & rest.port using the info from application report
> # Create a new RestClusterClient using the updated configuration, which will
> use the client HA service to fetch the rest.address & rest.port if HA is enabled
> Here, we can see that the usage of client HA in step 3 is redundant, as we've
> already got the rest.address & rest.port from YARN application report. When
> ZK HA is enabled, this would take ~1.5 s to initialize client HA services and
> fetch the rest IP & port.
> 1.5 s can mean a lot for latency-sensitive client operations. At my company,
> we use the Flink client to submit short-running session jobs, and e2e latency
> is critical. The job submission time is around 10 s on average, so saving
> 1.5 s would mean a 15% reduction.
> *Proposal*
> When retrieving a Flink on YARN cluster client, use
> StandaloneClientHAServices to create the RestClusterClient instead, as we
> have already fetched rest.address & rest.port from the YARN application
> report. This is also what we did in KubernetesClusterDescriptor.
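The quoted retrieval steps and the proposal can be modeled roughly as follows. This is a plain-Java toy, not code from YarnClusterDescriptor: only the keys rest.address and rest.port come from the issue text, while the class name, method names, the {{useHaLookup}} flag, and the sample host are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Toy model of the quoted retrieval flow; not Flink code. */
public class RetrieveFlowSketch {

    /** Step 2: copy the endpoint from the (modeled) application report into the config. */
    static Map<String, String> withRestEndpoint(Map<String, String> base, String host, int port) {
        Map<String, String> updated = new HashMap<>(base);
        updated.put("rest.address", host);
        updated.put("rest.port", Integer.toString(port));
        return updated;
    }

    /**
     * Step 3: choose the endpoint the client connects to.
     * With useHaLookup == true, the result of the (slow) lookup is used, as in the
     * behaviour the issue describes. With false, the pre-fetched values are used
     * directly, as the proposal suggests.
     */
    static String resolveEndpoint(
            Map<String, String> config, boolean useHaLookup, Supplier<String> haLookup) {
        if (useHaLookup) {
            return haLookup.get();
        }
        return config.get("rest.address") + ":" + config.get("rest.port");
    }

    public static void main(String[] args) {
        // "host-a" and 8081 are hypothetical values standing in for the application report.
        Map<String, String> config = withRestEndpoint(new HashMap<>(), "host-a", 8081);
        System.out.println(resolveEndpoint(config, false, () -> "unused"));
    }
}
```

The model makes the trade-off visible: skipping the lookup saves the lookup cost, but the client then trusts whatever address was in the report at the time it was read.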