[
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759570#comment-17759570
]
Yang Wang commented on FLINK-32678:
-----------------------------------
*Stress Test*
Run 1000 Flink jobs, each with 1 JM and 1 TM.
1. Flink version 1.15.4 with {{high-availability.use-old-ha-services=true}}
Each Flink JobManager has 4 leader electors (RestServer, ResourceManager, Dispatcher,
JobManager) that periodically update the K8s ConfigMap. So the QPS of {{PUT
ConfigMap}} for 1000 jobs will be roughly 800 req/s ≈ 4 (leader electors) *
1000 (Flink JobManager pods) / 5 s (renew interval).
2. Flink version 1.18-snapshot
Each Flink JobManager has only one shared leader elector. So the QPS of {{PATCH
ConfigMap}} for 1000 jobs will be roughly 200 req/s ≈ 1 (leader elector) *
1000 (Flink JobManager pods) / 5 s (renew interval).
!qos-configmap-put-115.png|width=694,height=176!
!qos-configmap-patch-118.png|width=694,height=176!
From the above two pictures, we can verify that the new leader elector in
1.18 sends only a quarter of the write requests that the old one in 1.15 sends
to the K8s APIServer. This significantly reduces the stress on the K8s APIServer.
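As a rough sanity check, the arithmetic behind the two write-QPS estimates can be sketched in Python. The function and parameter names are hypothetical and only restate the formula used above; the sketch assumes every leader elector sends exactly one write per renew interval and that the interval is 5 seconds.

```python
def estimated_write_qps(electors_per_jm: int, jobmanagers: int, renew_interval_s: float) -> float:
    """Estimate ConfigMap write requests per second.

    Assumes each leader elector issues one write per renew interval
    (hypothetical model matching the estimate in the comment above).
    """
    return electors_per_jm * jobmanagers / renew_interval_s


# Flink 1.15.4 with the old HA services: 4 leader electors per JobManager.
old_qps = estimated_write_qps(electors_per_jm=4, jobmanagers=1000, renew_interval_s=5)

# Flink 1.18-snapshot: 1 shared leader elector per JobManager.
new_qps = estimated_write_qps(electors_per_jm=1, jobmanagers=1000, renew_interval_s=5)

print(f"1.15 PUT ConfigMap:   ~{old_qps:.0f} req/s")
print(f"1.18 PATCH ConfigMap: ~{new_qps:.0f} req/s")
print(f"ratio (1.18 / 1.15):  {new_qps / old_qps:.2f}")
```

This is only the back-of-the-envelope model; the attached graphs show the measured request rates.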
!qos-configmap-get-115.png|width=694,height=176!
!qos-configmap-get-118.png|width=694,height=176!
We also find that the read requests are only 1/8 of those in 1.15. The root cause
is that fabric8 6.6.2 (FLINK-31997) introduced the PATCH HTTP method for updating
the leader annotation, which saves a GET request for each update.
All in all, Flink 1.18 puts less stress on the K8s APIServer, while all
1000 Flink jobs run normally as before.
> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -------------------------------------------------------------------------
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Coordination
> Affects Versions: 1.18.0
> Reporter: Matthias Pohl
> Assignee: Yang Wang
> Priority: Major
> Labels: release-testing
> Fix For: 1.18.0
>
> Attachments: qos-configmap-get-115.png, qos-configmap-get-118.png,
> qos-configmap-patch-118.png, qos-configmap-put-115.png
>
>
> -We decided to do another round of testing for the LeaderElection refactoring
> which happened in
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a larger number of jobs in a Flink
> environment to look for unusual behavior. This Jira issue shall cover the
> following 1.18 efforts:
> * Leader Election refactoring
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
> FLINK-26522)
> * Akka to Pekko transition (FLINK-32468)
> * flink-shaded 17.0 updates (FLINK-32032)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)