[
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759570#comment-17759570
]
Yang Wang edited comment on FLINK-32678 at 8/29/23 5:20 AM:
------------------------------------------------------------
*Stress Test*
Run 1000 Flink jobs, each with 1 JM and 1 TM.
1. Flink version 1.15.4 with {{high-availability.use-old-ha-services=true}}
Each Flink JobManager has 4 leader electors (RestServer, ResourceManager, Dispatcher,
JobManager) that periodically update the K8s ConfigMap. So the QPS of {{PUT
ConfigMap}} for 1000 jobs is roughly 800 req/s ≈ 4 (leader electors) *
1000 (Flink JobManager pods) / 5 (renew interval in seconds). The QPS of {{GET
ConfigMap}} is twice that of {{PUT}}.
2. Flink version 1.17.1 (same as 1.15.4 with
{{high-availability.use-old-ha-services=false}})
Flink only has one shared leader elector. So the QPS of {{PUT ConfigMap}}
for 1000 jobs is roughly 200 req/s ≈ 1 (leader elector) * 1000 (Flink
JobManager pods) / 5 (renew interval in seconds). The QPS of {{GET ConfigMap}} is twice
that of {{PUT}}.
3. Flink version 1.18-snapshot
Flink only has one shared leader elector. So the QPS of {{PATCH
ConfigMap}} for 1000 jobs is roughly 200 req/s ≈ 1 (leader elector) *
1000 (Flink JobManager pods) / 5 (renew interval in seconds). The QPS of {{GET
ConfigMap}} is the same as that of {{PATCH}} (see the estimation sketch after this list).
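As a sanity check on the estimates above, here is a minimal back-of-the-envelope sketch (the elector counts, pod count, and 5s renew interval are taken from the scenarios above; the class and method names are purely illustrative):
{code:java}
/** Rough estimate of leader-election ConfigMap write QPS against the K8s APIServer. */
public class LeaderElectionQpsEstimate {

    /** writes/sec ≈ electors per JobManager * JobManager pods / renew interval (seconds). */
    static double writeQps(int electorsPerJobManager, int jobManagerPods, int renewIntervalSeconds) {
        return (double) electorsPerJobManager * jobManagerPods / renewIntervalSeconds;
    }

    public static void main(String[] args) {
        // 1.15.4 with the old HA services: 4 leader electors per JobManager.
        System.out.println("1.15.4 PUT QPS:   " + writeQps(4, 1000, 5)); // 800.0
        // 1.17.1: one shared leader elector per JobManager.
        System.out.println("1.17.1 PUT QPS:   " + writeQps(1, 1000, 5)); // 200.0
        // 1.18.0: one shared leader elector, renewing via PATCH instead of GET + PUT.
        System.out.println("1.18.0 PATCH QPS: " + writeQps(1, 1000, 5)); // 200.0
    }
}
{code}
The GET columns in the table below then follow the ratios observed above: twice the write QPS for the PUT-based renewal and equal to the write QPS for the PATCH-based renewal.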
!qps-configmap-put-115.png|width=694,height=176!
!qps-configmap-put-117.jpg|width=694,height=176!
!qps-configmap-patch-118.png|width=694,height=176!
From the pictures above, we can verify that the new leader elector in 1.18 sends
only a quarter of the write requests to the K8s APIServer compared with the old
one in 1.15. This significantly reduces the stress on the K8s APIServer.
!qps-configmap-get-115.png|width=694,height=176!
!qps-configmap-get-117.jpg|width=694,height=176!
!qps-configmap-get-118.png|width=694,height=176!
We can also see that the read requests in 1.18 are half of those in 1.17. The root cause
is that fabric8 6.6.2 (FLINK-31997) introduced the PATCH HTTP method for updating
the leader annotation, which saves a GET request on each update.
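To make that difference concrete, here is a hedged sketch of the two renewal styles against the leader ConfigMap. This is not Flink's or fabric8's actual leader-election code; the namespace, ConfigMap name, and leader record are placeholders, and it assumes the fabric8 6.x resource/patch APIs:
{code:java}
import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.dsl.base.PatchContext;
import io.fabric8.kubernetes.client.dsl.base.PatchType;

import java.util.HashMap;

public class LeaderAnnotationRenewalSketch {

    // Annotation key used by the ConfigMap-based leader lock.
    private static final String LEADER_ANNOTATION = "control-plane.alpha.kubernetes.io/leader";

    public static void main(String[] args) {
        String ns = "default";                            // placeholder namespace
        String name = "my-flink-cluster-leader";          // placeholder ConfigMap name
        String record = "{\"holderIdentity\":\"jm-0\"}";  // simplified leader record

        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Renewal before fabric8 6.6.2: read-modify-write, i.e. one GET plus one PUT.
            ConfigMap current = client.configMaps().inNamespace(ns).withName(name).get();
            if (current.getMetadata().getAnnotations() == null) {
                current.getMetadata().setAnnotations(new HashMap<>());
            }
            current.getMetadata().getAnnotations().put(LEADER_ANNOTATION, record);
            client.configMaps().inNamespace(ns).resource(current).update();

            // Renewal with fabric8 6.6.2+: a single JSON merge PATCH, no preceding GET.
            String patch = "{\"metadata\":{\"annotations\":{\""
                    + LEADER_ANNOTATION + "\":" + quoteJson(record) + "}}}";
            client.configMaps().inNamespace(ns).withName(name)
                    .patch(PatchContext.of(PatchType.JSON_MERGE), patch);
        }
    }

    // Minimal JSON string quoting for the embedded leader record (illustrative only).
    private static String quoteJson(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
{code}
Per renewal, the first style issues two requests and the second issues one write, which matches the GET QPS dropping from twice the write QPS in 1.17 to equal to it in 1.18.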
||Flink Version||PUT/PATCH QPS||GET QPS||
|1.15.4 with old HA|800|1600|
|1.17.1|200|400|
|1.18.0|200|200|
All in all, Flink 1.18 puts less stress on the K8s APIServer while all 1000
Flink jobs keep running normally as before.
> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -------------------------------------------------------------------------
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Coordination
> Affects Versions: 1.18.0
> Reporter: Matthias Pohl
> Assignee: Yang Wang
> Priority: Major
> Labels: release-testing
> Fix For: 1.18.0
>
> Attachments: qps-configmap-get-115.png, qps-configmap-get-117.jpg,
> qps-configmap-get-118.png, qps-configmap-patch-118.png,
> qps-configmap-put-115.png, qps-configmap-put-117.jpg
>
>
> -We decided to do another round of testing for the LeaderElection refactoring
> which happened in
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a larger number of jobs in a Flink
> environment to look for unusual behavior. This Jira issue shall cover the
> following 1.18 efforts:
> * Leader Election refactoring
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
> FLINK-26522)
> * Akka to Pekko transition (FLINK-32468)
> * flink-shaded 17.0 updates (FLINK-32032)