[jira] [Comment Edited] (FLINK-32678) Release Testing: Stress-Test to cover multiple low-level changes in Flink

Yang Wang (Jira) Sun, 27 Aug 2023 21:05:15 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759429#comment-17759429
 ]


Yang Wang edited comment on FLINK-32678 at 8/28/23 4:04 AM:
------------------------------------------------------------

*Functionality Test*
1. [SUCCEED] Build the Docker image from the release-1.18 branch
2. [SUCCEED] Use the flink-k8s-operator to start a Flink app with HA enabled, 
and check the logs and the UI
3. [SUCCEED] Check the HA ConfigMaps: one for leader election and one for the 
job checkpoint
4. [SUCCEED] Check the thread dump of the JobManager and verify that only one 
leader elector is running (the value was 4 before 1.15 with the old HA)
5. [SUCCEED] Use the command {{kubectl exec 
flink-example-statemachine-897cb6d4f-bzdv5 -- /bin/sh -c 'kill 1'}} to kill the 
JobManager and verify that no additional TaskManager is created (Flink should 
reuse the existing TaskManager before the idle timeout).
6. [SUCCEED] Verify that the Flink job recovers from the latest checkpoint and 
keeps running
2023-08-28 03:40:29,167 INFO 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring job 
596bdc6b7ac5bcefb611c3df08d64520 from Checkpoint 101 @ 1693194000259 for 
596bdc6b7ac5bcefb611c3df08d64520 located at 
oss://flink-test/flink-k8s-ha-stress-test/flink-cp/596bdc6b7ac5bcefb611c3df08d64520/chk-101.
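
The recovery check in step 6 can also be done mechanically instead of by reading the log. Below is a minimal sketch, assuming the restore message keeps the exact wording of the CheckpointCoordinator line above; the names {{RESTORE_RE}}, {{Restore}} and {{parse_restore}} are hypothetical helpers for illustration and are not part of Flink.

```python
import re
from typing import NamedTuple, Optional

# Assumption: the message format matches the CheckpointCoordinator line quoted
# above. The trailing "." of the log sentence is not part of the path.
RESTORE_RE = re.compile(
    r"Restoring job (?P<job>[0-9a-f]{32}) from Checkpoint (?P<cp>\d+) "
    r"@ (?P<ts>\d+) for (?P=job) located at (?P<path>\S+?)\.?\s*$"
)


class Restore(NamedTuple):
    job_id: str
    checkpoint_id: int
    timestamp_ms: int
    path: str


def parse_restore(line: str) -> Optional[Restore]:
    """Return the restore details from a log line, or None if it does not match."""
    m = RESTORE_RE.search(line)
    if m is None:
        return None
    return Restore(m["job"], int(m["cp"]), int(m["ts"]), m["path"])


# The log line from this test run, joined back into a single line.
LINE = (
    "2023-08-28 03:40:29,167 INFO "
    "org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - "
    "Restoring job 596bdc6b7ac5bcefb611c3df08d64520 from Checkpoint 101 "
    "@ 1693194000259 for 596bdc6b7ac5bcefb611c3df08d64520 located at "
    "oss://flink-test/flink-k8s-ha-stress-test/flink-cp/"
    "596bdc6b7ac5bcefb611c3df08d64520/chk-101."
)

restore = parse_restore(LINE)
# The checkpoint directory name should agree with the reported checkpoint ID.
consistent = restore is not None and restore.path.endswith(
    f"/chk-{restore.checkpoint_id}"
)
```

Comparing {{checkpoint_id}} with the last completed checkpoint shown in the UI before the kill would confirm that the job really resumed from the latest one.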
 
 
Everything works well after the refactoring of leader election, Akka, and 
flink-shaded. I only found one log message that could be improved by replacing 
the object ID with a more meaningful name.
2023-08-28 03:40:18,258 INFO 
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
LeaderContender has been registered under component 'resourcemanager' for 
org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver@2a19a0fe.
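
The {{@2a19a0fe}} suffix has the shape of Java's default {{Object.toString()}} output (class name, {{@}}, hex identity hash). To look for other messages of this kind in the logs, something like the following could be used. It is only a rough sketch: {{IDENTITY_RE}} and {{find_identity_refs}} are hypothetical names, and the pattern assumes fully qualified class names with at least one package segment.

```python
import re

# Assumption: a default Java toString() looks like "<pkg>.<Class>@<hex hash>",
# where the hash has at most 8 lowercase hex digits.
IDENTITY_RE = re.compile(
    r"((?:[A-Za-z_$][\w$]*\.)+[A-Za-z_$][\w$]*)@([0-9a-f]{1,8})\b"
)


def find_identity_refs(line):
    """Return (simple class name, hex hash) pairs found in a log line."""
    return [
        (m.group(1).rsplit(".", 1)[-1], m.group(2))
        for m in IDENTITY_RE.finditer(line)
    ]


# The log line from this test run, joined back into a single line.
LINE = (
    "2023-08-28 03:40:18,258 INFO "
    "org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - "
    "LeaderContender has been registered under component 'resourcemanager' for "
    "org.apache.flink.kubernetes.highavailability."
    "KubernetesLeaderElectionDriver@2a19a0fe."
)

refs = find_identity_refs(LINE)
```

For the line above this reports only the {{KubernetesLeaderElectionDriver}} reference; the logger name is not matched because it has no {{@}} suffix.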
 

 

I am still working on the stress test and will share the results later today.


> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -------------------------------------------------------------------------
>
>                 Key: FLINK-32678
>                 URL: https://issues.apache.org/jira/browse/FLINK-32678
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 1.18.0
>            Reporter: Matthias Pohl
>            Assignee: Yang Wang
>            Priority: Major
>              Labels: release-testing
>             Fix For: 1.18.0
>
>
> -We decided to do another round of testing for the LeaderElection refactoring 
> which happened in 
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a larger number of jobs in a Flink 
> environment to look for unusual behavior. This Jira issue shall cover the 
> following 1.18 efforts:
>  * Leader Election refactoring 
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
>  FLINK-26522)
>  * Akka to Pekko transition (FLINK-32468)
>  * flink-shaded 17.0 updates (FLINK-32032)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
