[
https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759429#comment-17759429
]
Yang Wang edited comment on FLINK-32678 at 8/28/23 4:04 AM:
------------------------------------------------------------
*Functionality Test*
1. [SUCCEED] Build the Docker image from the release-1.18 branch
2. [SUCCEED] Use the flink-k8s-operator to start a Flink app with HA enabled,
and check the logs and UI
3. [SUCCEED] Check HA ConfigMaps, one for leader election and one for the job
checkpoint
4. [SUCCEED] Check the thread dump of the JobManager and verify that only one
leader elector is running (the value was 4 before 1.15 with the old HA)
5. [SUCCEED] Use the command {{kubectl exec
flink-example-statemachine-897cb6d4f-bzdv5 -- /bin/sh -c 'kill 1'}} to kill the
JobManager and verify that no new TaskManager is created (Flink should reuse
the existing TaskManagers before the idle timeout).
6. [SUCCEED] Verify that the Flink job recovers from the latest checkpoint and
keeps running
2023-08-28 03:40:29,167 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring job
596bdc6b7ac5bcefb611c3df08d64520 from Checkpoint 101 @ 1693194000259 for
596bdc6b7ac5bcefb611c3df08d64520 located at
oss://flink-test/flink-k8s-ha-stress-test/flink-cp/596bdc6b7ac5bcefb611c3df08d64520/chk-101.
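The recovery check in step 6 can also be scripted. Below is a minimal sketch, assuming the JobManager log is available as text and the restore message keeps the format quoted above. The function name {{parse_restore_line}} and the regular expression are illustrative and not part of Flink.

```python
import re
from datetime import datetime, timezone

# Pattern for the CheckpointCoordinator restore message quoted above.
# Assumption: the message format matches the log line in this comment.
RESTORE_RE = re.compile(
    r"Restoring job (?P<job>[0-9a-f]{32}) from Checkpoint (?P<id>\d+) "
    r"@ (?P<ts>\d+) for (?P=job) located at (?P<path>\S+?)\.?$"
)


def parse_restore_line(line):
    """Return restore details as a dict, or None if the message does not match."""
    # The message may be wrapped over several lines; collapse whitespace first.
    text = " ".join(line.split())
    m = RESTORE_RE.search(text)
    if m is None:
        return None
    return {
        "job_id": m.group("job"),
        "checkpoint_id": int(m.group("id")),
        # Interpreted as epoch milliseconds and converted to UTC.
        "triggered_at": datetime.fromtimestamp(int(m.group("ts")) / 1000, tz=timezone.utc),
        "path": m.group("path"),
    }
```

For the line quoted above, this yields checkpoint ID 101, and the timestamp 1693194000259 decodes to 03:40:00 UTC on 2023-08-28. Comparing the ID against the newest {{chk-*}} directory in the checkpoint storage would confirm that the latest checkpoint was used.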
Everything works well after the refactoring of leader election, Akka, and
flink-shaded. I found only one log message that could be improved by replacing
the object ID with a more meaningful name.
2023-08-28 03:40:18,258 INFO
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] -
LeaderContender has been registered under component 'resourcemanager' for
org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver@2a19a0fe.
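The {{@2a19a0fe}} suffix is the default Java {{Object.toString()}} form (class name, {{@}}, hexadecimal hash code). Below is a rough sketch for spotting other log messages of the same kind, assuming the logs are available as text. The helper name {{find_identity_strings}} and the regular expression are illustrative only.

```python
import re

# Matches a dotted Java class name followed by '@' and a hex hash code,
# i.e. the default Object.toString() form. This is a heuristic and may
# produce false positives on similar-looking tokens.
IDENTITY_RE = re.compile(r"\b(?:[A-Za-z_$][\w$]*\.)+[A-Z][\w$]*@[0-9a-f]{1,8}\b")


def find_identity_strings(log_text):
    """Return every token in log_text that looks like ClassName@hexhash."""
    return IDENTITY_RE.findall(log_text)
```

Running it over a full JobManager log would list the candidates for a more descriptive name.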
I am still working on the stress test and will share the result later today.
> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -------------------------------------------------------------------------
>
> Key: FLINK-32678
> URL: https://issues.apache.org/jira/browse/FLINK-32678
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Coordination
> Affects Versions: 1.18.0
> Reporter: Matthias Pohl
> Assignee: Yang Wang
> Priority: Major
> Labels: release-testing
> Fix For: 1.18.0
>
>
> -We decided to do another round of testing for the LeaderElection refactoring
> which happened in
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a larger number of jobs in a Flink
> environment to look for unusual behavior. This Jira issue covers the
> following 1.18 efforts:
> * Leader Election refactoring
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
> FLINK-26522)
> * Akka to Pekko transition (FLINK-32468)
> * flink-shaded 17.0 updates (FLINK-32032)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)