GJL commented on a change in pull request #9060: [FLINK-13145][tests] Run HA dataset E2E test with new RestartPipelinedRegionStrategy
URL: https://github.com/apache/flink/pull/9060#discussion_r302172084
##########
File path: flink-end-to-end-tests/test-scripts/test_ha_dataset.sh
##########
@@ -53,20 +52,51 @@ function run_ha_test() {
wait_job_running ${JOB_ID}
- # start the watchdog that keeps the number of JMs stable
- start_ha_jm_watchdog 1 "StandaloneSessionClusterEntrypoint" start_jm_cmd "8081"
-
+ local c
for (( c=0; c<${JM_KILLS}; c++ )); do
# kill the JM and wait for watchdog to
# create a new one which will take over
kill_single 'StandaloneSessionClusterEntrypoint'
wait_job_running ${JOB_ID}
done
- cancel_job ${JOB_ID}
+ for (( c=0; c<${TM_KILLS}; c++ )); do
+ sleep $(( ( RANDOM % 10 ) + 1 ))
+ kill_and_replace_random_task_manager
+ wait_job_running ${JOB_ID}
+ done
+
+ wait_job_terminal_state ${JOB_ID} "FINISHED"
Review comment:
These are valid concerns.
> How much longer does the test now run for?
The test runs for 4.5-5 minutes on my machine. It takes around 2 minutes to
complete the batch job after the last injected fault (timed using
unscientific methods). The test in its current form is rather similar to
`test_batch_allround.sh`, so there is a chance that the two can be merged.
> I like neither option, do admit though that this would make it very
difficult (or even impossible) to verify the correctness of the output.
I don't see a good solution yet. Here are some options:
1. Make the job block on external signals (files), and make the job smaller
(smaller dataset)
1. Leave it as before, i.e., don't verify the correctness of the output
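To make option 1 more concrete, here is a minimal sketch of how the test script could coordinate with a job via signal files. All names (`SIGNAL_DIR`, `release_stage`, `wait_for_signal`, the `.go`/`.done` suffixes) are hypothetical illustrations, not part of the existing Flink test-script framework; the job-side UDF would poll for the `.go` file before proceeding and touch the `.done` file when finished.

```shell
#!/usr/bin/env bash
# Sketch: gate job progress on signal files so the test controls when
# each stage runs, allowing deterministic output verification.

SIGNAL_DIR="$(mktemp -d)"

# Test-side: allow a named stage of the job to proceed.
release_stage() {
    local stage="$1"
    touch "${SIGNAL_DIR}/${stage}.go"
}

# Test-side: block until the job marks a stage as done, or time out.
wait_for_signal() {
    local stage="$1"
    local timeout="${2:-60}"
    local waited=0
    while [[ ! -f "${SIGNAL_DIR}/${stage}.done" ]]; do
        sleep 1
        waited=$(( waited + 1 ))
        if (( waited >= timeout )); then
            echo "Timed out waiting for stage '${stage}'" >&2
            return 1
        fi
    done
}
```

With such helpers, the test could release one stage, inject a fault (e.g. `kill_and_replace_random_task_manager`), then wait for the stage to complete before checking the partial output.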
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services