[
https://issues.apache.org/jira/browse/FLINK-23210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409466#comment-17409466
]
Till Rohrmann commented on FLINK-23210:
---------------------------------------
Hard to tell what went wrong because the logs are no longer available. It looks
as if there was a gap of about 10 minutes in which the system could not complete
2 checkpoints, even though no TM was killed during that time.
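
For reference, the size of that gap can be read off the quoted output below by
comparing the last "Killed TM" line with the timeout line. A minimal sketch,
assuming the two log lines are copied verbatim from the quoted output; the
helper name parse_ts and the year 2021 are illustrative assumptions (the log
does not print a year, and the year does not affect the difference):

```python
from datetime import datetime


def parse_ts(line, year=2021):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a test log line.

    The year is not part of the log output, so it is passed in as an assumption.
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


# Two lines taken from the quoted test output.
last_kill = "Jul 01 12:56:56 Killed TM @ 355735"
timeout = "Jul 01 13:06:00 A timeout occurred waiting for Completed checkpoint"

gap = parse_ts(timeout) - parse_ts(last_kill)
gap_seconds = int(gap.total_seconds())
print(f"{gap_seconds // 60} min {gap_seconds % 60} s without a TM kill")
```

This only measures the time between the two test-script events; it says nothing
about why no checkpoints completed in that window.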
> Running HA per-job cluster (hashmap, sync) end-to-end test failed on azure
> --------------------------------------------------------------------------
>
> Key: FLINK-23210
> URL: https://issues.apache.org/jira/browse/FLINK-23210
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.13.1
> Reporter: Dawid Wysakowicz
> Priority: Major
> Labels: stale-major, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=19776&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529&l=11631
> {code}
> Jul 01 12:54:55
> Jul 01 12:54:55
> ==============================================================================
> Jul 01 12:54:55 Running 'Running HA per-job cluster (hashmap, sync)
> end-to-end test'
> Jul 01 12:54:55
> ==============================================================================
> Jul 01 12:54:55 TEST_DATA_DIR:
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-55217860509
> Jul 01 12:54:55 Flink dist directory:
> /home/vsts/work/1/s/flink-dist/target/flink-1.13-SNAPSHOT-bin/flink-1.13-SNAPSHOT
> Jul 01 12:54:55 Flink dist directory:
> /home/vsts/work/1/s/flink-dist/target/flink-1.13-SNAPSHOT-bin/flink-1.13-SNAPSHOT
> Jul 01 12:54:55 Starting zookeeper daemon on host fv-az91-551.
> Jul 01 12:54:55 Running on HA mode: parallelism=4, backend=hashmap,
> asyncSnapshots=false, incremSnapshots=false and zk=3.5.
> Jul 01 12:54:56 Starting standalonejob daemon on host fv-az91-551.
> Jul 01 12:54:56 Start 1 more task managers
> Jul 01 12:54:58 Starting taskexecutor daemon on host fv-az91-551.
> Jul 01 12:55:02 Job (00000000000000000000000000000000) is not yet running.
> Jul 01 12:55:07 Job (00000000000000000000000000000000) is running.
> Jul 01 12:55:07 Running JM watchdog @ 351161
> Jul 01 12:55:07 Running TM watchdog @ 351162
> Jul 01 12:55:07 Waiting for text Completed checkpoint [1-9]* for job
> 00000000000000000000000000000000 to appear 2 of times in logs...
> Jul 01 12:55:08 Killed JM @ 350374
> Jul 01 12:55:08 Waiting for text Completed checkpoint [1-9]* for job
> 00000000000000000000000000000000 to appear 2 of times in logs...
> grep: /home/vsts/work/_temp/debug_files/flink-logs/*standalonejob-1*.log: No
> such file or directory
> grep: /home/vsts/work/_temp/debug_files/flink-logs/*standalonejob-1*.log: No
> such file or directory
> Jul 01 12:55:09 Killed TM @ 350614
> grep: /home/vsts/work/_temp/debug_files/flink-logs/*standalonejob-1*.log: No
> such file or directory
> Jul 01 12:55:10 Starting standalonejob daemon on host fv-az91-551.
> Jul 01 12:55:55 Killed TM @ 351701
> Jul 01 12:55:55 Killed JM @ 351836
> Jul 01 12:55:55 Waiting for text Completed checkpoint [1-9]* for job
> 00000000000000000000000000000000 to appear 2 of times in logs...
> grep: /home/vsts/work/_temp/debug_files/flink-logs/*standalonejob-2*.log: No
> such file or directory
> grep: /home/vsts/work/_temp/debug_files/flink-logs/*standalonejob-2*.log: No
> such file or directory
> Jul 01 12:55:57 Starting standalonejob daemon on host fv-az91-551.
> grep: /home/vsts/work/_temp/debug_files/flink-logs/*standalonejob-2*.log: No
> such file or directory
> Jul 01 12:56:44 Killed TM @ 353554
> Jul 01 12:56:56 Killed TM @ 355735
> Jul 01 13:06:00 A timeout occurred waiting for Completed checkpoint [1-9]*
> for job 00000000000000000000000000000000 to appear 2 of times in logs.
> Jul 01 13:06:00 Stopping job timeout watchdog (with pid=349933)
> Jul 01 13:06:00 Killing JM watchdog @ 351161
> Jul 01 13:06:00 Killing TM watchdog @ 351162
> Jul 01 13:06:00 [FAIL] Test script contains errors.
> Jul 01 13:06:00 Checking of logs skipped.
> Jul 01 13:06:00
> Jul 01 13:06:00 [FAIL] 'Running HA per-job cluster (hashmap, sync) end-to-end
> test' failed after 11 minutes and 5 seconds! Test exited with exit code 1
> Jul 01 13:06:00
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)