[
https://issues.apache.org/jira/browse/FLINK-22343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409458#comment-17409458
]
Till Rohrmann commented on FLINK-22343:
---------------------------------------
It looks as if the TMs are being killed too aggressively, so that two
checkpoints cannot complete before the 10-minute timeout is reached.
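
As a rough check of this reading, the gaps between the "Killed TM" lines in the quoted log can be computed. This is a minimal sketch, not part of the test scripts: the timestamps are copied by hand from the log, and the helper name `kill_gaps` is hypothetical.

```python
from datetime import datetime

# "Killed TM" timestamps copied by hand from the quoted Azure log (Apr 18).
KILL_TIMES = [
    "21:19:49", "21:20:35", "21:21:43", "21:21:55", "21:22:53",
    "21:23:05", "21:24:14", "21:25:23", "21:25:35", "21:26:23",
    "21:27:20", "21:27:32", "21:28:30", "21:29:38",
]


def kill_gaps(times):
    """Return the seconds elapsed between consecutive kill timestamps.

    Hypothetical helper for illustration; assumes all timestamps fall on
    the same day and are already in chronological order.
    """
    parsed = [datetime.strptime(t, "%H:%M:%S") for t in times]
    return [int((b - a).total_seconds()) for a, b in zip(parsed, parsed[1:])]


gaps = kill_gaps(KILL_TIMES)
print("gaps (s):", gaps)
print("shortest:", min(gaps), "longest:", max(gaps))
```

In this run no TM appears to survive for much more than a minute before the next kill, and several gaps are far shorter. The checkpoint interval and recovery time are not visible in the log, so this is only suggestive of why two checkpoints did not complete in time.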
> Running HA per-job cluster (rocks, non-incremental) end-to-end test fails on
> azure
> ----------------------------------------------------------------------------------
>
> Key: FLINK-22343
> URL: https://issues.apache.org/jira/browse/FLINK-22343
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.13.0
> Reporter: Dawid Wysakowicz
> Priority: Minor
> Labels: auto-deprioritized-major, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16731&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=8d6b4dd3-4ca1-5611-1743-57a7d76b395a&l=1629
> {code}
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_ha.sh: line
> 151: [: 58)\n\tat
> org.apache.flink.runtime.rest.handler.job.AbstractAccessExecutionGraphHandler.handleRequest(AbstractAccessExecutionGraphHandler.java:
> integer expression expected
> Apr 18 21:19:39 Starting standalonejob daemon on host fv-az159-225.
> Apr 18 21:19:47 Killed JM @ 31100
> Apr 18 21:19:47 Waiting for text Completed checkpoint [1-9]* for job
> 00000000000000000000000000000000 to appear 2 of times in logs...
> Apr 18 21:19:49 Killed TM @ 29999
> Apr 18 21:19:50 Starting standalonejob daemon on host fv-az159-225.
> Apr 18 21:20:35 Killed TM @ 32514
> Apr 18 21:21:43 Killed TM @ 2360
> Apr 18 21:21:55 Killed TM @ 7675
> Apr 18 21:22:53 Killed TM @ 8218
> Apr 18 21:23:05 Killed TM @ 11789
> Apr 18 21:24:14 Killed TM @ 12337
> Apr 18 21:25:23 Killed TM @ 17671
> Apr 18 21:25:35 Killed TM @ 21514
> Apr 18 21:26:23 Killed TM @ 22058
> Apr 18 21:27:20 Killed TM @ 23911
> Apr 18 21:27:32 Killed TM @ 25993
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_ha.sh: line
> 151: [: 58)\n\tat
> org.apache.flink.runtime.rest.handler.job.AbstractAccessExecutionGraphHandler.handleRequest(AbstractAccessExecutionGraphHandler.java:
> integer expression expected
> Apr 18 21:28:30 Killed TM @ 26540
> Apr 18 21:29:38 Killed TM @ 28619
> Apr 18 21:29:55 A timeout occurred waiting for Completed checkpoint [1-9]*
> for job 00000000000000000000000000000000 to appear 2 of times in logs.
> Apr 18 21:29:55 Stopping job timeout watchdog (with pid=26590)
> Apr 18 21:29:55 Killing JM watchdog @ 28473
> Apr 18 21:29:55 Killing TM watchdog @ 28474
> Apr 18 21:29:55 [FAIL] Test script contains errors.
> Apr 18 21:29:55 Checking of logs skipped.
> Apr 18 21:29:55
> Apr 18 21:29:55 [FAIL] 'Running HA per-job cluster (rocks, non-incremental)
> end-to-end test' failed after 11 minutes and 28 seconds! Test exited with
> exit code 1
> {code}