[ 
https://issues.apache.org/jira/browse/FLINK-22343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409458#comment-17409458
 ] 

Till Rohrmann commented on FLINK-22343:
---------------------------------------

It looks as if the TMs are being killed too aggressively, so that two 
checkpoints cannot complete before the 10-minute timeout is reached.
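
To put rough numbers on this, here is a minimal sketch (not part of the Flink test scripts) that computes the gaps between consecutive TM kills. The timestamps are copied from the log quoted below; the helper name `kill_gaps` is made up for illustration.

```python
from datetime import datetime

# "Killed TM" timestamps copied from the quoted log (Apr 18).
TM_KILLS = [
    "21:19:49", "21:20:35", "21:21:43", "21:21:55", "21:22:53",
    "21:23:05", "21:24:14", "21:25:23", "21:25:35", "21:26:23",
    "21:27:20", "21:27:32", "21:28:30", "21:29:38",
]


def kill_gaps(stamps):
    """Return the seconds between consecutive kill timestamps (same day)."""
    times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


gaps = kill_gaps(TM_KILLS)
print(gaps)
print("shortest gap:", min(gaps), "s; longest gap:", max(gaps), "s")
```

In this run, consecutive TM kills are never more than 69 seconds apart and are sometimes only 12 seconds apart, which leaves little room for two checkpoints to complete.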

> Running HA per-job cluster (rocks, non-incremental) end-to-end test fails on azure
> ----------------------------------------------------------------------------------
>
>                 Key: FLINK-22343
>                 URL: https://issues.apache.org/jira/browse/FLINK-22343
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.13.0
>            Reporter: Dawid Wysakowicz
>            Priority: Minor
>              Labels: auto-deprioritized-major, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16731&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=8d6b4dd3-4ca1-5611-1743-57a7d76b395a&l=1629
> {code}
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_ha.sh: line 151: [: 58)\n\tat org.apache.flink.runtime.rest.handler.job.AbstractAccessExecutionGraphHandler.handleRequest(AbstractAccessExecutionGraphHandler.java: integer expression expected
> Apr 18 21:19:39 Starting standalonejob daemon on host fv-az159-225.
> Apr 18 21:19:47 Killed JM @ 31100
> Apr 18 21:19:47 Waiting for text Completed checkpoint [1-9]* for job 00000000000000000000000000000000 to appear 2 of times in logs...
> Apr 18 21:19:49 Killed TM @ 29999
> Apr 18 21:19:50 Starting standalonejob daemon on host fv-az159-225.
> Apr 18 21:20:35 Killed TM @ 32514
> Apr 18 21:21:43 Killed TM @ 2360
> Apr 18 21:21:55 Killed TM @ 7675
> Apr 18 21:22:53 Killed TM @ 8218
> Apr 18 21:23:05 Killed TM @ 11789
> Apr 18 21:24:14 Killed TM @ 12337
> Apr 18 21:25:23 Killed TM @ 17671
> Apr 18 21:25:35 Killed TM @ 21514
> Apr 18 21:26:23 Killed TM @ 22058
> Apr 18 21:27:20 Killed TM @ 23911
> Apr 18 21:27:32 Killed TM @ 25993
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_ha.sh: line 151: [: 58)\n\tat org.apache.flink.runtime.rest.handler.job.AbstractAccessExecutionGraphHandler.handleRequest(AbstractAccessExecutionGraphHandler.java: integer expression expected
> Apr 18 21:28:30 Killed TM @ 26540
> Apr 18 21:29:38 Killed TM @ 28619
> Apr 18 21:29:55 A timeout occurred waiting for Completed checkpoint [1-9]* for job 00000000000000000000000000000000 to appear 2 of times in logs.
> Apr 18 21:29:55 Stopping job timeout watchdog (with pid=26590)
> Apr 18 21:29:55 Killing JM watchdog @ 28473
> Apr 18 21:29:55 Killing TM watchdog @ 28474
> Apr 18 21:29:55 [FAIL] Test script contains errors.
> Apr 18 21:29:55 Checking of logs skipped.
> Apr 18 21:29:55 
> Apr 18 21:29:55 [FAIL] 'Running HA per-job cluster (rocks, non-incremental) end-to-end test' failed after 11 minutes and 28 seconds! Test exited with exit code 1
> {code}


