[ https://issues.apache.org/jira/browse/FLINK-16423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066646#comment-17066646 ]
Robert Metzger commented on FLINK-16423:
----------------------------------------
This time, the "Running HA (rocks, non-incremental) end-to-end" test
(defined in test_ha_datastream.sh) got stuck:
https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6605&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5
{code}
2020-03-25T07:58:13.0519937Z
==============================================================================
2020-03-25T07:58:13.0520821Z Running 'Running HA (rocks, non-incremental)
end-to-end test'
2020-03-25T07:58:13.0521463Z
==============================================================================
2020-03-25T07:58:13.0568337Z TEST_DATA_DIR:
/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-13056102659
2020-03-25T07:58:13.2459951Z Flink dist directory:
/home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
2020-03-25T07:58:13.2641419Z Flink dist directory:
/home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
2020-03-25T07:58:13.4935708Z Starting zookeeper daemon on host fv-az678.
2020-03-25T07:58:13.6586879Z Starting HA cluster with 1 masters.
2020-03-25T07:58:14.1123374Z Starting standalonesession daemon on host fv-az678.
2020-03-25T07:58:16.3858487Z Starting taskexecutor daemon on host fv-az678.
2020-03-25T07:58:16.4569875Z Waiting for Dispatcher REST endpoint to come up...
2020-03-25T07:58:17.5191614Z Waiting for Dispatcher REST endpoint to come up...
2020-03-25T07:58:18.5738994Z Waiting for Dispatcher REST endpoint to come up...
2020-03-25T07:58:19.9690298Z Waiting for Dispatcher REST endpoint to come up...
2020-03-25T07:58:21.0494277Z Waiting for Dispatcher REST endpoint to come up...
2020-03-25T07:58:22.0961059Z Dispatcher REST endpoint is up.
2020-03-25T07:58:22.0961573Z Running on HA mode: parallelism=4, backend=rocks,
asyncSnapshots=true, and incremSnapshots=false.
2020-03-25T07:58:33.1308229Z Job (65f21654a530ff9063116a91f419eb1b) is running.
2020-03-25T07:58:33.1309909Z Running JM watchdog @ 62236
2020-03-25T07:58:38.1331595Z Running TM watchdog @ 64235
2020-03-25T07:58:38.1333335Z Waiting for text Completed checkpoint [1-9]* for
job 65f21654a530ff9063116a91f419eb1b to appear 2 of times in logs...
2020-03-25T07:58:41.6939725Z Killed TM @ 60470
2020-03-25T09:55:52.6931725Z ##[error]The operation was canceled.
2020-03-25T09:55:52.6947548Z ##[section]Finishing: Run e2e tests
{code}
I am reporting this failure from a different script here as well, because I
suspect the two failures are related.
> test_ha_per_job_cluster_datastream.sh gets stuck
> ------------------------------------------------
>
> Key: FLINK-16423
> URL: https://issues.apache.org/jira/browse/FLINK-16423
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Reporter: Robert Metzger
> Priority: Major
>
> This was seen in
> https://dev.azure.com/rmetzger/Flink/_build/results?buildId=5905&view=logs&j=b1623ac9-0979-5b0d-2e5e-1377d695c991&t=e7804547-1789-5225-2bcf-269eeaa37447
> ... the relevant part of the logs is here:
> {code}
> 2020-03-04T11:27:25.4819486Z
> ==============================================================================
> 2020-03-04T11:27:25.4820470Z Running 'Running HA per-job cluster (rocks,
> non-incremental) end-to-end test'
> 2020-03-04T11:27:25.4820922Z
> ==============================================================================
> 2020-03-04T11:27:25.4840177Z TEST_DATA_DIR:
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-25482960156
> 2020-03-04T11:27:25.6712478Z Flink dist directory:
> /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-04T11:27:25.6830402Z Flink dist directory:
> /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-04T11:27:26.2988914Z Starting zookeeper daemon on host fv-az655.
> 2020-03-04T11:27:26.3001237Z Running on HA mode: parallelism=4,
> backend=rocks, asyncSnapshots=true, and incremSnapshots=false.
> 2020-03-04T11:27:27.4206924Z Starting standalonejob daemon on host fv-az655.
> 2020-03-04T11:27:27.4217066Z Start 1 more task managers
> 2020-03-04T11:27:30.8412541Z Starting taskexecutor daemon on host fv-az655.
> 2020-03-04T11:27:38.1779980Z Job (00000000000000000000000000000000) is
> running.
> 2020-03-04T11:27:38.1781375Z Running JM watchdog @ 89778
> 2020-03-04T11:27:38.1781858Z Running TM watchdog @ 89779
> 2020-03-04T11:27:38.1783272Z Waiting for text Completed checkpoint [1-9]* for
> job 00000000000000000000000000000000 to appear 2 of times in logs...
> 2020-03-04T13:21:29.9076797Z ##[error]The operation was canceled.
> 2020-03-04T13:21:29.9094090Z ##[section]Finishing: Run e2e tests
> {code}
> The last three lines show the test waiting for a checkpoint that never
> completes: nothing more is logged for almost two hours, until the CI run is
> canceled.
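
Both logs end in a wait for "Completed checkpoint" lines that only stops when
the CI run is canceled. As a minimal sketch of how such a wait could be bounded,
the helper below polls a log file until a pattern has matched an expected number
of lines, or gives up after a timeout. The function name, its parameters, and
the log path in the usage note are hypothetical and are not taken from Flink's
test scripts.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not the function used by the Flink e2e scripts).
# Waits until PATTERN matches at least EXPECTED lines in LOG_FILE,
# and fails once TIMEOUT_SECONDS have passed.
wait_for_matches_with_timeout() {
  local log_file="$1"
  local pattern="$2"
  local expected="$3"
  local timeout_seconds="$4"
  local deadline count

  deadline=$(( $(date +%s) + timeout_seconds ))
  count=0

  while [ "$(date +%s)" -lt "$deadline" ]; do
    # grep -c prints the number of matching lines. It exits non-zero when
    # nothing matches or the file does not exist yet, hence the "|| true".
    count=$(grep -c -E -- "$pattern" "$log_file" 2>/dev/null || true)
    count=${count:-0}
    if [ "$count" -ge "$expected" ]; then
      return 0
    fi
    sleep 1
  done

  echo "Timed out after ${timeout_seconds}s: ${count} of ${expected} matches for '${pattern}'" >&2
  return 1
}
```

Assuming the JobManager log were at a hypothetical path /tmp/jobmanager.log, a
call such as
wait_for_matches_with_timeout /tmp/jobmanager.log "Completed checkpoint [0-9]+ for job 65f21654a530ff9063116a91f419eb1b" 2 600
would return a non-zero status after ten minutes instead of blocking until the
build is canceled.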
--
This message was sent by Atlassian Jira
(v8.3.4#803005)