[ https://issues.apache.org/jira/browse/FLINK-21329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311782#comment-17311782 ]
Matthias commented on FLINK-21329:
----------------------------------
[~trohrmann] you're right. I forgot to assign the issue.
I looked into the different failures with the help of [~chesnay]. It appears
that the job failures are caused by the build hitting its overall timeout.
This triggers the killing of child processes, which results in seemingly
random failures (e.g. ZooKeeper becoming unavailable). Considering that we're
close to the release, we might want to increase the timeout rather than put
effort into reducing the build times, I guess?
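For illustration, the failure mode matches a classic watchdog pattern: the
test runs in its own process group, and once the per-test timeout expires the
whole group gets killed, taking daemons such as ZooKeeper down mid-test. A
minimal bash sketch of that pattern (simplified and hypothetical;
{{run_test.sh}} is a placeholder, the actual logic lives in
flink-end-to-end-tests/test-scripts):
{code}
#!/usr/bin/env bash
# Hypothetical, simplified watchdog sketch -- not the exact test-scripts code.
TEST_TIMEOUT_SECONDS=600   # matches the "did not finish after 600 seconds" log line

# Start the test in a fresh process group so all of its children can be
# signalled at once.
setsid bash -c './run_test.sh' &
test_pid=$!

(
    sleep "${TEST_TIMEOUT_SECONDS}"
    echo "Test (pid: ${test_pid}) did not finish after ${TEST_TIMEOUT_SECONDS} seconds."
    # A negative PID signals the whole process group, i.e. the test and every
    # child it spawned (ZooKeeper, JobManager, TaskManager daemons, ...).
    kill -9 -- "-${test_pid}"
) &
watchdog_pid=$!

wait "${test_pid}"
# Test finished in time: stop the watchdog so it does not fire later.
kill "${watchdog_pid}" 2>/dev/null
{code}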
> "Local recovery and sticky scheduling end-to-end test" does not finish within
> 600 seconds
> -----------------------------------------------------------------------------------------
>
> Key: FLINK-21329
> URL: https://issues.apache.org/jira/browse/FLINK-21329
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.13.0
> Reporter: Robert Metzger
> Assignee: Till Rohrmann
> Priority: Critical
> Labels: test-stability
> Fix For: 1.13.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13118&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529&l=38515
> {code}
> Feb 08 22:25:46
> ==============================================================================
> Feb 08 22:25:46 Running 'Local recovery and sticky scheduling end-to-end test'
> Feb 08 22:25:46
> ==============================================================================
> Feb 08 22:25:46 TEST_DATA_DIR: /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-46881214821
> Feb 08 22:25:47 Flink dist directory: /home/vsts/work/1/s/flink-dist/target/flink-1.13-SNAPSHOT-bin/flink-1.13-SNAPSHOT
> Feb 08 22:25:47 Running local recovery test with configuration:
> Feb 08 22:25:47 parallelism: 4
> Feb 08 22:25:47 max attempts: 10
> Feb 08 22:25:47 backend: rocks
> Feb 08 22:25:47 incremental checkpoints: false
> Feb 08 22:25:47 kill JVM: false
> Feb 08 22:25:47 Starting zookeeper daemon on host fv-az127-394.
> Feb 08 22:25:47 Starting HA cluster with 1 masters.
> Feb 08 22:25:48 Starting standalonesession daemon on host fv-az127-394.
> Feb 08 22:25:49 Starting taskexecutor daemon on host fv-az127-394.
> Feb 08 22:25:49 Waiting for Dispatcher REST endpoint to come up...
> Feb 08 22:25:50 Waiting for Dispatcher REST endpoint to come up...
> Feb 08 22:25:51 Waiting for Dispatcher REST endpoint to come up...
> Feb 08 22:25:53 Waiting for Dispatcher REST endpoint to come up...
> Feb 08 22:25:54 Dispatcher REST endpoint is up.
> Feb 08 22:25:54 Started TM watchdog with PID 28961.
> Feb 08 22:25:58 Job has been submitted with JobID e790e85a39040539f9386c0df7ca4812
> Feb 08 22:35:47 Test (pid: 27970) did not finish after 600 seconds.
> Feb 08 22:35:47 Printing Flink logs and killing it:
> {code}
> and
> {code}
>   at org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver.unhandledError(ZooKeeperLeaderRetrievalDriver.java:184)
>   at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:713)
>   at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:709)
>   at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
>   at org.apache.flink.shaded.curator4.org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
>   at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
>   at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.logError(CuratorFrameworkImpl.java:708)
>   at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:874)
>   at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:990)
>   at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943)
>   at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66)
>   at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:346)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
>   at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>   at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862)
>   ... 10 more
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)