[
https://issues.apache.org/jira/browse/FLINK-15661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135925#comment-17135925
]
Robert Metzger commented on FLINK-15661:
----------------------------------------
Thanks a lot for investigating the issue.
What is indeed concerning are log lines like this (on *AlibabaCI003-agent03*)
{code}
20:46:56,889 [ main] INFO
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils [] - Actor system
started at akka.tcp://[email protected]:46158
20:47:00,486 [main-SendThread(127.0.0.1:43668)] WARN
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Client
session timed out, have not heard from server in 4040ms for sessionid
0x1724870252b0004
{code}
Where nothing seems to be happening for ~4 seconds.
I have observed other "hard to explain" networking issues on the
AliCloud-hosted CI machines. So far, I have suspected an issue in the network
stack, but maybe I need to consider other OS components as well (process
scheduler, docker, the underlying "hardware").
Why Scheduler?
Machines are running up to 8 test jobs concurrently, on 32 cores. Not sure if
the Kernel sometimes has some hiccups scheduling this many workloads.
Why Docker?
The machines run an older (still supported) CentOs with a Kernel from 2013,
with Docker from 2019. It seems that Linux 3.10 (which we are using) is the
first to be supported by Docker.
Why underlying hardware?
It's a cloud-hosted machine, which sometimes get's live-migrated (example:
https://cloud.google.com/compute/docs/instances/setting-instance-scheduling-options#live_migrate)
What were the other 7 builders doing on the machine at that time?
*AlibabaCI003-agent04*
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2085&view=logs&j=6bfdaf55-0c08-5e3f-a2d2-2a0285fd41cf&t=93018cfc-6498-565b-e034-4b68eb90fc80
{code}
2020-05-24T20:46:56.5617885Z [INFO] Tests run: 16, Failures: 0, Errors: 0,
Skipped: 0, Time elapsed: 0.164 s - in
org.apache.flink.api.common.typeutils.base.LocalDateTimeSerializerTest
2020-05-24T20:46:56.5623520Z [INFO] Running
org.apache.flink.api.common.typeutils.base.LocalDateSerializerTest
2020-05-24T20:47:00.5093823Z [INFO] Tests run: 16, Failures: 0, Errors: 0,
Skipped: 0, Time elapsed: 0.142 s - in
org.apache.flink.api.common.typeutils.base.LocalDateSerializerTest
2020-05-24T20:47:00.5094515Z [INFO] Running
org.apache.flink.api.common.typeutils.base.ShortComparatorTest
{code}
==> 4 seconds no output
*AlibabaCI003-agent06* was compiling Flink at that time:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2085&view=logs&j=ba53eb01-1462-56a3-8e98-0dd97fbcaab5&t=eb5f4d19-2d2d-5856-a4ce-acf5f904a994
{code}
2020-05-24T20:43:04.9430119Z [INFO] Building flink-hadoop-fs 1.11-SNAPSHOT
2020-05-24T20:43:04.9430595Z [INFO]
------------------------------------------------------------------------
2020-05-24T20:47:26.1287464Z [INFO]
2020-05-24T20:47:26.1308319Z [INFO] --- maven-checkstyle-plugin:2.17:check
(validate) @ flink-hadoop-fs ---
{code}
==> 4.5 minutes no output
*AlibabaCI003-agent07* was also compiling
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2085&view=logs&j=3b6ec2fd-a816-5e75-c775-06fb87cb6670&t=b33fdd4f-3de5-542e-2624-5d53167bb672
{code}
2020-05-24T20:46:04.4727910Z [INFO] --- scala-maven-plugin:3.2.2:compile
(scala-compile-first) @ flink-runtime_2.12 ---
2020-05-24T20:48:06.5131113Z [INFO] Nothing to compile - all classes are up to
date
{code}
==> 2 minutes no output
*AlibabaCI003-agent01* (test_cron_scala212 tests)
{code}
2020-05-24T20:44:28.3630097Z [INFO] Building flink-hadoop-fs 1.11-SNAPSHOT
2020-05-24T20:44:28.3630832Z [INFO]
------------------------------------------------------------------------
2020-05-24T20:47:12.7348902Z [INFO]
2020-05-24T20:47:12.7350119Z [INFO] --- maven-checkstyle-plugin:2.17:check
(validate) @ flink-hadoop-fs ---
{code}
==> 3.6 minutes no output
*AlibabaCI003-agent08* (test_cron_jdk11 core)
{code}
2020-05-24T20:46:15.9185266Z [INFO] --- scala-maven-plugin:3.2.2:compile
(scala-compile-first) @ flink-runtime_2.11 ---
2020-05-24T20:48:09.1374085Z [INFO] Nothing to compile - all classes are up to
date
{code}
=> 2 min
*AlibabaCI003-agent05*:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2084&view=logs&j=6bfdaf55-0c08-5e3f-a2d2-2a0285fd41cf&t=71a2d105-6e31-43b0-a3c1-093e49c4fc84
{code}
2020-05-24T20:45:03.0317850Z [INFO] Building flink-runtime 1.12-SNAPSHOT
2020-05-24T20:45:03.0318385Z [INFO]
------------------------------------------------------------------------
2020-05-24T20:48:17.1772639Z [INFO]
2020-05-24T20:48:17.1774062Z [INFO] --- git-commit-id-plugin:4.0.0:revision
(get-the-git-infos) @ flink-runtime_2.11 ---
{code}
=> 3 min
*AlibabaCI003-agent02*:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2084&view=logs&j=219e462f-e75e-506c-3671-5017d866ccf6&t=55a6756c-4b75-47ef-a21a-004b6fe3c810
{code}
2020-05-24T20:46:55.8276647Z Starting null#executeTest[statebackend type
=ROCKSDB_FULLY_ASYNC].
2020-05-24T20:47:09.7124370Z Finished null#executeTest[statebackend type
=ROCKSDB_FULLY_ASYNC].
{code}
=> 15 seconds
Out of the 8 builders:
1 Running our timed-out test
1 running some rocksdb-statebackend test
6 compiling (3 downloading dependencies)
We observe "no output" for 4 seconds to 270 seconds (4.5 minutes). I don't
think that the underlying infrastructure was undergoing some kind of migration.
I suspect the issue more on the OS level. I will continue investigating ...
> JobManagerHAProcessFailureRecoveryITCase.testDispatcherProcessFailure failed
> because of Could not find Flink job
> -----------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-15661
> URL: https://issues.apache.org/jira/browse/FLINK-15661
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 1.11.0
> Reporter: Congxian Qiu(klion26)
> Priority: Critical
> Labels: test-stability
>
> 2020-01-19T06:25:02.3856954Z [ERROR]
> JobManagerHAProcessFailureRecoveryITCase.testDispatcherProcessFailure:347 The
> program encountered a ExecutionException :
> org.apache.flink.runtime.rest.util.RestClientException:
> [org.apache.flink.runtime.rest.handler.RestHandlerException:
> org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find
> Flink job (47fe3e8df0e59994938485f683d1410e)
> 2020-01-19T06:25:02.3857171Z at
> org.apache.flink.runtime.rest.handler.job.JobExecutionResultHandler.propagateException(JobExecutionResultHandler.java:91)
> 2020-01-19T06:25:02.3857571Z at
> org.apache.flink.runtime.rest.handler.job.JobExecutionResultHandler.lambda$handleRequest$1(JobExecutionResultHandler.java:82)
> 2020-01-19T06:25:02.3857866Z at
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
> 2020-01-19T06:25:02.3857982Z at
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
> 2020-01-19T06:25:02.3859852Z at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> 2020-01-19T06:25:02.3860440Z at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> 2020-01-19T06:25:02.3860732Z at
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:872)
> 2020-01-19T06:25:02.3860960Z at
> akka.dispatch.OnComplete.internal(Future.scala:263)
> 2020-01-19T06:25:02.3861099Z at
> akka.dispatch.OnComplete.internal(Future.scala:261)
> 2020-01-19T06:25:02.3861232Z at
> akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
> 2020-01-19T06:25:02.3861391Z at
> akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
> 2020-01-19T06:25:02.3861546Z at
> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> 2020-01-19T06:25:02.3861712Z at
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
> 2020-01-19T06:25:02.3861809Z at
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> 2020-01-19T06:25:02.3861916Z at
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> 2020-01-19T06:25:02.3862221Z at
> akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572)
> 2020-01-19T06:25:02.3862475Z at
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:23)
> 2020-01-19T06:25:02.3862626Z at
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:21)
> 2020-01-19T06:25:02.3862736Z at
> scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436)
> 2020-01-19T06:25:02.3862820Z at
> scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435)
> 2020-01-19T06:25:02.3867146Z at
> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> 2020-01-19T06:25:02.3867318Z at
> akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
> 2020-01-19T06:25:02.3867441Z at
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
> 2020-01-19T06:25:02.3867552Z at
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
> 2020-01-19T06:25:02.3867664Z at
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
> 2020-01-19T06:25:02.3867763Z at
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
> 2020-01-19T06:25:02.3867843Z at
> akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
> 2020-01-19T06:25:02.3867936Z at
> akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
> 2020-01-19T06:25:02.3868036Z at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
> 2020-01-19T06:25:02.3868145Z at
> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 2020-01-19T06:25:02.3868223Z at
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 2020-01-19T06:25:02.3868313Z at
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 2020-01-19T06:25:02.3868390Z at
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2020-01-19T06:25:02.3868520Z Caused by:
> java.util.concurrent.CompletionException:
> org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find
> Flink job (47fe3e8df0e59994938485f683d1410e)
> 2020-01-19T06:25:02.3868625Z at
> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$requestJobStatus$17(Dispatcher.java:516)
> 2020-01-19T06:25:02.3868734Z at
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
> 2020-01-19T06:25:02.3868831Z at
> java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:884)
> 2020-01-19T06:25:02.3869143Z at
> java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2196)
> 2020-01-19T06:25:02.3869241Z at
> org.apache.flink.runtime.dispatcher.Dispatcher.requestJobStatus(Dispatcher.java:510)
> 2020-01-19T06:25:02.3869319Z at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-01-19T06:25:02.3869418Z at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-01-19T06:25:02.3869506Z at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-01-19T06:25:02.3869602Z at
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-01-19T06:25:02.3869681Z at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:279)
> 2020-01-19T06:25:02.3869780Z at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:194)
> 2020-01-19T06:25:02.3869865Z at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> 2020-01-19T06:25:02.3869982Z at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
> 2020-01-19T06:25:02.3870062Z at
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> 2020-01-19T06:25:02.3870153Z at
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> 2020-01-19T06:25:02.3870228Z at
> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> 2020-01-19T06:25:02.3870399Z at
> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> 2020-01-19T06:25:02.3870481Z at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> 2020-01-19T06:25:02.3870571Z at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> 2020-01-19T06:25:02.3870646Z at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> 2020-01-19T06:25:02.3870733Z at
> akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> 2020-01-19T06:25:02.3870911Z at
> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> 2020-01-19T06:25:02.3871013Z at
> akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> 2020-01-19T06:25:02.3871086Z at
> akka.actor.ActorCell.invoke(ActorCell.scala:561)
> 2020-01-19T06:25:02.3871170Z at
> akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> 2020-01-19T06:25:02.3871350Z at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> 2020-01-19T06:25:02.3871439Z at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> 2020-01-19T06:25:02.3871509Z ... 4 more
> 2020-01-19T06:25:02.3871618Z Caused by:
> org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find
> Flink job (47fe3e8df0e59994938485f683d1410e)
> 2020-01-19T06:25:02.3871721Z at
> org.apache.flink.runtime.dispatcher.Dispatcher.getJobMasterGatewayFuture(Dispatcher.java:776)
> 2020-01-19T06:25:02.3871827Z at
> org.apache.flink.runtime.dispatcher.Dispatcher.requestJobStatus(Dispatcher.java:505)
> 2020-01-19T06:25:02.3871903Z ... 26 more
> 2020-01-19T06:25:02.3871975Z ]
>
> [https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_apis/build/builds/4461/logs/15]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)