[ 
https://issues.apache.org/jira/browse/FLINK-15661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135925#comment-17135925
 ] 

Robert Metzger commented on FLINK-15661:
----------------------------------------

Thanks a lot for investigating the issue.
What is indeed concerning are log lines like this (on *AlibabaCI003-agent03*)
{code}
20:46:56,889 [                main] INFO  
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils        [] - Actor system 
started at akka.tcp://[email protected]:46158
20:47:00,486 [main-SendThread(127.0.0.1:43668)] WARN  
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Client 
session timed out, have not heard from server in 4040ms for sessionid 
0x1724870252b0004
{code}
Where nothing seems to be happening for ~4 seconds.

I have observed other "hard to explain" networking issues on the 
AliCloud-hosted CI machines. So far, I have suspected an issue in the network 
stack, but maybe I need to consider other OS components as well (process 
scheduler, docker, the underlying "hardware").

Why Scheduler? 
Machines are running up to 8 test jobs concurrently, on 32 cores. Not sure if 
the Kernel sometimes has some hiccups scheduling this many workloads.

Why Docker?
The machines run an older (still supported) CentOs with a Kernel from 2013, 
with Docker from 2019. It seems that Linux 3.10 (which we are using) is the 
first to be supported by Docker.

Why underlying hardware?
It's a cloud-hosted machine, which sometimes get's live-migrated (example: 
https://cloud.google.com/compute/docs/instances/setting-instance-scheduling-options#live_migrate)

What were the other 7 builders doing on the machine at that time?

 *AlibabaCI003-agent04* 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2085&view=logs&j=6bfdaf55-0c08-5e3f-a2d2-2a0285fd41cf&t=93018cfc-6498-565b-e034-4b68eb90fc80
{code}
2020-05-24T20:46:56.5617885Z [INFO] Tests run: 16, Failures: 0, Errors: 0, 
Skipped: 0, Time elapsed: 0.164 s - in 
org.apache.flink.api.common.typeutils.base.LocalDateTimeSerializerTest
2020-05-24T20:46:56.5623520Z [INFO] Running 
org.apache.flink.api.common.typeutils.base.LocalDateSerializerTest
2020-05-24T20:47:00.5093823Z [INFO] Tests run: 16, Failures: 0, Errors: 0, 
Skipped: 0, Time elapsed: 0.142 s - in 
org.apache.flink.api.common.typeutils.base.LocalDateSerializerTest
2020-05-24T20:47:00.5094515Z [INFO] Running 
org.apache.flink.api.common.typeutils.base.ShortComparatorTest
{code}
==> 4 seconds no output

*AlibabaCI003-agent06* was compiling Flink at that time: 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2085&view=logs&j=ba53eb01-1462-56a3-8e98-0dd97fbcaab5&t=eb5f4d19-2d2d-5856-a4ce-acf5f904a994
{code}
2020-05-24T20:43:04.9430119Z [INFO] Building flink-hadoop-fs 1.11-SNAPSHOT
2020-05-24T20:43:04.9430595Z [INFO] 
------------------------------------------------------------------------
2020-05-24T20:47:26.1287464Z [INFO] 
2020-05-24T20:47:26.1308319Z [INFO] --- maven-checkstyle-plugin:2.17:check 
(validate) @ flink-hadoop-fs ---
{code}
==> 4.5 minutes no output

*AlibabaCI003-agent07* was also compiling 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2085&view=logs&j=3b6ec2fd-a816-5e75-c775-06fb87cb6670&t=b33fdd4f-3de5-542e-2624-5d53167bb672
{code}
2020-05-24T20:46:04.4727910Z [INFO] --- scala-maven-plugin:3.2.2:compile 
(scala-compile-first) @ flink-runtime_2.12 ---
2020-05-24T20:48:06.5131113Z [INFO] Nothing to compile - all classes are up to 
date
{code}
==> 2 minutes no output

*AlibabaCI003-agent01* (test_cron_scala212 tests)
{code}
2020-05-24T20:44:28.3630097Z [INFO] Building flink-hadoop-fs 1.11-SNAPSHOT
2020-05-24T20:44:28.3630832Z [INFO] 
------------------------------------------------------------------------
2020-05-24T20:47:12.7348902Z [INFO] 
2020-05-24T20:47:12.7350119Z [INFO] --- maven-checkstyle-plugin:2.17:check 
(validate) @ flink-hadoop-fs ---
{code}
==> 3.6 minutes no output

*AlibabaCI003-agent08* (test_cron_jdk11 core)
{code}
2020-05-24T20:46:15.9185266Z [INFO] --- scala-maven-plugin:3.2.2:compile 
(scala-compile-first) @ flink-runtime_2.11 ---
2020-05-24T20:48:09.1374085Z [INFO] Nothing to compile - all classes are up to 
date
{code}
=> 2 min

*AlibabaCI003-agent05*: 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2084&view=logs&j=6bfdaf55-0c08-5e3f-a2d2-2a0285fd41cf&t=71a2d105-6e31-43b0-a3c1-093e49c4fc84
{code}
2020-05-24T20:45:03.0317850Z [INFO] Building flink-runtime 1.12-SNAPSHOT
2020-05-24T20:45:03.0318385Z [INFO] 
------------------------------------------------------------------------
2020-05-24T20:48:17.1772639Z [INFO] 
2020-05-24T20:48:17.1774062Z [INFO] --- git-commit-id-plugin:4.0.0:revision 
(get-the-git-infos) @ flink-runtime_2.11 ---
{code}
=> 3 min

*AlibabaCI003-agent02*: 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2084&view=logs&j=219e462f-e75e-506c-3671-5017d866ccf6&t=55a6756c-4b75-47ef-a21a-004b6fe3c810
{code}
2020-05-24T20:46:55.8276647Z Starting null#executeTest[statebackend type 
=ROCKSDB_FULLY_ASYNC].
2020-05-24T20:47:09.7124370Z Finished null#executeTest[statebackend type 
=ROCKSDB_FULLY_ASYNC].
{code}
=> 15 seconds

Out of the 8 builders:
1 Running our timed-out test
1 running some rocksdb-statebackend test
6 compiling (3 downloading dependencies)

We observe "no output" for 4 seconds to 270 seconds (4.5 minutes). I don't 
think that the underlying infrastructure was undergoing some kind of migration.

I suspect the issue more on the OS level. I will continue investigating ...

> JobManagerHAProcessFailureRecoveryITCase.testDispatcherProcessFailure failed 
> because of Could not find Flink job 
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-15661
>                 URL: https://issues.apache.org/jira/browse/FLINK-15661
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Tests
>    Affects Versions: 1.11.0
>            Reporter: Congxian Qiu(klion26)
>            Priority: Critical
>              Labels: test-stability
>
> 2020-01-19T06:25:02.3856954Z [ERROR] 
> JobManagerHAProcessFailureRecoveryITCase.testDispatcherProcessFailure:347 The 
> program encountered a ExecutionException : 
> org.apache.flink.runtime.rest.util.RestClientException: 
> [org.apache.flink.runtime.rest.handler.RestHandlerException: 
> org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find 
> Flink job (47fe3e8df0e59994938485f683d1410e)
>  2020-01-19T06:25:02.3857171Z at 
> org.apache.flink.runtime.rest.handler.job.JobExecutionResultHandler.propagateException(JobExecutionResultHandler.java:91)
>  2020-01-19T06:25:02.3857571Z at 
> org.apache.flink.runtime.rest.handler.job.JobExecutionResultHandler.lambda$handleRequest$1(JobExecutionResultHandler.java:82)
>  2020-01-19T06:25:02.3857866Z at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>  2020-01-19T06:25:02.3857982Z at 
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>  2020-01-19T06:25:02.3859852Z at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  2020-01-19T06:25:02.3860440Z at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>  2020-01-19T06:25:02.3860732Z at 
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:872)
>  2020-01-19T06:25:02.3860960Z at 
> akka.dispatch.OnComplete.internal(Future.scala:263)
>  2020-01-19T06:25:02.3861099Z at 
> akka.dispatch.OnComplete.internal(Future.scala:261)
>  2020-01-19T06:25:02.3861232Z at 
> akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
>  2020-01-19T06:25:02.3861391Z at 
> akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
>  2020-01-19T06:25:02.3861546Z at 
> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>  2020-01-19T06:25:02.3861712Z at 
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
>  2020-01-19T06:25:02.3861809Z at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>  2020-01-19T06:25:02.3861916Z at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>  2020-01-19T06:25:02.3862221Z at 
> akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572)
>  2020-01-19T06:25:02.3862475Z at 
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:23)
>  2020-01-19T06:25:02.3862626Z at 
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:21)
>  2020-01-19T06:25:02.3862736Z at 
> scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436)
>  2020-01-19T06:25:02.3862820Z at 
> scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435)
>  2020-01-19T06:25:02.3867146Z at 
> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>  2020-01-19T06:25:02.3867318Z at 
> akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
>  2020-01-19T06:25:02.3867441Z at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
>  2020-01-19T06:25:02.3867552Z at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
>  2020-01-19T06:25:02.3867664Z at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
>  2020-01-19T06:25:02.3867763Z at 
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
>  2020-01-19T06:25:02.3867843Z at 
> akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
>  2020-01-19T06:25:02.3867936Z at 
> akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
>  2020-01-19T06:25:02.3868036Z at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
>  2020-01-19T06:25:02.3868145Z at 
> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>  2020-01-19T06:25:02.3868223Z at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  2020-01-19T06:25:02.3868313Z at 
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>  2020-01-19T06:25:02.3868390Z at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>  2020-01-19T06:25:02.3868520Z Caused by: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find 
> Flink job (47fe3e8df0e59994938485f683d1410e)
>  2020-01-19T06:25:02.3868625Z at 
> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$requestJobStatus$17(Dispatcher.java:516)
>  2020-01-19T06:25:02.3868734Z at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>  2020-01-19T06:25:02.3868831Z at 
> java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:884)
>  2020-01-19T06:25:02.3869143Z at 
> java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2196)
>  2020-01-19T06:25:02.3869241Z at 
> org.apache.flink.runtime.dispatcher.Dispatcher.requestJobStatus(Dispatcher.java:510)
>  2020-01-19T06:25:02.3869319Z at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  2020-01-19T06:25:02.3869418Z at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  2020-01-19T06:25:02.3869506Z at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  2020-01-19T06:25:02.3869602Z at 
> java.lang.reflect.Method.invoke(Method.java:498)
>  2020-01-19T06:25:02.3869681Z at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:279)
>  2020-01-19T06:25:02.3869780Z at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:194)
>  2020-01-19T06:25:02.3869865Z at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>  2020-01-19T06:25:02.3869982Z at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>  2020-01-19T06:25:02.3870062Z at 
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>  2020-01-19T06:25:02.3870153Z at 
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>  2020-01-19T06:25:02.3870228Z at 
> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>  2020-01-19T06:25:02.3870399Z at 
> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>  2020-01-19T06:25:02.3870481Z at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>  2020-01-19T06:25:02.3870571Z at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>  2020-01-19T06:25:02.3870646Z at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>  2020-01-19T06:25:02.3870733Z at 
> akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>  2020-01-19T06:25:02.3870911Z at 
> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>  2020-01-19T06:25:02.3871013Z at 
> akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>  2020-01-19T06:25:02.3871086Z at 
> akka.actor.ActorCell.invoke(ActorCell.scala:561)
>  2020-01-19T06:25:02.3871170Z at 
> akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>  2020-01-19T06:25:02.3871350Z at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>  2020-01-19T06:25:02.3871439Z at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>  2020-01-19T06:25:02.3871509Z ... 4 more
>  2020-01-19T06:25:02.3871618Z Caused by: 
> org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find 
> Flink job (47fe3e8df0e59994938485f683d1410e)
>  2020-01-19T06:25:02.3871721Z at 
> org.apache.flink.runtime.dispatcher.Dispatcher.getJobMasterGatewayFuture(Dispatcher.java:776)
>  2020-01-19T06:25:02.3871827Z at 
> org.apache.flink.runtime.dispatcher.Dispatcher.requestJobStatus(Dispatcher.java:505)
>  2020-01-19T06:25:02.3871903Z ... 26 more
>  2020-01-19T06:25:02.3871975Z ]
>  
> [https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_apis/build/builds/4461/logs/15]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to