[
https://issues.apache.org/jira/browse/FLINK-22566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340700#comment-17340700
]
Matthias commented on FLINK-22566:
----------------------------------
It appears to be an infrastructure issue. My initial investigation was based on
the assumption that we're using {{AdaptiveScheduler}}, which would pick up the
work even if not enough resources are available. However, the logs prove that
that's not the case: the {{DefaultScheduler}} is used, which fails if we cannot
provide all the slots requested for the job.
The jobs ask for 3 slots. Hence, three TaskManagers are spun up. But we're
seeing some lag between the YARN workers becoming available (shown in YARN's
ResourceManager logs) and the TaskExecutors actually starting. I updated the
findings [in the comment
above|https://issues.apache.org/jira/browse/FLINK-22566?focusedCommentId=17340158&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17340158]
to reflect that. I also verified the findings for a few of the builds
mentioned in this ticket (namely [build
#408|https://dev.azure.com/mapohl/6a072220-8d55-43e8-85fc-08397ab083d1/_apis/build/builds/408/logs/135],
[build
#17552|https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_apis/build/builds/17552/logs/492]
and [build
#17557|https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_apis/build/builds/17557/logs/171]).
I didn't find other resources/logs to dig deeper into what's going on between
the YARN containers becoming available and the TaskManagers actually starting
on the Flink side.
We were not able to reproduce the issue on AzureCI [by looping over one of
the Kerberized YARN e2e
tests|https://dev.azure.com/mapohl/flink/_build/results?buildId=412&view=results].
From what we found so far, it looks like the test failures are caused by some
(apparently quite large; close to 2 minutes) infrastructure lag which we cannot
explain due to missing logs. I initiated a [second test loop
run|https://dev.azure.com/mapohl/flink/_build/results?buildId=413&view=results].
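For anyone who wants to check the lag in other builds, the comparison can be
sketched roughly as follows. This is only an illustration: the timestamp format
and the two sample timestamps are assumptions (they are not copied from the
build logs), and the helper names are made up. Only the 120000 ms value comes
from the TimeoutException in the ticket description.

```python
from datetime import datetime, timedelta

# Taken from the stack trace: "Timeout has occurred: 120000 ms".
SLOT_REQUEST_TIMEOUT_MS = 120_000

# Assumption: log4j-style timestamps such as "2021-05-05 01:27:01,500".
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def lag_ms(container_available: str, task_executor_started: str) -> int:
    """Milliseconds between a YARN container becoming available and the
    TaskExecutor actually starting (both given as log timestamps)."""
    start = datetime.strptime(container_available, LOG_TS_FORMAT)
    end = datetime.strptime(task_executor_started, LOG_TS_FORMAT)
    return (end - start) // timedelta(milliseconds=1)


def exceeds_timeout(lag: int, timeout_ms: int = SLOT_REQUEST_TIMEOUT_MS) -> bool:
    """True if the lag alone is at least as long as the slot request timeout."""
    return lag >= timeout_ms


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    lag = lag_ms("2021-05-05 01:27:01,500", "2021-05-05 01:28:59,250")
    print(f"lag: {lag} ms, exceeds timeout: {exceeds_timeout(lag)}")
```

A lag just under the timeout (as in the made-up values above) would still leave
almost no time for the TaskExecutors to register and offer their slots, so the
check is a lower bound rather than an exact failure criterion.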
> Running Kerberized YARN application on Docker test (default input) fails with
> no resources
> ------------------------------------------------------------------------------------------
>
> Key: FLINK-22566
> URL: https://issues.apache.org/jira/browse/FLINK-22566
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Affects Versions: 1.13.0
> Reporter: Dawid Wysakowicz
> Assignee: Matthias
> Priority: Blocker
> Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17558&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529&l=8745
> {code}
> May 05 01:29:04 Caused by: java.util.concurrent.TimeoutException: Timeout has
> occurred: 120000 ms
> May 05 01:29:04 at
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:86)
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> ~[?:1.8.0_292]
> May 05 01:29:04 at
> java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_292]
> May 05 01:29:04 at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 at
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 at
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 at
> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 at
> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 at
> akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 at
> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 at
> akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 at
> akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 ... 4 more
> {code}