[ 
https://issues.apache.org/jira/browse/FLINK-22566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340700#comment-17340700
 ] 

Matthias commented on FLINK-22566:
----------------------------------

It appears to be an infrastructure issue. My initial investigation was based on 
the assumption that we're using the {{AdaptiveScheduler}}, which would pick up 
the work even if not enough resources are available. However, the logs show 
that this is not the case: the {{DefaultScheduler}} is used, which fails if 
we cannot provide all the slots requested for the job.

The jobs ask for 3 slots. Hence, three TaskManagers are spun up. But we're 
seeing a lag between the YARN workers becoming available (shown in YARN's 
ResourceManager logs) and the TaskExecutors actually starting. I updated the 
findings [in the comment above|https://issues.apache.org/jira/browse/FLINK-22566?focusedCommentId=17340158&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17340158] 
to reflect that. I also verified the findings for a few of the builds 
mentioned in this ticket (namely 
[build #408|https://dev.azure.com/mapohl/6a072220-8d55-43e8-85fc-08397ab083d1/_apis/build/builds/408/logs/135], 
[build #17552|https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_apis/build/builds/17552/logs/492] 
and 
[build #17557|https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_apis/build/builds/17557/logs/171]).

I didn't find other resources/logs to dig deeper into what's going on between 
the YARN containers becoming available and the TaskManagers actually starting 
on the Flink side.

We were not able to reproduce the issue on AzureCI 
[by looping over one of the Kerberized YARN e2e tests|https://dev.azure.com/mapohl/flink/_build/results?buildId=412&view=results].

From what we found so far, it looks like the test failures are caused by some 
fairly large (close to 2 minutes) infrastructure lag, which we cannot explain 
due to missing logs. I initiated a 
[second test loop run|https://dev.azure.com/mapohl/flink/_build/results?buildId=413&view=results].
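
For anyone who wants to check the lag in another build, a minimal sketch of the comparison is below. It assumes the "May 05 01:29:04" timestamp prefix seen in the CI log of this ticket; the two sample log lines, the helper names and the year are hypothetical and only serve as illustration. The 120000 ms value is the timeout reported in the stack trace. The comparison is rough, since the timeout starts with the slot request rather than with the container becoming available.

```python
from datetime import datetime, timedelta

# Taken from the TimeoutException in the quoted stack trace (120000 ms).
SLOT_REQUEST_TIMEOUT = timedelta(milliseconds=120000)


def parse_ci_timestamp(line: str, year: int = 2021) -> datetime:
    """Parse the 'May 05 01:29:04' prefix of a CI log line.

    The log prefix carries no year, so `year` is an assumption.
    """
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def startup_lag(container_line: str, task_executor_line: str) -> timedelta:
    """Time between a container becoming available and a TaskExecutor starting."""
    return parse_ci_timestamp(task_executor_line) - parse_ci_timestamp(container_line)


# Hypothetical log lines, for illustration only (not copied from the builds).
container_ready = "May 05 01:27:01 container allocated"
executor_started = "May 05 01:28:58 TaskExecutor started"

lag = startup_lag(container_ready, executor_started)
share = lag / SLOT_REQUEST_TIMEOUT  # fraction of the timeout consumed by the lag
print(f"lag: {lag.total_seconds():.0f} s, share of timeout: {share:.0%}")
```

Feeding it the real ResourceManager and TaskExecutor lines from a failing build should show how much of the 120 s budget is consumed before the TaskManagers come up.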

> Running Kerberized YARN application on Docker test (default input) fails with no resources
> ------------------------------------------------------------------------------------------
>
>                 Key: FLINK-22566
>                 URL: https://issues.apache.org/jira/browse/FLINK-22566
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.13.0
>            Reporter: Dawid Wysakowicz
>            Assignee: Matthias
>            Priority: Blocker
>              Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17558&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529&l=8745
> {code}
> May 05 01:29:04 Caused by: java.util.concurrent.TimeoutException: Timeout has occurred: 120000 ms
> May 05 01:29:04       at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:86) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_292]
> May 05 01:29:04       at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_292]
> May 05 01:29:04       at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04       at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04       at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04       at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04       at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04       at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04       at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04       at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04       at akka.actor.Actor$class.aroundReceive(Actor.scala:517) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04       at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04       at akka.actor.ActorCell.invoke(ActorCell.scala:561) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04       at akka.dispatch.Mailbox.run(Mailbox.scala:225) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04       at akka.dispatch.Mailbox.exec(Mailbox.scala:235) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04       ... 4 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
