[
https://issues.apache.org/jira/browse/FLINK-22566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340158#comment-17340158
]
Matthias commented on FLINK-22566:
----------------------------------
For transparency, here are my findings based on the
[failed build #17606|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17606&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529]
that [~pnowojski] shared:
{code:java}
* Test
  * YARN nodes come up successfully (line 7368)
  * Job is identified as FAILED by the test at 5:36:40 (line 7514)
  * YARN's HistoryServer receives a connection refused error at 5:31:05 (line 7572)
    * succeeds at 5:31:16 (line 7617) but fails to initialize (line 7619)
      * Error creating done directory: [hdfs://master.docker-hadoop-cluster-network:9000/tmp/hadoop-yarn/staging/history/done] at 5:31:17 (line 7619)
        * Permission denied: user=mapred, access=WRITE, inode="/":hdfs:root:drwxr-xr-x
          * indicates a disallowed write operation on /, which has hdfs as owner and root as group
          * user is mapred
    * HistoryServer stops as a consequence at 5:31:17 (line 7797): FATAL error reported at 5:31:17 (line 7890)
    => learning: we constantly have issues with YARN's history servers. This error also appeared in older failed Kerberized YARN e2e test runs.
  * YARN's namenode logs look fine and stop at 5:36:37 (line 8296)
  * YARN's resourcemanager logs look fine and stop at 5:36:40 (line 8611)
  * YARN's timeline server logs look fine and stop at 5:36:40 (line 8687)
* Docker logs
  * namenode logs indicate a normal NameNode shutdown at 5:30:30 (line 8761)
    * NameNode address master.docker-hadoop-cluster-network/172.21.0.3 in shutdown message (line 8764)
  * Connection refused error at 5:31:04 accessing master.docker-hadoop-cluster-network/172.21.0.3 (line 8769)
    * Line 8803: From master.docker-hadoop-cluster-network/172.21.0.3 to master.docker-hadoop-cluster-network:9000 failed on connection exception: java.net.ConnectException
    * Normal error message according to https://cwiki.apache.org/confluence/display/HADOOP2/ConnectionRefused
  * Finished master initialization at 5:36:41 (line 8873)
* Flink logs
  * TaskExecutor #0 started at 5:36:21 (line 8904)
    * container ID: container_1620279065731_0001_01_000004
    * no errors
    * Received SIGTERM at 5:36:33 (line 9017)
  * TaskExecutor #1 started at 5:36:21 (line 9047)
    * container ID: container_1620279065731_0001_01_000003
    * no errors
    * Received SIGTERM at 5:36:34 (line 9164)
  * TaskExecutor #2 started at 5:34:41 (line 9196)
    * container ID: container_1620279065731_0001_01_000002
    * connected to JobManager for job e487226aebee9bced1a45881914ecf8e at 5:34:57 (line 9332)
    * Slot is offered, activated, and freed afterwards
      * but no Task logging indicates that the Task was executed
    * Job is removed from job leader monitoring at 5:36:32 (line 9336)
    * no errors
    * Received SIGTERM at 5:36:33 (line 9338)
  * JobManager started at 5:34:24 (line 9372)
    * Worker container_1620279065731_0001_01_000002 is registered at 5:34:56 (line 9563)
    * NoResourceAvailableException is thrown after timeout of 120s at 5:36:32 (line 9565)
    * With no restart strategy in place, the job switches into FAILED state at 5:36:32 (line 9695)
    * JobManager stopped at 5:36:34 (line 10045)
{code}
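The "Permission denied" finding above can be reproduced with a small sketch of a POSIX-style permission check. This is only an illustration of the reasoning, not HDFS's actual implementation; the helper {{can_access}} is hypothetical, and it assumes that {{mapred}} is neither a member of group {{root}} nor an HDFS superuser (the logs do not say either way).

```python
def can_access(mode, owner, group, user, user_groups, access):
    """Return True if `user` is granted `access` ('r', 'w' or 'x') by `mode`.

    `mode` is a 10-character string such as 'drwxr-xr-x'.
    Hypothetical helper; ACLs and superuser handling are ignored.
    """
    bits = mode[1:]  # drop the leading file-type character
    if user == owner:
        triple = bits[0:3]  # owner permissions
    elif group in user_groups:
        triple = bits[3:6]  # group permissions
    else:
        triple = bits[6:9]  # permissions for everyone else
    return access in triple


# Values taken from the log line:
#   user=mapred, access=WRITE, inode="/":hdfs:root:drwxr-xr-x
# ASSUMPTION: mapred belongs only to its own group, not to group root.
allowed = can_access(
    "drwxr-xr-x",
    owner="hdfs",
    group="root",
    user="mapred",
    user_groups={"mapred"},
    access="w",
)
print(allowed)  # False: only the owner hdfs has the write bit on /
```

Under these assumptions, {{mapred}} falls into the "other" class ({{r-x}}), so creating the done directory below / is rejected.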
We've seen similar behavior in FLINK-21859, where the Task was not executed on
the TaskExecutor's side. We're continuing the investigation by rerunning
the e2e test on AzureCI with the log level set to {{DEBUG}}.
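To make the JobManager timeline easier to follow, here is a rough sketch of the arithmetic behind the timestamps in this comment. It assumes the 120 s slot request timeout ran uninterrupted from a single start point, which the logs do not confirm; the variable names are made up for illustration.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"


def t(stamp):
    """Parse a wall-clock timestamp as quoted in the findings."""
    return datetime.strptime(stamp, FMT)


# Timestamps copied from the JobManager findings.
jobmanager_started = t("5:34:24")
worker_registered = t("5:34:56")
timeout_fired = t("5:36:32")
slot_request_timeout = timedelta(seconds=120)

# ASSUMPTION: the timeout was counted from one uninterrupted start point.
request_start = timeout_fired - slot_request_timeout
seconds_after_jm_start = (request_start - jobmanager_started).total_seconds()
seconds_with_worker = (timeout_fired - worker_registered).total_seconds()

print(request_start.strftime(FMT))  # 05:34:32
print(seconds_after_jm_start)       # 8.0
print(seconds_with_worker)          # 96.0
```

If the assumption holds, the slot request started about 8 s after the JobManager came up, and one worker was registered for the last 96 s of the timeout window without the request being fulfilled.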
> Running Kerberized YARN application on Docker test (default input) fails with
> no resources
> ------------------------------------------------------------------------------------------
>
> Key: FLINK-22566
> URL: https://issues.apache.org/jira/browse/FLINK-22566
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Affects Versions: 1.13.0
> Reporter: Dawid Wysakowicz
> Assignee: Matthias
> Priority: Blocker
> Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17558&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529&l=8745
> {code}
> May 05 01:29:04 Caused by: java.util.concurrent.TimeoutException: Timeout has occurred: 120000 ms
> May 05 01:29:04     at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:86) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_292]
> May 05 01:29:04     at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_292]
> May 05 01:29:04     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04     at akka.actor.Actor$class.aroundReceive(Actor.scala:517) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04     at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04     at akka.actor.ActorCell.invoke(ActorCell.scala:561) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04     at akka.dispatch.Mailbox.run(Mailbox.scala:225) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04     at akka.dispatch.Mailbox.exec(Mailbox.scala:235) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04     ... 4 more
> {code}