[ 
https://issues.apache.org/jira/browse/FLINK-34343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813902#comment-17813902
 ] 

Chesnay Schepler commented on FLINK-34343:
------------------------------------------

[~mapohl] and I looked at this together and concluded that this is one hell of a 
race condition. There's a short window where the underlying infrastructure for 
_receiving_ messages already exists, but the rpcServer field has not yet been 
set in the RpcEndpoint.
While these messages are correctly rejected by the RpcActor before being passed 
to the RpcEndpoint (as it hasn't started yet), for logging purposes we access 
the RpcEndpoint's address; but that one is contingent on the rpcServer field 
being set, hence the NullPointerException.
In the end it boils down to accessing the RpcEndpoint while it is still being 
set up.

An easy fix is to use the actor path instead of the address.
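
To illustrate the window described above, here is a minimal sketch. It is not 
Flink code: Server, Endpoint, Actor and both discard methods are hypothetical 
stand-ins, and the only assumption carried over from the analysis is that the 
actor knows its own path before the endpoint's rpcServer field is assigned.

```java
import java.util.Objects;

public class EarlyMessageSketch {

    /** Hypothetical stand-in for the RPC server handle (not Flink's RpcServer). */
    static final class Server {
        private final String address;

        Server(String address) {
            this.address = address;
        }

        String getAddress() {
            return address;
        }
    }

    /** Hypothetical endpoint whose rpcServer field is assigned late in setup. */
    static final class Endpoint {
        private volatile Server rpcServer; // null until finishSetup() is called

        void finishSetup(Server server) {
            this.rpcServer = Objects.requireNonNull(server);
        }

        String getAddress() {
            // Throws NullPointerException if called before finishSetup().
            return rpcServer.getAddress();
        }
    }

    /** Hypothetical actor that can already receive messages during setup. */
    static final class Actor {
        private final Endpoint endpoint;
        private final String actorPath; // known as soon as the actor exists

        Actor(Endpoint endpoint, String actorPath) {
            this.endpoint = endpoint;
            this.actorPath = actorPath;
        }

        /** Problematic variant: builds the log text from the endpoint's address. */
        String discardUsingEndpointAddress(Object message) {
            return "Endpoint " + endpoint.getAddress()
                    + " has not been started yet. Discarding " + message;
        }

        /** Safer variant: builds the log text from the actor's own path. */
        String discardUsingActorPath(Object message) {
            return "Endpoint " + actorPath
                    + " has not been started yet. Discarding " + message;
        }
    }

    public static void main(String[] args) {
        Endpoint endpoint = new Endpoint();
        Actor actor = new Actor(endpoint, "pekko://flink/user/rpc/resourcemanager_2");

        // A message arrives before finishSetup() has run.
        try {
            actor.discardUsingEndpointAddress("registerJobMaster");
        } catch (NullPointerException e) {
            System.out.println("address-based logging failed during setup");
        }
        System.out.println(actor.discardUsingActorPath("registerJobMaster"));
    }
}
```

The path-based variant never touches the endpoint, so it does not depend on how 
far the endpoint's setup has progressed when the early message arrives.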

> ResourceManager registration is not completed when registering the JobMaster
> ----------------------------------------------------------------------------
>
>                 Key: FLINK-34343
>                 URL: https://issues.apache.org/jira/browse/FLINK-34343
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Runtime / RPC
>    Affects Versions: 1.17.2, 1.19.0, 1.18.1
>            Reporter: Matthias Pohl
>            Priority: Critical
>              Labels: test-stability
>         Attachments: FLINK-34343_k8s_application_cluster_e2e_test.log
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57203&view=logs&j=64debf87-ecdb-5aef-788d-8720d341b5cb&t=2302fb98-0839-5df2-3354-bbae636f81a7&l=8066
> The test run failed due to a NullPointerException:
> {code}
> Feb 02 01:11:55 2024-02-02 01:11:47,791 INFO  
> org.apache.flink.runtime.rpc.pekko.FencedPekkoRpcActor       [] - The rpc 
> endpoint 
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager has not 
> been started yet. Discarding message LocalFencedMessage(0000000000000000000
> 0000000000000, 
> LocalRpcInvocation(ResourceManagerGateway.registerJobMaster(JobMasterId, 
> ResourceID, String, JobID, Time))) until processing is started.
> Feb 02 01:11:55 2024-02-02 01:11:47,797 WARN  
> org.apache.flink.runtime.rpc.pekko.SupervisorActor           [] - RpcActor 
> pekko://flink/user/rpc/resourcemanager_2 has failed. Shutting it down now.
> Feb 02 01:11:55 java.lang.NullPointerException: Cannot invoke 
> "org.apache.flink.runtime.rpc.RpcServer.getAddress()" because 
> "this.rpcServer" is null
> Feb 02 01:11:55         at 
> org.apache.flink.runtime.rpc.RpcEndpoint.getAddress(RpcEndpoint.java:322) 
> ~[flink-dist-1.19-SNAPSHOT.jar:1.19-SNAPSHOT]
> Feb 02 01:11:55         at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleMessage(PekkoRpcActor.java:182)
>  ~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
> Feb 02 01:11:55         at 
> org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33) 
> ~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
> Feb 02 01:11:55         at 
> org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29) 
> ~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
> Feb 02 01:11:55         at 
> scala.PartialFunction.applyOrElse(PartialFunction.scala:127) 
> ~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
> Feb 02 01:11:55         at 
> scala.PartialFunction.applyOrElse$(PartialFunction.scala:126) 
> ~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
> Feb 02 01:11:55         at 
> org.apache.pekko.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:29)
>  ~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
> Feb 02 01:11:55         at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175) 
> ~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
> Feb 02 01:11:55         at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176) 
> ~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
> Feb 02 01:11:55         at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176) 
> ~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
> Feb 02 01:11:55         at 
> org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547) 
> ~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
> Feb 02 01:11:55         at 
> org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545) 
> ~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
> Feb 02 01:11:55         at 
> org.apache.pekko.actor.AbstractActor.aroundReceive(AbstractActor.scala:229) 
> ~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
> Feb 02 01:11:55         at 
> org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590) 
> ~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
> Feb 02 01:11:55         at 
> org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557) 
> ~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
> Feb 02 01:11:55         at 
> org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:280) 
> ~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
> Feb 02 01:11:55         at 
> org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:241) 
> ~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
> Feb 02 01:11:55         at 
> org.apache.pekko.dispatch.Mailbox.exec(Mailbox.scala:253) 
> ~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
> Feb 02 01:11:55         at java.util.concurrent.ForkJoinTask.doExec(Unknown 
> Source) ~[?:?]
> Feb 02 01:11:55         at 
> java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source) 
> ~[?:?]
> Feb 02 01:11:55         at java.util.concurrent.ForkJoinPool.scan(Unknown 
> Source) ~[?:?]
> Feb 02 01:11:55         at 
> java.util.concurrent.ForkJoinPool.runWorker(Unknown Source) ~[?:?]
> Feb 02 01:11:55         at 
> java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source) ~[?:?]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
