[ https://issues.apache.org/jira/browse/TEZ-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
László Bodor resolved TEZ-4488.
-------------------------------
    Resolution: Fixed

> TaskSchedulerManager might not be initialized when the first DAG comes
> ----------------------------------------------------------------------
>
>                 Key: TEZ-4488
>                 URL: https://issues.apache.org/jira/browse/TEZ-4488
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>             Fix For: 0.10.3
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> {code}
> query-coordinator <11>1 2023-04-03T12:54:12.056Z query-coordinator-0-0 query-coordinator 1 10ea11e4-d4dc-4231-878e-0c8c07eda53b [mdc@18060 class="impl.DAGImpl" level="ERROR" thread="IPC Server handler 1 on 22222"] Uncaught Exception when handling event DAG_INIT on Dag dag_1680526446742_0000_1 at currentState=NEW
> java.lang.NullPointerException
>     at org.apache.tez.dag.app.rm.TaskSchedulerManager.getTaskSchedulerClassName(TaskSchedulerManager.java:1082)
>     at org.apache.tez.dag.app.DAGAppMaster$RunningAppContext.getTaskSchedulerClassName(DAGAppMaster.java:1702)
>     at org.apache.tez.dag.app.dag.impl.VertexImpl.<init>(VertexImpl.java:1061)
>     at org.apache.tez.dag.app.dag.impl.DAGImpl.createVertex(DAGImpl.java:1741)
>     at org.apache.tez.dag.app.dag.impl.DAGImpl.initializeDAG(DAGImpl.java:1596)
>     at org.apache.tez.dag.app.dag.impl.DAGImpl$InitTransition.transition(DAGImpl.java:1869)
>     at org.apache.tez.dag.app.dag.impl.DAGImpl$InitTransition.transition(DAGImpl.java:1846)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>     at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:59)
>     at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1219)
>     at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:158)
>     at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2231)
>     at org.apache.tez.dag.app.DAGAppMaster.startDAGExecution(DAGAppMaster.java:2608)
>     at org.apache.tez.dag.app.DAGAppMaster.startDAG(DAGAppMaster.java:2573)
>     at org.apache.tez.dag.app.DAGAppMaster.submitDAGToAppMaster(DAGAppMaster.java:1379)
>     at org.apache.tez.dag.api.client.DAGClientHandler.submitDAG(DAGClientHandler.java:145)
>     at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.submitDAG(DAGClientAMProtocolBlockingPBServerImpl.java:187)
>     at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:8519)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917)
>     at java.base/java.security.AccessController.doPrivileged(Native Method)
>     at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894)
> {code}
> Currently, [TaskSchedulerManager depends on clientRpcServer|https://github.com/apache/tez/blob/9a729cd62d2dabb79d54ff5b5bcc696bf7344489/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L588], so TaskSchedulerManager waits for clientRpcServer to start. However, once clientRpcServer is initialized, it can handle requests from clients such as HiveServer2, so HiveServer2 is able to submit a DAG even before TaskSchedulerManager is initialized.
> We cannot change the order of the service dependency, because TaskSchedulerManager [needs the app host/port|https://github.com/apache/tez/blob/9a729cd62d2dabb79d54ff5b5bcc696bf7344489/tez-dag/src/main/java/org/apache/tez/dag/app/rm/TaskSchedulerManager.java#L646], so we might want to simply block the very first DAG submission while TaskSchedulerManager is not ready.
> To solve this dependency cycle, my proposal is to introduce a service that can depend on the services that must be started and initialized before the first DAG comes, and whose state can be checked before DAG submission. So basically I introduced a directed dependency graph like the one below:
> {code}
> appMasterReadinessService -> taskSchedulerManager -> clientRpcServer
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
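For readers who want the proposal in the description in concrete form, here is a minimal Java sketch of a readiness gate that is checked before a DAG is accepted. It is not taken from the Tez patch: the names `ReadinessSketch`, `ReadinessGate`, `Submitter`, `markReady`, `awaitReady` and `submitDag` are hypothetical, and the latch-based waiting is an assumption about one way to block the first submission until the scheduler is up.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class ReadinessSketch {

    /** Hypothetical stand-in for the proposed readiness service. */
    static final class ReadinessGate {
        private final CountDownLatch ready = new CountDownLatch(1);

        /** Called once every service the gate depends on has started. */
        void markReady() {
            ready.countDown();
        }

        boolean isReady() {
            return ready.getCount() == 0;
        }

        /** Blocks until ready or until the timeout elapses; returns whether the gate is open. */
        boolean awaitReady(long timeoutMillis) {
            try {
                return ready.await(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    /** Hypothetical stand-in for the RPC-facing DAG submission path. */
    static final class Submitter {
        private final ReadinessGate gate;

        Submitter(ReadinessGate gate) {
            this.gate = gate;
        }

        String submitDag(String dagName, long timeoutMillis) {
            // Check readiness first, so DAG init never touches an uninitialized scheduler.
            if (!gate.awaitReady(timeoutMillis)) {
                throw new IllegalStateException("App master not ready, rejecting " + dagName);
            }
            return "accepted:" + dagName;
        }
    }

    public static void main(String[] args) {
        ReadinessGate gate = new ReadinessGate();
        Submitter submitter = new Submitter(gate);

        // Simulate the scheduler finishing startup shortly after the RPC server is reachable.
        Thread startup = new Thread(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            gate.markReady();
        });
        startup.start();

        // The early submission waits for the gate instead of failing with an NPE.
        System.out.println(submitter.submitDag("dag_1", 5_000));
    }
}
```

In this sketch an early submission either waits for the gate to open or fails with a clear error after the timeout, rather than reaching DAG initialization while the scheduler is still null.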