Hi folks,

I am running a SQL-like query (similar to TPC-H Q3) on a prototype of Flink-on-Tez, using the latest Tez master. It runs fine at smaller scales (e.g., TPC-H scale 500), but at scale 1250 I get the following error:

15/01/08 22:07:42 INFO client.DAGClientImpl: DAG initialized: CurrentState=Running
15/01/08 22:07:43 INFO client.DAGClientImpl: DAG: State: RUNNING Progress: 0% TotalTasks: 12092 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
15/01/08 22:07:47 INFO client.DAGClientImpl: DAG: State: FAILED Progress: 0% TotalTasks: 12092 Succeeded: 0 Running: 0 Failed: 0 Killed: 1246
15/01/08 22:07:47 INFO client.DAGClientImpl: DAG completed. FinalState=FAILED
15/01/08 22:07:47 ERROR client.TezExecutor: Flink Java Job at Thu Jan 08 22:07:33 CET 2015 failed with diagnostics: [Vertex failed, vertexName=DataSource (at getCustomerDataSet(TPCHQuery3.java:217) (org.apache.flink.api.java.io.CsvInputFormat))032ae46b-6bb9-4dcf-91c4-7e49b49f6546, vertexId=vertex_1420727594991_0021_1_01, diagnostics=[Exception in VertexManager, vertex=vertex_1420727594991_0021_1_01 [DataSource (at getCustomerDataSet(TPCHQuery3.java:217) (org.apache.flink.api.java.io.CsvInputFormat))032ae46b-6bb9-4dcf-91c4-7e49b49f6546],java.lang.NullPointerException
    at org.apache.tez.dag.app.dag.impl.VertexImpl.handleRoutedTezEvents(VertexImpl.java:3857)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.scheduleTasks(VertexImpl.java:1265)
    at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerPluginContextImpl.scheduleVertexTasks(VertexManager.java:144)
    at org.apache.tez.dag.app.dag.impl.ImmediateStartVertexManager.scheduleTasks(ImmediateStartVertexManager.java:100)
    at org.apache.tez.dag.app.dag.impl.ImmediateStartVertexManager.onVertexStarted(ImmediateStartVertexManager.java:75)
    at org.apache.tez.dag.app.dag.impl.VertexManager.onVertexStarted(VertexManager.java:365)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.startVertex(VertexImpl.java:3320)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.access$5100(VertexImpl.java:183)
    at org.apache.tez.dag.app.dag.impl.VertexImpl$StartTransition.transition(VertexImpl.java:3293)
    at org.apache.tez.dag.app.dag.impl.VertexImpl$StartTransition.transition(VertexImpl.java:3285)
    at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1587)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:182)
    at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1763)
    at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1749)
    at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
    at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:116)
    at java.lang.Thread.run(Thread.java:745)

It may well be that this is a problem on the Flink side of things (albeit not visible in the stack trace); I was just wondering if the error rings a bell. Since I am creating one input task per input split, this job results in quite a few input tasks. The logs contain one more exception [1], caused by the exception above.

Best,
Kostas

[1]
2015-01-08 22:07:44,448 ERROR [Dispatcher thread: Central] impl.VertexImpl: Exception in VertexManager, vertex=vertex_1420727594991_0021_1_01 [DataSource (at getCustomerDataSet(TPCHQuery3.java:217) (org.apache.flink.api.java.io.CsvInputFormat))032ae46b-6bb9-4dcf-91c4-7e49b49f6546]
org.apache.tez.dag.app.dag.impl.AMUserCodeException: java.lang.NullPointerException
    at org.apache.tez.dag.app.dag.impl.VertexManager.onVertexStarted(VertexManager.java:368)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.startVertex(VertexImpl.java:3320)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.access$5100(VertexImpl.java:183)
    at org.apache.tez.dag.app.dag.impl.VertexImpl$StartTransition.transition(VertexImpl.java:3293)
    at org.apache.tez.dag.app.dag.impl.VertexImpl$StartTransition.transition(VertexImpl.java:3285)
    at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1587)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:182)
    at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1763)
    at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1749)
    at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
    at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:116)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at org.apache.tez.dag.app.dag.impl.VertexImpl.handleRoutedTezEvents(VertexImpl.java:3857)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.scheduleTasks(VertexImpl.java:1265)
    at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerPluginContextImpl.scheduleVertexTasks(VertexManager.java:144)
    at org.apache.tez.dag.app.dag.impl.ImmediateStartVertexManager.scheduleTasks(ImmediateStartVertexManager.java:100)
    at org.apache.tez.dag.app.dag.impl.ImmediateStartVertexManager.onVertexStarted(ImmediateStartVertexManager.java:75)
    at org.apache.tez.dag.app.dag.impl.VertexManager.onVertexStarted(VertexManager.java:365)
    ... 16 more
