Ming Ma created TEZ-3239:
----------------------------

             Summary: ShuffleVertexManager recovery issue when auto parallelism is enabled
                 Key: TEZ-3239
                 URL: https://issues.apache.org/jira/browse/TEZ-3239
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Ming Ma


Repro:

* Enable {{tez.shuffle-vertex-manager.enable.auto-parallel}}.
* Kill the Tez AM container after the job has reached the point where the vertex manager (VM) has reconfigured the edge.
* The new Tez AM attempt fails with the following error.

{noformat}
org.apache.tez.dag.api.TezUncheckedException: Atleast 1 bipartite source should exist
at org.apache.tez.dag.library.vertexmanager.ShuffleVertexManager.onVertexStarted(ShuffleVertexManager.java:497)
at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEventOnVertexStarted.invoke(VertexManager.java:589)
at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent$1.run(VertexManager.java:658)
at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent$1.run(VertexManager.java:653)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
{noformat}

The failure occurs because the edge's data movement type changes to {{DataMovementType.CUSTOM}} after reconfiguration, so on recovery the following check no longer counts that edge as a bipartite source. Allowing {{DataMovementType.CUSTOM}} in the check seems to fix the issue.

{noformat}
      if (entry.getValue().getDataMovementType() == DataMovementType.SCATTER_GATHER) {
        bipartiteSources++;
      }
{noformat}
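
A minimal, self-contained sketch of the relaxed check suggested above. It is not the actual Tez patch: the class {{BipartiteSourceCheck}}, the method {{countBipartiteSources}}, and the local {{DataMovementType}} enum are hypothetical stand-ins so the example runs without Tez on the classpath, and the map of edge types stands in for the real source edge properties. Whether every {{CUSTOM}} edge should count as bipartite is an open question that a real fix may need to narrow.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone illustration; not code from ShuffleVertexManager.
final class BipartiteSourceCheck {

    // Local stand-in for Tez's DataMovementType (assumption: same constant names).
    enum DataMovementType { ONE_TO_ONE, BROADCAST, SCATTER_GATHER, CUSTOM }

    // Counts source edges treated as bipartite. Unlike the check quoted above,
    // CUSTOM is accepted too, because a reconfigured edge reports CUSTOM
    // once auto parallelism has been applied.
    static int countBipartiteSources(Map<String, DataMovementType> sourceEdges) {
        int bipartiteSources = 0;
        for (Map.Entry<String, DataMovementType> entry : sourceEdges.entrySet()) {
            DataMovementType type = entry.getValue();
            if (type == DataMovementType.SCATTER_GATHER
                    || type == DataMovementType.CUSTOM) {
                bipartiteSources++;
            }
        }
        return bipartiteSources;
    }

    public static void main(String[] args) {
        // Simulated state after recovery: the shuffle edge is now CUSTOM.
        Map<String, DataMovementType> edges = new LinkedHashMap<>();
        edges.put("reconfiguredSource", DataMovementType.CUSTOM);
        edges.put("broadcastSource", DataMovementType.BROADCAST);
        System.out.println(countBipartiteSources(edges));
    }
}
```

With the original {{SCATTER_GATHER}}-only comparison, the simulated post-recovery input would yield zero bipartite sources, which is the condition that raises the exception in the stack trace.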




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
