[ 
https://issues.apache.org/jira/browse/TEZ-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968360#comment-14968360
 ] 

Jeff Zhang commented on TEZ-2581:
---------------------------------



bq. I noticed the document doesn't discuss much about how user-provided code in 
the Tez AM, like the vertex manager plugins, will be involved in the recovery. 
It mentions a DummyVertexManager, but that isn't defined. Seems like the user 
code would need to participate in some way, or we're changing the semantics of 
the plugin API. 
In this design, I treat the VM as complete once Vertex#doneReconfiguringVertex 
is invoked (currently, the VertexManager's only side effect on the Vertex is on 
its parallelism). If the VM is in the completed state, we will replace it with 
DummyVertexManager during recovery; otherwise we will recover the vertex from 
scratch. Ideally we should provide an API in VertexManagerPlugin that lets the 
user define completeness, but that would break API compatibility. In this 
phase, I plan to leave the API unchanged and focus on stabilizing the recovery 
framework. 
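
As a rough illustration of the decision described above (a sketch only, not 
code from the patch): the class, the enum, and the reconfigurationDone flag 
below are hypothetical names. The flag stands in for whatever the recovery log 
records when Vertex#doneReconfiguringVertex is invoked.

```java
// Hypothetical sketch of the recovery decision. None of these types exist in Tez.
final class VertexRecoverySketch {

  /** What recovery does with a vertex carried over from the previous app attempt. */
  enum RecoveryAction {
    // The VM already finished reconfiguring the vertex, so the user plugin
    // is replaced by a no-op manager (DummyVertexManager in the design doc).
    USE_DUMMY_VERTEX_MANAGER,
    // The VM had not finished, so the vertex is restarted from the beginning.
    RECOVER_FROM_SCRATCH
  }

  /**
   * @param reconfigurationDone assumed to be true if doneReconfiguringVertex
   *                            was seen for this vertex in the previous attempt
   */
  static RecoveryAction decide(boolean reconfigurationDone) {
    return reconfigurationDone
        ? RecoveryAction.USE_DUMMY_VERTEX_MANAGER
        : RecoveryAction.RECOVER_FROM_SCRATCH;
  }

  public static void main(String[] args) {
    System.out.println(decide(true));
    System.out.println(decide(false));
  }
}
```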

bq. The key in this process is that we can never allow any vertex to 
reconfigure parallelism after any task has started.
This is also not supported in the current code base. 
{code}
      if (!tasksNotYetScheduled) {
        String msg = "setParallelism cannot be called after scheduling tasks. Vertex: "
            + getLogIdentifier();
        LOG.info(msg);
        throw new TezUncheckedException(msg);
      }
{code}

bq. TEZ-2589's description implies there would be scenarios where we would 
recover a vertex from scratch, and doing so after any task has started in the 
previous DAG attempt can lead to data loss or duplication.
Since recovery happens in a new app attempt, the output data path will differ 
from that of the previous app attempt, so there should be no data loss or 
duplication. 
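
To illustrate why attempt-scoped paths avoid collisions, here is a minimal 
sketch. The directory layout (base/attempt_N/vertexName) and all names are 
assumptions made for the example, not the layout Tez actually uses.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: output directories keyed by app attempt number.
final class AttemptScopedOutputSketch {

  // Assumed layout: <base>/attempt_<n>/<vertexName>. Not Tez's real layout.
  static Path outputDir(Path base, int appAttempt, String vertexName) {
    return base.resolve("attempt_" + appAttempt).resolve(vertexName);
  }

  public static void main(String[] args) {
    Path base = Paths.get("tmp", "dag_output");
    Path previous = outputDir(base, 1, "Map1");
    Path recovered = outputDir(base, 2, "Map1");
    // A vertex rerun in attempt 2 writes under a different directory, so it
    // cannot overwrite or append to what attempt 1 produced.
    System.out.println(previous.equals(recovered)); // prints "false"
  }
}
```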


Finally, [~jlowe], thanks for the review of the design doc and patch. I have 
tested the patch on a small cluster by running some Hive queries for TPC-H. It 
would be helpful if you could try this patch on your cluster. Let me know if 
you need me to rebase the patch for your Tez version.



> Umbrella for Tez Recovery Redesign
> ----------------------------------
>
>                 Key: TEZ-2581
>                 URL: https://issues.apache.org/jira/browse/TEZ-2581
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-2581-WIP-1.patch, TEZ-2581-WIP-2.patch, 
> TEZ-2581-WIP-3.patch, TEZ-2581-WIP-4.patch, TEZ-2581-WIP-5.patch, 
> TEZ-2581-WIP-6.patch, TezRecoveryRedesignProposal.pdf, 
> TezRecoveryRedesignV1.1.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
