[ https://issues.apache.org/jira/browse/TEZ-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437400#comment-16437400 ]
Jason Lowe commented on TEZ-3914:
---------------------------------
Thanks for the report and patch! Many of the unit test failures are related
to the patch. Could you elaborate a bit more on the approach taken for the
fix? It's a rather sizeable patch, and a high-level overview would help with
the review.
> Recovering a large DAG hang job
> -------------------------------
>
> Key: TEZ-3914
> URL: https://issues.apache.org/jira/browse/TEZ-3914
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jonathan Eagles
> Assignee: Jonathan Eagles
> Priority: Major
> Attachments: TEZ-3914.001.patch, TEZ-3914.002.patch
>
>
> Any failure to parse a recovery event is ignored and treated as EOF. The job
> can hang because some task completions may be missed, and the shuffle will
> then wait indefinitely for them.
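
The failure mode in the description can be illustrated with a minimal,
self-contained sketch. This is not the Tez recovery code: the record format
(a magic byte, a one-byte length, a UTF-8 payload) and every class and method
name below are hypothetical and exist only for illustration. The lenient
reader swallows any parse error and stops as if it had reached end of file,
so events after the bad record are silently dropped; the strict reader treats
only a missing header at a record boundary as EOF and propagates everything
else.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Tez recovery implementation.
class RecoveryLogSketch {
    // Assumed record layout: MAGIC byte, one-byte length, UTF-8 payload.
    static final int MAGIC = 0x7E;

    /** Builds a log of three events, optionally corrupting the second header. */
    public static byte[] sampleLog(boolean corruptSecondEvent) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            int secondEventOffset = -1;
            for (int i = 1; i <= 3; i++) {
                if (i == 2) {
                    secondEventOffset = bytes.size();
                }
                byte[] payload = ("task_" + i + " finished").getBytes(StandardCharsets.UTF_8);
                out.writeByte(MAGIC);
                out.writeByte(payload.length); // sketch assumes payloads under 256 bytes
                out.write(payload);
            }
            byte[] log = bytes.toByteArray();
            if (corruptSecondEvent) {
                log[secondEventOffset] = 0; // overwrite the magic byte
            }
            return log;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the next event, or null only on a clean EOF at a record boundary. */
    static String readEvent(DataInputStream in) throws IOException {
        int magic = in.read();
        if (magic < 0) {
            return null; // genuine end of file
        }
        if (magic != MAGIC) {
            throw new IOException("corrupt recovery event header: " + magic);
        }
        int length = in.readUnsignedByte(); // throws EOFException if truncated
        byte[] payload = new byte[length];
        in.readFully(payload);              // throws EOFException if truncated
        return new String(payload, StandardCharsets.UTF_8);
    }

    /** Problematic pattern: any parse failure is swallowed and treated as EOF. */
    public static List<String> readLenient(byte[] log) {
        List<String> events = new ArrayList<>();
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(log));
        try {
            String event;
            while ((event = readEvent(in)) != null) {
                events.add(event);
            }
        } catch (IOException ignored) {
            // Corruption is indistinguishable from EOF here; later events are lost.
        }
        return events;
    }

    /** Stricter pattern: only a clean EOF ends the read; parse failures propagate. */
    public static List<String> readStrict(byte[] log) throws IOException {
        List<String> events = new ArrayList<>();
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(log));
        String event;
        while ((event = readEvent(in)) != null) {
            events.add(event);
        }
        return events;
    }

    public static void main(String[] args) {
        byte[] log = sampleLog(true);
        System.out.println("lenient: " + readLenient(log));
        try {
            System.out.println("strict: " + readStrict(log));
        } catch (IOException e) {
            System.out.println("strict failed: " + e.getMessage());
        }
    }
}
```

With the corrupted sample log, the lenient reader returns only the first
event and reports no error, which mirrors how missed task completions could
go unnoticed during recovery. The strict reader surfaces the failure so the
caller can decide how to handle it.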