[ https://issues.apache.org/jira/browse/HIVE-6638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932152#comment-13932152 ]
Ashutosh Chauhan commented on HIVE-6638: ---------------------------------------- {color:blue} h3. How MR AM restart works? {color} If MR AM crashes, RM restarts AM (by default number of retrials = 2). Before another AM restarts job, it tries to recover work already done by previous AM. This is done in conjunction with help from OutputFormat. FileOutputFormat e.g., writes to a temp dir which uses AM attempt id to construct its destination path. When AM comes up, it has a new id. As a part of recovery, completed work of previous AM is moved to new paths which has attempt id of new AM. Than tasks which didn’t complete in previous attempt are restarted and begins to write into new paths which contains new AM attempt id. {color:blue} h3. What Hive needs to do to adapt to AM restart? {color} For Hive to support recovery for AM restart case, following changes are required: a) Hive always set HiveOutputFormatImpl as its OutputFormat. In this OF, we will implement newly introduced isRecoverySupported() and recoverTask() methods. b) FileSink operator is passed in specPath where it does all its writing. It first writes to a tmp location alongside it and at the end of task promotes output to specPath location. This tmp location within specPath needs to be modified to take into account AM attempt id. c) in recoverTask() method of OF, all the committed output of previous AM needs to be moved under new JobAttemptId. This is very similar to FileOutputCommitter::recoverTask. d) In commitJob(), we need to promote from AM attempt id dir to dir under specPath. *Note* testing needs to be done for following areas for this feature * Insertion into Dynamic partition * Insertion into bucketed table * Secure setup > Hive needs to implement recovery for Application Master restart > ---------------------------------------------------------------- > > Key: HIVE-6638 > URL: https://issues.apache.org/jira/browse/HIVE-6638 > Project: Hive > Issue Type: Improvement > Components: Query Processor > Affects Versions: 0.11.0, 0.12.0, 0.13.0 > Reporter: Ashutosh Chauhan > > Currently, if AM restarts, whole job is restarted. Although, job and > subsequently query would still finish to completion, it would be nice if Hive > don't need to redo all the work done under previous AM. -- This message was sent by Atlassian JIRA (v6.2#6252)