[GitHub] [tez] abstractdog commented on a diff in pull request #266: TEZ-4474: Added config to fail the DAG status when recovery data is missing

via GitHub Wed, 22 Feb 2023 00:31:15 -0800


abstractdog commented on code in PR #266:
URL: https://github.com/apache/tez/pull/266#discussion_r1113971444



##########
tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java:
##########
@@ -1919,6 +1919,13 @@ private DAGRecoveryData recoverDAG() throws IOException, 
TezException {
           RecoveryParser recoveryParser = new RecoveryParser(
               this, recoveryFS, recoveryDataDir, appAttemptID.getAttemptId());
           DAGRecoveryData recoveredDAGData = 
recoveryParser.parseRecoveryData();
+
+          if(Objects.isNull(recoveredDAGData) && amConf.getBoolean(
+                  TezConfiguration.TEZ_AM_FAILURE_ON_MISSING_RECOVERY_DATA,
+                  
TezConfiguration.TEZ_AM_FAILURE_ON_MISSING_RECOVERY_DATA_DEFAULT)) {
+            throw new IOException("Found nothing to recover in 
currentAttemptId= "

Review Comment:
   a couple of comments here:
   
   1. I think it would make sense to extract the whole parse logic to another 
method (from the instantiation of the recoveryParser) to prevent recoverDAG 
grow further
   
   2. regarding the actual fix: Objects.isNull tells almost nothing about the 
scenario that has been discussed in jira, so you should describe it in a code 
comment here, touching the most important points (AM shutdown prematurely, so 
as the recovery service, recovery stream is not closed, there is no valid 
recovery data, recovery parser returns null instead of failing, etc.), also, 
describe the assumption here: callers of this method should catch this 
exception and make the AM finish with state of DAGAppMasterState.ERROR (this is 
currently true, but after a refactor, it can simply collide, as we don't do the 
actual AM failure logic here)
   
   3. what about having the recoveryDataDir path in the exception message to 
help the log reader investigate further?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [tez] abstractdog commented on a diff in pull request #266: TEZ-4474: Added config to fail the DAG status when recovery data is missing

Reply via email to