[ 
https://issues.apache.org/jira/browse/GOBBLIN-2181?focusedWorklogId=948317&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-948317
 ]

ASF GitHub Bot logged work on GOBBLIN-2181:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 13/Dec/24 16:58
            Start Date: 13/Dec/24 16:58
    Worklog Time Spent: 10m 
      Work Description: phet commented on code in PR #4084:
URL: https://github.com/apache/gobblin/pull/4084#discussion_r1884208900


##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagProcessingEngine.java:
##########
@@ -149,6 +158,16 @@ public void run() {
           dagTask.conclude();
           log.info(dagProc.contextualizeStatus("concluded dagTask"));
         } catch (Exception e) {
+          log.error("DagProcEngineThread: " + dagProc.contextualizeStatus("error"), e);
+          if (!DagProcessingEngine.isTransientException(e)) {
+            DagActionStore.DagAction dagAction = dagTask.getDagAction();
+            if (dagAction != null) {
+              log.warn(dagProc.contextualizeStatus("ignoring non-transient exception by concluding so no retries"));
+            }
+            dagManagementStateStore.getDagManagerMetrics().dagProcessingNonRetryableExceptionMeter.mark();
+            dagTask.conclude();
+          }
+          // TODO add the else block for transient exceptions and add conclude task only if retry limit is not breached
           log.error("DagProcEngineThread: " + dagProc.contextualizeStatus("error"), e);

Review Comment:
   would be clearer to reorder and present the exception w/ stack trace first, 
before describing how the non-transient subset are getting special handling



##########
gobblin-service/src/test/java/org/apache/gobblin/service/modules/orchestration/proc/DagProcUtilsTest.java:
##########
@@ -0,0 +1,150 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+
+package org.apache.gobblin.service.modules.orchestration.proc;
+
+import com.typesafe.config.Config;
+import com.typesafe.config.ConfigFactory;
+import com.typesafe.config.ConfigValueFactory;
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.concurrent.ExecutionException;
+import java.util.stream.Collectors;
+import org.apache.gobblin.config.ConfigBuilder;
+import org.apache.gobblin.configuration.ConfigurationKeys;
+import org.apache.gobblin.runtime.api.FlowSpec;
+import org.apache.gobblin.runtime.api.JobSpec;
+import org.apache.gobblin.runtime.api.Spec;
+import org.apache.gobblin.runtime.api.SpecExecutor;
+import org.apache.gobblin.runtime.api.SpecProducer;
+import org.apache.gobblin.runtime.spec_executorInstance.InMemorySpecExecutor;
+import org.apache.gobblin.runtime.spec_executorInstance.MockedSpecExecutor;
+import org.apache.gobblin.service.modules.flowgraph.Dag;
+import org.apache.gobblin.service.modules.flowgraph.FlowGraphConfigurationKeys;
+import org.apache.gobblin.service.modules.orchestration.DagActionStore;
+import org.apache.gobblin.service.modules.orchestration.DagManagementStateStore;
+import org.apache.gobblin.service.modules.orchestration.DagManagerMetrics;
+import org.apache.gobblin.service.modules.spec.JobExecutionPlan;
+import org.mockito.Mockito;
+import org.testng.annotations.BeforeTest;
+import org.testng.annotations.Test;
+
+public class DagProcUtilsTest {
+
+  DagManagementStateStore dagManagementStateStore;
+  SpecExecutor mockSpecExecutor;
+
+  @BeforeTest
+  public void setUp() {
+    dagManagementStateStore = Mockito.mock(DagManagementStateStore.class);
+    mockSpecExecutor = new MockedSpecExecutor(Mockito.mock(Config.class));
+  }
+
+  @Test
+  public void testSubmitNextNodesSuccess() throws URISyntaxException, IOException {
+    Dag.DagId dagId = new Dag.DagId("testFlowGroup", "testFlowName", 2345678);
+    List<JobExecutionPlan> jobExecutionPlans = getJobExecutionPlans();
+    List<Dag.DagNode<JobExecutionPlan>> dagNodeList = jobExecutionPlans.stream()
+        .map(Dag.DagNode<JobExecutionPlan>::new)
+        .collect(Collectors.toList());
+    Dag<JobExecutionPlan> dag = new Dag<>(dagNodeList);
+    Mockito.doNothing().when(dagManagementStateStore).addJobDagAction(Mockito.anyString(), Mockito.anyString(), Mockito.anyLong(), Mockito.anyString(), Mockito.any());
+    DagProcUtils.submitNextNodes(dagManagementStateStore, dag, dagId);
+    for (JobExecutionPlan jobExecutionPlan : jobExecutionPlans) {
+      Mockito.verify(dagManagementStateStore, Mockito.times(1))
+          .addJobDagAction(jobExecutionPlan.getFlowGroup(), jobExecutionPlan.getFlowName(),
+              jobExecutionPlan.getFlowExecutionId(), jobExecutionPlan.getJobName(),
+              DagActionStore.DagActionType.REEVALUATE);
+    }
+

Review Comment:
   definitely an improvement on verification!  you forgot 
`Mockito.verifyNoMoreInteractions(DMSS)`
   
   (please add to other two tests too)



##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/proc/DagProcUtils.java:
##########
@@ -139,12 +138,15 @@ public static void submitJobToExecutor(DagManagementStateStore dagManagementStat
       dagManagementStateStore.updateDagNode(dagNode);
       sendEnforceJobStartDeadlineDagAction(dagManagementStateStore, dagNode);
     } catch (Exception e) {
-      TimingEvent jobFailedTimer = DagProc.eventSubmitter.getTimingEvent(TimingEvent.LauncherTimings.JOB_FAILED);
       String message = "Cannot submit job " + DagUtils.getFullyQualifiedJobName(dagNode) + " on executor " + specExecutorUri;
       log.error(message, e);
-      jobMetadata.put(TimingEvent.METADATA_MESSAGE, message + " due to " + e.getMessage());
-      if (jobFailedTimer != null) {
-        jobFailedTimer.stop(jobMetadata);
+      // Only mark the job as failed in case of non transient exceptions
+      if (!DagProcessingEngine.isTransientException(e)) {
+        TimingEvent jobFailedTimer = DagProc.eventSubmitter.getTimingEvent(TimingEvent.LauncherTimings.JOB_FAILED);
+        jobMetadata.put(TimingEvent.METADATA_MESSAGE, message + " due to " + e.getMessage());
+        if (jobFailedTimer != null) {
+          jobFailedTimer.stop(jobMetadata);
+        }
       }

Review Comment:
   if you don't rethrow here, would we ever get to the `catch` block of 
`DagProcessingEngine` L160?
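
   To make the concern concrete, here is a minimal plain-Java sketch, not Gobblin code. `RethrowSketch`, `submitSwallowing`, `submitRethrowing`, and `runEngine` are hypothetical stand-ins for the helper and the engine loop. It only illustrates that an exception caught and not rethrown in an inner method never reaches the caller's `catch` block.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; none of these names exist in the Gobblin codebase.
class RethrowSketch {
  // Records which handlers ran, so the control flow can be inspected.
  static final List<String> trace = new ArrayList<>();

  // Inner helper that handles the failure and swallows it (no rethrow).
  static void submitSwallowing() {
    try {
      throw new IllegalStateException("submit failed");
    } catch (RuntimeException e) {
      trace.add("inner caught");
      // no rethrow: the caller sees a normal return
    }
  }

  // Inner helper that handles the failure and then rethrows it.
  static void submitRethrowing() {
    try {
      throw new IllegalStateException("submit failed");
    } catch (RuntimeException e) {
      trace.add("inner caught");
      throw e;
    }
  }

  // Stand-in for the engine loop: its catch block only runs if the exception propagates.
  static List<String> runEngine(Runnable submit) {
    trace.clear();
    try {
      submit.run();
      trace.add("outer concluded normally");
    } catch (Exception e) {
      trace.add("outer caught");
    }
    return new ArrayList<>(trace);
  }

  public static void main(String[] args) {
    System.out.println(runEngine(RethrowSketch::submitSwallowing));
    System.out.println(runEngine(RethrowSketch::submitRethrowing));
  }
}
```

   In the swallowing variant the outer handler is skipped, so any non-transient handling placed there would not run for this failure.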



##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagProcessingEngine.java:
##########
@@ -149,6 +158,16 @@ public void run() {
           dagTask.conclude();
           log.info(dagProc.contextualizeStatus("concluded dagTask"));
         } catch (Exception e) {
+          log.error("DagProcEngineThread: " + dagProc.contextualizeStatus("error"), e);
+          if (!DagProcessingEngine.isTransientException(e)) {
+            DagActionStore.DagAction dagAction = dagTask.getDagAction();
+            if (dagAction != null) {

Review Comment:
   unnecessary `null` check



##########
gobblin-service/src/test/java/org/apache/gobblin/service/modules/orchestration/DagProcessingEngineTest.java:
##########
@@ -190,15 +190,20 @@ public void dagProcessingTest()
     // (MAX_NUM_OF_TASKS + 1) th call
     int expectedNumOfInvocations = MockedDagTaskStream.MAX_NUM_OF_TASKS + ServiceConfigKeys.DEFAULT_NUM_DAG_PROC_THREADS;
     int expectedExceptions = MockedDagTaskStream.MAX_NUM_OF_TASKS / MockedDagTaskStream.FAILING_DAGS_FREQUENCY;
-    int expectedNonRetryableExceptions = MockedDagTaskStream.MAX_NUM_OF_TASKS / MockedDagTaskStream.FAILING_DAGS_WITH_NON_RETRYABLE_EXCEPTIONS_FREQUENCY;
 
     AssertWithBackoff.assertTrue(input -> Mockito.mockingDetails(this.dagTaskStream).getInvocations().size() == expectedNumOfInvocations,
         10000L, "dagTaskStream was not called " + expectedNumOfInvocations + " number of times. "
             + "Actual number of invocations " + Mockito.mockingDetails(this.dagTaskStream).getInvocations().size(),
         log, 1, 1000L);
-
+    // Currently we are treating all exceptions as non retryable and totalExceptionCount will be equal to count of non retryable exceptions
     Assert.assertEquals(dagManagementStateStore.getDagManagerMetrics().dagProcessingExceptionMeter.getCount(), expectedExceptions);
-    Assert.assertEquals(dagManagementStateStore.getDagManagerMetrics().dagProcessingNonRetryableExceptionMeter.getCount(), expectedNonRetryableExceptions);
+    Assert.assertEquals(dagManagementStateStore.getDagManagerMetrics().dagProcessingNonRetryableExceptionMeter.getCount(), expectedExceptions);
+  }
+
+  @Test
+  public void isNonTransientExceptionTest(){
+    Assert.assertTrue(!DagProcessingEngine.isTransientException(new RuntimeException("Simulating a non retryable exception!")));
+    Assert.assertTrue(!DagProcessingEngine.isTransientException(new AzkabanClientException("Simulating a retryable exception!")));

Review Comment:
   could use clarifying comment that there's nothing intrinsically transient or 
not about these, and in fact it just comes down to config, which is not 
presently provided.  
   
   even better however, would be to actually pass some test config and then 
demonstrate that the impl accordingly judges some as transient and others not.  
BTW, in authoring such a test you may find that it's challenging to have 
`isTransientException` as a `static` method, and yet be configuration-driven
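
   One possible shape for a configuration-driven check, sketched in plain Java under stated assumptions: `TransientExceptionClassifier` and its constructor are hypothetical and not part of Gobblin, and the list of classes stands in for whatever the real config would supply. Making the check an instance method lets a test construct it with explicit config instead of relying on static state.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; this class is an assumption, not an existing Gobblin API.
class TransientExceptionClassifier {
  // Exception types that the (assumed) configuration declares as transient.
  private final List<Class<? extends Throwable>> transientTypes;

  TransientExceptionClassifier(List<Class<? extends Throwable>> transientTypes) {
    this.transientTypes = new ArrayList<>(transientTypes);
  }

  // An exception is transient only if it is an instance of a configured type.
  boolean isTransient(Throwable t) {
    for (Class<? extends Throwable> type : transientTypes) {
      if (type.isInstance(t)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<Class<? extends Throwable>> configured = new ArrayList<>();
    configured.add(IOException.class);
    TransientExceptionClassifier withConfig = new TransientExceptionClassifier(configured);
    // With no config, nothing is transient, matching the behavior the test above asserts.
    TransientExceptionClassifier noConfig = new TransientExceptionClassifier(new ArrayList<>());

    System.out.println(withConfig.isTransient(new FileNotFoundException("missing")));
    System.out.println(withConfig.isTransient(new RuntimeException("boom")));
    System.out.println(noConfig.isTransient(new FileNotFoundException("missing")));
  }
}
```

   A test along these lines can show the same exception judged transient under one config and non-transient under another, which is hard to arrange when the method is `static`.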





Issue Time Tracking
-------------------

    Worklog Id:     (was: 948317)
    Time Spent: 2h 20m  (was: 2h 10m)

> Non transient exception handling by flowspec removal
> ----------------------------------------------------
>
>                 Key: GOBBLIN-2181
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-2181
>             Project: Apache Gobblin
>          Issue Type: Bug
>            Reporter: Vaibhav Singhal
>            Priority: Major
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> - Many times we experience failures in flow initialization or processing due 
> to which flow could not be concluded properly
>  - Azkaban client exceptions and SQLIntegrityViolation exceptions are 
> examples which have caused failures in recent history
>  - Currently most of these failures are by default considered transient 
> exceptions and are retried infinitely
>  - As a side effect, it causes flows not to conclude and causes failures in 
> future flow submissions which have caused incidents recently
>  
>  - As a first step we want to consider all exceptions as non transient, not 
> retry them, and conclude the flow by removing the flowspec and dag action
>  - This issue tracks the changes to conclude the flow for non transient 
> exceptions and also mark them as failure to reflect the correct status of the 
> flow



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
