[jira] [Work logged] (GOBBLIN-1800) GaaS does not retry SLA killed jobs

ASF GitHub Bot (Jira) Fri, 17 Mar 2023 16:01:04 -0700


     [ 
https://issues.apache.org/jira/browse/GOBBLIN-1800?focusedWorklogId=851578&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-851578
 ]


ASF GitHub Bot logged work on GOBBLIN-1800:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 17/Mar/23 22:59
            Start Date: 17/Mar/23 22:59
    Worklog Time Spent: 10m 
      Work Description: ZihanLi58 commented on code in PR #3661:
URL: https://github.com/apache/gobblin/pull/3661#discussion_r1140794314


##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java:
##########
@@ -853,27 +852,40 @@ private boolean killJobIfOrphaned(DagNode<JobExecutionPlan> node, JobStatus jobS
             DagManagerUtils.getFullyQualifiedDagName(node),
             timeOutForJobStart);
         dagManagerMetrics.incrementCountsStartSlaExceeded(node);
-        cancelDagNode(node);
 
         String dagId = DagManagerUtils.generateDagId(node).toString();
-        this.dags.get(dagId).setFlowEvent(TimingEvent.FlowTimings.FLOW_START_DEADLINE_EXCEEDED);
+        this.dags.get(dagId).setFlowEvent(TimingEvent.FlowTimings.FLOW_CANCELLED);

Review Comment:
   This one seems like a discrepancy: if we want to retry because the job start
SLA was exceeded, we will set the dagNode to pending_retry but the flow status
to flow_cancelled?
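
   To make the concern concrete, here is a minimal, self-contained sketch. All
names in it (`SlaRetrySketch`, `JobState`, `FlowEvent`, `jobStateAfterSlaKill`,
`flowEventFor`) are hypothetical and are not Gobblin APIs; it only illustrates
one way the flow-level event could be derived from the job-level state so that
the two cannot disagree. It is not a claim about how the PR should be written.

```java
// Hypothetical sketch: none of these names exist in Gobblin.
class SlaRetrySketch {
    /** Hypothetical job-level states after an SLA kill. */
    enum JobState { PENDING_RETRY, CANCELLED }

    /** Hypothetical flow-level events. */
    enum FlowEvent { FLOW_PENDING_RETRY, FLOW_CANCELLED }

    /** A job killed for exceeding an SLA is retried only while attempts remain. */
    static JobState jobStateAfterSlaKill(int attemptsSoFar, int maxAttempts) {
        return attemptsSoFar < maxAttempts ? JobState.PENDING_RETRY : JobState.CANCELLED;
    }

    /** Derive the flow event from the job state so the two stay consistent. */
    static FlowEvent flowEventFor(JobState jobState) {
        return jobState == JobState.PENDING_RETRY
            ? FlowEvent.FLOW_PENDING_RETRY
            : FlowEvent.FLOW_CANCELLED;
    }

    public static void main(String[] args) {
        JobState first = jobStateAfterSlaKill(1, 3); // attempts remain
        JobState last = jobStateAfterSlaKill(3, 3);  // attempts exhausted
        System.out.println(first + " -> " + flowEventFor(first));
        System.out.println(last + " -> " + flowEventFor(last));
    }
}
```

   In this sketch the flow is marked cancelled only once the retry budget is
exhausted, which is the consistency the question above is asking about.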



##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java:
##########
@@ -911,9 +923,8 @@ private boolean slaKillIfNeeded(DagNode<JobExecutionPlan> node) throws Execution
             node.getValue().getJobSpec().getConfig().getString(ConfigurationKeys.FLOW_NAME_KEY), flowSla,
             node.getValue().getJobSpec().getConfig().getString(ConfigurationKeys.JOB_NAME_KEY));
         dagManagerMetrics.incrementExecutorSlaExceeded(node);
-        cancelDagNode(node);
 
-        this.dags.get(dagId).setFlowEvent(TimingEvent.FlowTimings.FLOW_RUN_DEADLINE_EXCEEDED);
+        this.dags.get(dagId).setFlowEvent(TimingEvent.FlowTimings.FLOW_CANCELLED);

Review Comment:
   Same here





Issue Time Tracking
-------------------

    Worklog Id:     (was: 851578)
    Time Spent: 50m  (was: 40m)

> GaaS does not retry SLA killed jobs
> -----------------------------------
>
>                 Key: GOBBLIN-1800
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1800
>             Project: Apache Gobblin
>          Issue Type: Bug
>          Components: gobblin-service
>            Reporter: William Lo
>            Assignee: Abhishek Tiwari
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Gobblin-as-a-Service fails jobs when they run past their start SLA or their 
> runtime SLA. Jobs killed for exceeding these SLAs would be expected to be 
> retried if retries are configured, but currently they are not.
> The DagManager should automatically retry jobs that exceed their SLAs if the 
> user configured retries, in case these flow failures are due to intermittent 
> issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
